Posted on: 03/12/2025
Description :
About the Role :
We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, and optimize large-scale, distributed, and high-availability systems.
The ideal candidate will bring strong expertise in cloud-native infrastructure, automation, performance engineering, and reliability best practices to ensure seamless availability and scalable operations across mission-critical platforms.
The SRE will collaborate closely with engineering, DevOps, platform, and security teams to drive reliability-focused engineering and operational excellence.
Key Responsibilities :
Reliability & Performance :
- Design, implement, and maintain highly available, fault-tolerant, and scalable infrastructure across cloud and hybrid environments.
- Establish and optimize SLIs, SLOs, and error budgets, ensuring systems meet performance and stability targets.
- Perform proactive capacity planning, performance tuning, and root-cause analysis for complex production issues.
Automation & Scalability :
- Develop automation frameworks using Python, Go, or Shell to eliminate manual tasks and improve operational efficiency.
- Automate deployment pipelines, monitoring alerts, and health checks to optimize system reliability and resilience.
- Build self-healing mechanisms and automated rollback strategies for mission-critical environments.
Cloud & Infrastructure Engineering :
- Architect, deploy, and maintain infrastructure on AWS, Azure, or GCP using IaC tools like Terraform, CloudFormation, or Pulumi.
- Manage containerized workloads using Kubernetes, Docker, and cloud-native orchestration services.
- Implement scalable networking, storage, and compute configurations aligned with best practices.
Monitoring, Observability & Incident Management :
- Design and maintain advanced monitoring, logging, and tracing solutions using tools like Prometheus, Grafana, ELK/EFK stack, OpenTelemetry, Datadog, or Splunk.
- Lead incident response, postmortems, on-call rotations, and continuous improvement of incident management processes.
- Implement anomaly detection, predictive alerting, and observability dashboards.
Security & Compliance :
- Work with security teams to embed DevSecOps practices in CI/CD workflows.
- Implement infrastructure hardening, secrets management, policy enforcement, and compliance reporting (SOC 2, ISO 27001, etc.).
- Ensure systems are protected against vulnerabilities through regular patching and proactive risk assessment.
Required Skills & Experience :
- Strong experience in Linux/Unix systems engineering, networking fundamentals, and distributed systems.
- Hands-on expertise with cloud platforms (AWS/Azure/GCP).
- Advanced scripting/programming skills in Python, Go, Shell, or Ruby.
- Deep understanding of Kubernetes, containers, Helm, and service mesh (Istio/Linkerd).
- Experience implementing IaC (Terraform preferred).
- Strong knowledge of CI/CD pipelines (GitLab CI, Jenkins, GitHub Actions, ArgoCD, Spinnaker).
- Proven experience in building robust monitoring and observability frameworks.
- Solid understanding of system reliability, performance tuning, scalability strategies, and incident management.
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1584244