Description :

About the Role :

We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, and optimize large-scale, distributed, and high-availability systems.

The ideal candidate will bring strong expertise in cloud-native infrastructure, automation, performance engineering, and reliability best practices to keep mission-critical platforms highly available and able to scale.

The SRE will collaborate closely with engineering, DevOps, platform, and security teams to drive reliability-focused engineering and operational excellence.

Key Responsibilities :

Reliability & Performance :

- Design, implement, and maintain highly available, fault-tolerant, and scalable infrastructure across cloud and hybrid environments.

- Establish and optimize SLIs, SLOs, and error budgets, ensuring systems meet performance and stability targets.

- Perform proactive capacity planning, performance tuning, and root-cause analysis for complex production issues.

Automation & Scalability :

- Develop automation frameworks using Python, Go, or Shell to eliminate manual tasks and improve operational efficiency.

- Automate deployment pipelines, monitoring alerts, and health checks to optimize system reliability and resilience.

- Build self-healing mechanisms and automated rollback strategies for mission-critical environments.

Cloud & Infrastructure Engineering :

- Architect, deploy, and maintain infrastructure on AWS, Azure, or GCP using IaC tools like Terraform, CloudFormation, or Pulumi.

- Manage containerized workloads using Kubernetes, Docker, and cloud-native orchestration services.

- Implement scalable networking, storage, and compute configurations aligned with best practices.

Monitoring, Observability & Incident Management :

- Design and maintain advanced monitoring, logging, and tracing solutions using tools like Prometheus, Grafana, ELK/EFK stack, OpenTelemetry, Datadog, or Splunk.

- Lead incident response, postmortems, on-call rotations, and continuous improvement of incident management processes.

- Implement anomaly detection, predictive alerting, and observability dashboards.

Security & Compliance :

- Work with security teams to embed DevSecOps practices in CI/CD workflows.

- Implement infrastructure hardening, secrets management, policy enforcement, and compliance reporting (SOC 2, ISO 27001, etc.).

- Ensure systems are protected against vulnerabilities through regular patching and proactive risk assessment.

Required Skills & Experience :

- Strong experience in Linux/Unix systems engineering, networking fundamentals, and distributed systems.

- Hands-on expertise with cloud platforms (AWS/Azure/GCP).

- Advanced scripting/programming skills in Python, Go, Shell, or Ruby.

- Deep understanding of Kubernetes, containers, Helm, and service mesh (Istio/Linkerd).

- Experience implementing IaC (Terraform preferred).

- Strong knowledge of CI/CD pipelines (GitLab CI, Jenkins, GitHub Actions, ArgoCD, Spinnaker).

- Proven experience in building robust monitoring and observability frameworks.

- Solid understanding of system reliability, performance tuning, scalability strategies, and incident management.