The Senior Site Reliability Engineering (SRE) Lead is a crucial, expert-level technical position requiring 15+ years of progressive experience in SRE, DevOps, and high-scale application support.

This role is central to ensuring the optimal performance, scalability, and reliability of critical applications by establishing and enforcing SRE best practices across engineering domains.

The incumbent is responsible for leading SRE adoption, architecting automated solutions, serving as an Incident Commander, and providing technical mentorship to SRE engineers.

Job Summary :

We are seeking a Principal-level Senior SRE Lead (15+ years experience) with mandatory expertise in designing and maintaining ultra-reliable, large-scale distributed systems. The ideal candidate will be technically proficient in cloud automation (Terraform/Ansible), scripting (Python/Bash), and advanced observability platforms (Prometheus, Grafana, ELK Stack). Key responsibilities include defining and enforcing strict Service Level Objectives (SLOs) and Error Budgets, leading complex incident response via a structured Incident Command System, eliminating toil through proactive automation, and driving cultural transformation toward infrastructure-as-code and reliability engineering principles.

Key Responsibilities and Technical Deliverables :

SRE Adoption & Observability Engineering :

- Lead SRE Adoption by guiding development and infrastructure teams on best practices for availability, latency, and performance optimization.
- Define, measure, and enforce strict Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets for key user journeys and application services.

- Monitor System Performance by architecting and maintaining the advanced observability stack (e.g., Prometheus, Grafana, distributed tracing) to ensure system reliability and proactive risk identification.

- Implement advanced alerting methodologies (e.g., alerting on the four golden signals: latency, traffic, errors, and saturation) to detect anomalous behavior and minimize false positives.
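
As a rough illustration of the SLO and Error Budget arithmetic referenced above, the following minimal Python sketch computes how much of an error budget remains over a measurement window. The function name, parameters, and the request-based SLI are assumptions chosen for the example, not a prescribed implementation.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    Assumes a request-based availability SLI: good requests / total requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 99.9% SLO, 1,000,000 requests, 250 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

A team might gate risky releases on this value, for example pausing feature rollouts once the remaining budget approaches zero.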

Automation, Toil Reduction, and Tooling :

- Drive Automation & Process Improvement by eliminating repetitive manual work (toil) through extensive scripting in Python and Bash.

- Utilize Infrastructure-as-Code (IaC) tools like Terraform or Ansible to manage cloud infrastructure components and ensure environment consistency and immutability.

- Design and implement self-healing mechanisms and automated remediation procedures to enhance system resilience.

Incident Management and Root Cause Analysis (RCA) :

- Lead efforts during critical outages, serving as the Incident Commander to coordinate rapid response, mitigation, and recovery.

- Conduct rigorous, blameless Root Cause Analysis (RCA) using 5 Whys and timeline analysis, translating postmortem findings into prioritized engineering action items to prevent recurrence.

- Develop and implement automated responses to common incident patterns, reducing Mean Time To Detect (MTTD) and Mean Time To Recovery (MTTR).
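
For context on how MTTD and MTTR are typically tracked, here is a small Python sketch that derives both from incident timestamps. The `Incident` record and its field names are hypothetical, and the sketch assumes MTTR is measured from the start of the incident to recovery; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; real incident tooling will use different fields.
    started: datetime
    detected: datetime
    resolved: datetime


def _mean(deltas: list[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between incident start and detection."""
    return _mean([i.detected - i.started for i in incidents])


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average gap between incident start and resolution (assumed definition)."""
    return _mean([i.resolved - i.started for i in incidents])


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    incidents = [
        Incident(day.replace(hour=10), day.replace(hour=10, minute=5), day.replace(hour=10, minute=45)),
        Incident(day.replace(hour=12), day.replace(hour=12, minute=15), day.replace(hour=12, minute=35)),
    ]
    print("MTTD:", mean_time_to_detect(incidents))
    print("MTTR:", mean_time_to_recovery(incidents))
```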

Collaboration and Mentorship :

- Collaborate closely with cross-functional teams (Software Development, QA, Product) to influence architecture and improve application performance from a reliability perspective.

- Mentor junior SRE engineers on technical skills, incident response protocol, and reliability principles, fostering a strong culture of accountability and continuous learning.

- Apply Agile experience to integrate SRE practices seamlessly into the continuous delivery lifecycle and development sprints.

Mandatory Skills & Qualifications :

- Experience : 15+ years in SRE, DevOps, and high-scale application support.

- Technical Proficiency : Expert in SRE concepts (SLOs, Error Budgets) and advanced monitoring tools (Prometheus, Grafana, ELK Stack, Distributed Tracing).

- Automation Expertise : Mastery of Python and Bash scripting and mandatory hands-on experience with Terraform/Ansible for IaC.

- Problem-solving : Demonstrated ability to analyze application performance metrics (latency, throughput, utilization) and troubleshoot distributed systems issues under pressure.

- Methodology : Deep knowledge of Agile practices and proven ability to lead and mentor teams on SRE principles.

Preferred Skills :

- Certification in a major cloud platform (AWS/Azure/GCP) or Kubernetes (CKA/CKS).

- Direct experience with performance testing and chaos engineering principles.

- Experience integrating security compliance and vulnerability scanning into CI/CD pipelines.

- Demonstrated ability to conduct formal, blameless postmortems.