Posted on: 27/11/2025
Description :
Role : Senior Site Reliability Engineering (SRE) Lead
Role Overview :
The Senior Site Reliability Engineering (SRE) Lead is a crucial, expert-level technical position requiring 15+ years of progressive experience in SRE, DevOps, and high-scale application support.
This role is central to ensuring the optimal performance, scalability, and reliability of critical applications by establishing and enforcing SRE best practices across engineering domains.
The incumbent is responsible for leading SRE adoption, architecting automated solutions, serving as an Incident Commander, and providing technical mentorship to SRE engineers.
Job Summary :
We are seeking a Principal-level Senior SRE Lead (15+ years of experience) with mandatory expertise in designing and maintaining ultra-reliable, large-scale distributed systems. The ideal candidate will be technically proficient in cloud automation (Terraform/Ansible), scripting (Python/Bash), and advanced observability platforms (Prometheus, Grafana, ELK Stack). Key responsibilities include defining and enforcing strict Service Level Objectives (SLOs) and Error Budgets, leading complex incident response through a structured Incident Command System, eliminating toil through proactive automation, and driving cultural transformation toward infrastructure-as-code and reliability engineering principles.
Key Responsibilities and Technical Deliverables :
SRE Adoption & Observability Engineering :
- Lead SRE adoption by guiding development and infrastructure teams on best practices for availability, latency, and performance optimization.
- Define, measure, and enforce strict Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets for key user journeys and application services.
Required Skills :
- Experience : 15+ years of experience in SRE, DevOps, and high-scale application support.
- Technical Proficiency : Expert in SRE concepts (SLOs, Error Budgets) and advanced monitoring tools (Prometheus, Grafana, ELK Stack, Distributed Tracing).
- Automation Expertise : Mastery of Python and Bash scripting and mandatory hands-on experience with Terraform/Ansible for IaC.
- Problem-solving : Demonstrated ability to analyze application performance metrics (latency, throughput, utilization) and troubleshoot distributed systems issues under pressure.
- Methodology : Deep knowledge of Agile practices and proven ability to lead and mentor teams on SRE principles.
Preferred Skills :
- Certification in a major cloud platform (AWS/Azure/GCP) or Kubernetes (CKA/CKS).
- Direct experience with performance testing and chaos engineering principles.
- Experience integrating security compliance and vulnerability scanning into CI/CD pipelines.
- Demonstrated ability to conduct formal, blameless postmortems.
Posted By
Mrinmoyee Roy Chowdhury
Talent Acquisition Lead at CAPITALNUMBERS INFOTECH LIMITED
Last Active: 28 Nov 2025
Posted in
DevOps / SRE
Functional Area
DevOps / Cloud
Job Code
1581510