Posted on: 27/11/2025
Job Title : Lead SRE
Job Description :
- Team Leadership : Manage and mentor a team of SREs, assigning tasks, providing technical guidance, and fostering a culture of collaboration and continuous learning.
- System Design and Implementation : Lead the design and implementation of reliable, scalable, and fault-tolerant systems, including infrastructure, monitoring, and alerting.
- Incident Management : Manage incident response processes, including root cause analysis, post-mortem reviews, and proactive mitigation strategies to minimize system downtime and impact.
- Monitoring & Alerting : Develop and maintain comprehensive monitoring systems to identify potential issues early, set appropriate alerting thresholds, and optimize system performance.
- Automation & Tooling : Drive automation initiatives to streamline operational tasks, including deployments, scaling, and configuration management, utilizing relevant tools and technologies.
- Capacity Planning : Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under load.
- Performance Optimization : Analyze system metrics to identify bottlenecks, implement performance improvements, and optimize resource utilization.
- Collaboration : Work closely with development teams, product managers, and other stakeholders to ensure alignment on reliability goals and smooth integration of new features.
- Technical Strategy : Develop and implement the SRE roadmap, including technology adoption, standards, and best practices to maintain a high level of system reliability.
Technical Skills and Experience :
- Technical Expertise : Strong proficiency in system administration, cloud computing (AWS, Azure), networking, distributed systems, and containerization technologies (Docker, Kubernetes).
- Programming Skills : Expertise in scripting languages (Python, Bash) and the ability to develop automation tools. A basic understanding of Java is good to have.
- Monitoring & Alerting : Deep understanding of monitoring systems (Prometheus, Grafana), alerting configurations, and log analysis.
- Incident Management : Proven experience in managing critical incidents, performing root cause analysis, and coordinating response efforts.
- Leadership & Communication : Excellent communication skills to convey technical concepts to both technical and non-technical audiences, and the ability to lead and motivate a team.
- Problem-Solving : Strong analytical and troubleshooting skills to identify and resolve complex technical issues.
Experience Range : 7 - 10 years
Educational Qualifications : B.Tech/B.E
Skills Required : DevOps, AWS, Docker, Kubernetes, Grafana
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1581440