Posted on: 13/11/2025
Responsibilities :
- Lead the NOC/SRE team from the front, fostering a culture of proactive monitoring, rapid response, and continuous improvement.
- Act as the primary escalation point for major incidents, providing technical guidance and decision-making.
- Collaborate with DevOps, Engineering, and Product teams to enhance system reliability.
- Define best practices, incident response protocols, and runbooks for the team.
- Lead log tracing and deep troubleshooting for infrastructure, network, and application issues.
- Reduce MTTR (Mean Time to Resolution) and improve incident management processes.
- Good knowledge of Terraform, Kubernetes, Docker, and cloud architectures.
- Proficiency in monitoring and observability tools (New Relic, Prometheus, Datadog, etc.).
- Understanding of CI/CD pipelines, automation, and infrastructure as code (IaC).
- Basic scripting skills in Python, Go, Shell, or similar.
- Strong troubleshooting skills for complex distributed systems.
- Ability to mentor junior engineers and drive SRE best practices.
- Willingness to work primarily between 3:30 PM and 3:30 AM IST, with flexibility to adjust shifts based on operational requirements.
- Strong problem-solving skills and ability to work in a fast-paced environment.
- Strong incident management, troubleshooting, and root cause analysis (RCA) skills.
Qualifications :
- Proven leadership experience in managing or mentoring a team.
- Hands-on experience with Terraform for Infrastructure as Code (IaC).
- Experience in Python for automation and scripting.
- Expertise in troubleshooting complex infrastructure and application issues.
- Strong knowledge of log tracing, distributed tracing, and observability tools (e.g., ELK, Splunk, Grafana, Prometheus, OpenTelemetry).
- Deep understanding of SLAs, SLOs, and error budgets.
- Experience with cloud platforms (AWS, Azure, GCP), containers (Docker), and container orchestration (Kubernetes).
- Familiarity with CI/CD pipelines and GitOps practices.
- Strong problem-solving skills and the ability to make quick, data-driven decisions under pressure.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1573579