Job Description

Responsibilities:


- 7+ years of experience in SRE, DevOps, or infrastructure management.

- Lead the NOC/SRE team from the front, ensuring a culture of proactive monitoring, rapid response, and continuous improvement.

- Act as the primary escalation point for major incidents, providing technical guidance and decision-making.

- Collaborate with DevOps, Engineering, and Product teams to enhance system reliability.

- Define best practices, incident response protocols, and runbooks for the team.

- Lead log tracing and deep troubleshooting for infrastructure, network, and application issues.

- Reduce MTTR (Mean Time to Resolution) and improve incident management processes.

- Expertise in troubleshooting complex infrastructure and application issues.

- Strong knowledge of log tracing, distributed tracing, and observability tools (e.g., ELK, Splunk, Grafana, Prometheus, OpenTelemetry).

- Deep understanding of SLAs, SLOs, and error budgets.

- Experience with cloud platforms (AWS, Azure, GCP) and with containers and container orchestration (Docker, Kubernetes).

- Good knowledge of Terraform, Kubernetes, Docker, and cloud architectures.

- Proficiency in monitoring and observability tools (New Relic, Prometheus, Datadog, etc.).

- Understanding of CI/CD pipelines, automation, and infrastructure as code (IaC).

- Basic scripting skills in Python, Go, Shell, or similar.

- Strong troubleshooting skills for complex distributed systems.

- Ability to mentor junior engineers and drive SRE best practices.

- Willingness to work primarily between 3:30 PM and 3:30 AM IST, with flexibility to adjust shifts based on operational requirements.

- Strong problem-solving skills and ability to work in a fast-paced environment.

- Strong incident management, troubleshooting, and RCA skills.

Qualifications:


- 6+ years of experience in Site Reliability Engineering (SRE) / NOC / DevOps roles.

- Proven leadership experience, managing or mentoring a team.

- Hands-on experience with Terraform for Infrastructure as Code (IaC).

- Experience in Python for automation and scripting.

- Expertise in troubleshooting complex infrastructure and application issues.

- Strong knowledge of log tracing, distributed tracing, and observability tools (e.g., ELK, Splunk, Grafana, Prometheus, OpenTelemetry).

- Deep understanding of SLAs, SLOs, and error budgets.

- Experience with cloud platforms (AWS, Azure, GCP) and with containers and container orchestration (Docker, Kubernetes).

- Familiarity with CI/CD pipelines and GitOps practices.

- Strong problem-solving skills and the ability to make quick, data-driven decisions under pressure.

