hirist

Site Reliability Engineer - IAC Terraform

iSprout - Managed Office Space
Kolkata
4 - 7 Years

Posted on: 29/01/2026

Job Description

Description:


- Ensuring high availability, performance, and scalability of cloud infrastructure through proactive monitoring, automation, and continuous improvement.

- Designing and maintaining resilient Azure-based infrastructure using IaC (Terraform).

- Implementing end-to-end observability with telemetry, CUJ-level metrics, dashboards, alerts, and real-time performance insights.

- Monitoring Critical User Journeys with product and business teams to maintain a reliable user experience.

- Conducting load testing, capacity planning, and performance tuning to prepare systems for traffic growth and spikes.

- Managing SLIs, SLOs, SLAs, and error budgets across critical services.

- Implementing next-generation cloud reliability and fault-tolerance solutions, including disaster recovery improvements.

- Identifying risks and preventing service disruptions through proactive reliability engineering.

- Automating deployments, scaling, failover, and remediation to reduce manual toil and operational bottlenecks.

- Leading incident response, participating in on-call rotations, conducting root cause analysis, and delivering blameless post-mortems.

- Creating and maintaining runbooks, documentation, and operational guidelines.

- Collaborating with engineering and global teams on reliability best practices; mentoring junior SREs and supporting SRE hiring.

Must-haves:


- 6+ years of experience as an SRE in cloud and infrastructure teams.

- At least 5 years of extensive experience with Microsoft Azure cloud services and infrastructure management.

- Strong technical background with solid knowledge of software development principles, application production support, SDLC best practices, and Agile methodology.

- Hands-on SRE experience with a strong understanding of SLOs, SLIs, error budgets, and incident management, including conducting blameless post-mortems.

- Strong ability to analyze and understand application architectures and identify areas for improvement.

- Experience working with monitoring, logging, and observability tools to assess and improve application performance.

- Proficiency in scripting and automation tools, including Python, Bash, and Terraform, to reduce toil and enhance operational efficiency.

- Strong incident response and troubleshooting skills with the ability to perform effective root cause analysis.

- Excellent communication and collaboration skills for working with cross-functional teams and clearly explaining technical concepts.

- Ability to coach and mentor team members in SRE practices and foster a culture of reliability.

- Practical experience applying Agile development practices and working in Agile teams.

- Proactive mindset focused on continuous improvement to increase system reliability and performance.

