Description :

Role Overview :

We are looking for an experienced SRE Lead to drive reliability, scalability, and operational excellence across our platforms. The ideal candidate will combine strong engineering skills with an SRE mindset, leading incident response, defining reliability metrics, and partnering with development teams to build highly available and resilient systems through automation and best practices.

Key Responsibilities :

- Lead incident management and response, including root cause analysis (RCA), post-incident reviews, and continuous improvement actions.

- Define, implement, and govern Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to balance reliability and feature velocity.

- Drive toil reduction by identifying repetitive operational tasks and implementing automation and self-healing solutions.

- Design, implement, and maintain observability platforms covering monitoring, logging, alerting, and tracing, using tools such as Prometheus and Grafana and their surrounding ecosystems.

- Collaborate closely with application development, platform, and architecture teams to embed reliability, scalability, and performance into system design from early stages.

- Support and enhance CI/CD pipelines to ensure safe, automated, and reliable deployments.

- Implement and manage Infrastructure as Code (IaC) using tools such as Terraform, ensuring consistency and repeatability across environments.

- Provide technical leadership, mentoring, and guidance to SRE and DevOps engineers.

- Establish and enforce operational best practices, including capacity planning, change management, and disaster recovery strategies.

Required Skills & Experience :

- 5-8 years of hands-on experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.

- Strong hands-on expertise in monitoring, logging, and alerting systems, including building and operating observability stacks using Prometheus, Grafana, and similar tools.

- Deep experience with Kubernetes (production-grade clusters) and containerized workloads.

- Solid experience designing and maintaining CI/CD pipelines using Azure DevOps, Git, and related tooling.

- Strong proficiency in Infrastructure as Code, particularly with Terraform, for managing cloud and platform infrastructure.

- Hands-on experience with Linux system administration and troubleshooting in production environments.

- Working knowledge of configuration management and automation tools such as Puppet and/or Ansible.

- Strong understanding of cloud-native architectures, scalability, resilience, and security best practices.

- Excellent problem-solving skills with the ability to remain calm and decisive during high-severity incidents.

- Strong communication and stakeholder management skills, with the ability to influence engineering teams toward reliability-first practices.

Nice to Have :

- Experience with cloud platforms (Azure, AWS, or GCP).

- Exposure to distributed systems design and microservices architectures.

- Experience implementing self-healing systems and advanced alerting strategies.

- SRE or cloud certifications such as CKA, CKAD, or Microsoft Certified: DevOps Engineer Expert.