Description:

Site Reliability Engineer (SRE) - Azure/AKS Lead

Role Overview:

This is a senior technical leadership role for a Site Reliability Engineer (SRE) requiring 10+ years of experience, focused on owning and driving reliability for mission-critical, high-scale services deployed on Microsoft Azure.

The role requires prior experience as a DevOps Engineer followed by a move into a dedicated SRE function. The candidate must have expert knowledge of Azure, AKS (Azure Kubernetes Service), and modern reliability practices, including defining and enforcing SLIs/SLOs.

Based in Trivandrum, this SRE will shape technical standards, lead major incident response, and champion engineering excellence across multiple development teams.

Job Summary:

We are seeking an experienced SRE Lead (10+ years) with a strong background in Azure and AKS to ensure the highest levels of availability, performance, and scalability for our Tier-0/Tier-1 services.

This role is responsible for establishing and maintaining core SRE practices, including defining error budgets, implementing multi-burn-rate alerting, driving continuous automation (Terraform/GitOps), and leading critical incident response with calm clarity. Expertise in observability, disaster recovery design (RTO/RPO), and cluster hardening is mandatory.

Key Responsibilities and Reliability Engineering Deliverables:

- Service Level Management: Define SLIs/SLOs for Tier-0/Tier-1 services and conduct quarterly reviews. Implement multi-window, multi-burn-rate alerts to precisely detect evolving service degradation.

- Error Budgeting and Change Gating: Enforce reliability constraints by gating changes in CI/CD based on error budgets (using tools such as Azure DevOps or GitHub Actions). Conduct weekly SLO reviews and drive the reliability roadmap.

- Incident Management Command: Lead SEV1/SEV2 incidents as the Incident Commander, taking ownership of rapid resolution, clear communication, and postmortems. Ensure all corrective actions are implemented effectively.

- Reliability Architecture & Kubernetes: Design and implement robust reliability patterns including DR (Disaster Recovery), multi-AZ/region configurations, HPA/VPA/KEDA for optimized scaling, and resilient deployment strategies like canary, blue-green, and rollback.

- Cluster Hardening & Optimization: Drive cluster hardening initiatives (network, identity, policy). Optimize resource utilization and service density. Manage ingress traffic using AGIC or NGINX.

- Observability Implementation: Implement comprehensive observability solutions covering metrics, traces, and logs via Azure Monitor, App Insights, Log Analytics, Prometheus, Grafana, and OpenTelemetry. Ensure alerts fire on symptoms, not noise.

- Automation and Infrastructure as Code (IaC): Automate platform provisioning using Terraform or Bicep. Apply GitOps principles (Flux/Argo) to deployment management and enforce compliance using Azure Policy or OPA Gatekeeper. Automate toil and build self-service runbooks and ChatOps workflows.

- Performance & Capacity Planning: Conduct rigorous load testing. Tune platform autoscaling strategies and collaborate with FinOps to optimize cloud cost.

- Disaster Recovery and Testing: Define RTO/RPO objectives. Validate resilience against those objectives by running regular chaos drills and game days.

- Security and Governance: Implement security best practices using Entra ID (formerly Azure AD), Key Vault rotation, and VNets/NSGs, and drive shift-left security practices within the CI pipeline.
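
To give candidates a feel for the service level management and change gating work listed above, here is a minimal Python sketch of a two-window burn-rate check and an error-budget gate. It is an illustration under stated assumptions, not our production tooling: the function names, the 14.4 burn-rate threshold, and the 10% minimum remaining budget are all hypothetical values chosen for the example.

```python
# Illustrative sketch only. Function names, thresholds, and sample values are
# assumptions made for this example, not values prescribed by the role.


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being spent (1.0 = exactly on budget)."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when BOTH the long and the short window burn too fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so a recovered incident stops paging.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


def budget_remaining(observed_error_ratio: float, slo_target: float) -> float:
    """Fraction of the error budget left over the SLO window (may be negative)."""
    return 1.0 - burn_rate(observed_error_ratio, slo_target)


def change_allowed(observed_error_ratio: float, slo_target: float,
                   min_remaining: float = 0.10) -> bool:
    """Gate a deployment: allow it only if enough error budget remains."""
    return budget_remaining(observed_error_ratio, slo_target) > min_remaining


if __name__ == "__main__":
    slo = 0.999  # 99.9% availability target, so the error budget is 0.1%

    print(should_page(0.02, 0.03, slo))     # True: both windows burn fast
    print(should_page(0.02, 0.0005, slo))   # False: short window has recovered
    print(change_allowed(0.0005, slo))      # True: about half the budget is left
    print(change_allowed(0.002, slo))       # False: the budget is overspent
```

In a CI/CD pipeline, a check like `change_allowed` would typically run as a pre-deployment step fed by error ratios queried from the monitoring stack; how those ratios are obtained is outside the scope of this sketch.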

Mandatory Skills & Qualifications:

- Experience: 10+ years of professional experience in Site Reliability or DevOps. Must have previously worked as a DevOps Engineer and currently be working as an SRE.

- Cloud Platform: Strong experience in Azure.

- Container Orchestration: Strong experience with AKS (Azure Kubernetes Service) and experience working with Docker.

- Database: Experience working with PostgreSQL (or similar enterprise-grade databases).

- Observability: Strong experience with observability practices and tools (e.g., Azure Monitor, Grafana, Prometheus, App Insights).

- IaC & Automation: Hands-on expertise with Terraform / Bicep and GitOps principles.

Preferred Skills:

- Deep familiarity with Entra ID, Azure Policy, and Key Vault security integration.

- Experience implementing OpenTelemetry standards for distributed tracing.

- Certifications related to Azure or Kubernetes (e.g., Azure Administrator, CKA/CKAD).