Posted on: 19/11/2025
Description:
Site Reliability Engineer (SRE) - Azure/AKS Lead
Role Overview:
This is a senior technical leadership role for a Site Reliability Engineer (SRE) requiring 10+ years of experience, focused on owning and driving reliability for mission-critical, high-scale services deployed on Microsoft Azure.
Candidates must have previously worked as a DevOps Engineer and since moved into a dedicated SRE function. The incumbent must possess expert knowledge of Azure, AKS (Azure Kubernetes Service), and modern reliability practices, including defining and enforcing SLIs/SLOs.
Based in Trivandrum, this SRE will shape technical standards, lead major incident response, and champion engineering excellence across multiple development teams.
Job Summary:
We are seeking an experienced SRE Lead (10+ years) with a strong background in Azure and AKS to ensure the highest levels of availability, performance, and scalability for our Tier-0/Tier-1 services.
This role is responsible for establishing and maintaining core SRE practices, including defining error budgets, implementing multi-burn-rate alerting, driving continuous automation (Terraform/GitOps), and leading critical incident response with calm clarity. Expertise in observability, disaster recovery design (RTO/RPO), and cluster hardening is mandatory.
Key Responsibilities and Reliability Engineering Deliverables:
- Service Level Management: Define SLIs/SLOs for Tier-0/Tier-1 services and conduct quarterly reviews. Implement multi-window, multi-burn-rate alerts to detect service degradation early and accurately.
- Error Budgeting and Change Gating: Enforce reliability constraints by implementing change gating in CI/CD based on error budgets (using tools like Azure DevOps/GitHub Actions). Conduct weekly SLO reviews and drive the reliability roadmap.
- Incident Management Command: Lead SEV1/SEV2 incidents as the Incident Commander, taking ownership of rapid resolution, clear communication, and postmortems. Ensure all corrective actions are implemented effectively.
- Reliability Architecture & Kubernetes: Design and implement robust reliability patterns including DR (Disaster Recovery), multi-AZ/region configurations, HPA/VPA/KEDA for optimized scaling, and resilient deployment strategies like canary, blue-green, and rollback.
- Cluster Hardening & Optimization: Drive cluster hardening initiatives (network, identity, policy). Optimize resource utilization and service density. Manage ingress traffic using AGIC/NGINX.
- Observability Implementation: Implement comprehensive observability solutions covering metrics, traces, and logs via Azure Monitor, App Insights, Log Analytics, Prometheus, Grafana, and OpenTelemetry. Ensure alerts fire on symptoms, not noise.
- Automation and Infrastructure as Code (IaC): Automate platform provisioning using Terraform/Bicep. Apply GitOps principles (Flux/Argo) to deployment management and enforce compliance using Azure Policy/OPA Gatekeeper. Automate toil and build self-service runbooks and ChatOps.
- Performance & Capacity Planning: Conduct rigorous load testing. Optimize platform autoscaling strategies and collaborate with FinOps to optimize cloud cost.
- Disaster Recovery and Testing: Define RTO/RPO objectives. Validate resilience against them by running regular chaos drills and game days.
- Security and Governance: Implement security best practices using Entra ID (Azure AD), Key Vault rotation, and VNets/NSGs, and drive shift-left security practices within the CI pipeline.
Mandatory Skills & Qualifications:
- Experience: 10+ years of professional experience in Site Reliability or DevOps. Must have previously worked as a DevOps Engineer and currently be working as an SRE.
- Cloud Platform: Strong experience in Azure.
- Container Orchestration: Strong experience with AKS (Azure Kubernetes Service) and experience working with Docker.
- Database: Experience working with PostgreSQL (or similar enterprise-grade databases).
- Observability: Strong experience with observability practices and tools (e.g., Azure Monitor, Grafana, Prometheus, App Insights).
- IaC & Automation: Hands-on expertise with Terraform / Bicep and GitOps principles.
Preferred Skills:
- Deep familiarity with Entra ID, Azure Policy, and Key Vault security integration.
- Experience implementing OpenTelemetry standards for distributed tracing.
- Certifications related to Azure or Kubernetes (e.g., Azure Administrator, CKA/CKAD).
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1577668