Posted on: 06/12/2025
Description :
Job Title : Senior Specialist Cloud SRE
Education : Bachelor's Degree
Experience : 8+ years
Location : Mumbai
As a Senior Specialist Cloud SRE, you will be responsible for ensuring the reliability, scalability, performance, and cost optimization of cloud services across AWS, Azure, and multi-cloud environments.
You will act as the primary technical lead for assigned customers, manage incident escalations, drive automation-first practices, and mentor junior engineers.
You will also collaborate closely with development teams to embed resilience and observability into applications.
Key Responsibilities :
Customer Leadership & Collaboration :
- Serve as the primary technical point of contact for assigned customer accounts.
- Provide regular updates and lead initiatives to improve customer environments.
- Know assigned accounts well enough to make tactical decisions without escalation.
- Collaborate with customer development teams to align infrastructure with application requirements.
Incident & Problem Management :
- Lead incident response and postmortems, ensuring corrective and preventive measures are identified and implemented.
- Serve as the Tier 3 escalation point for offshore/onshore SRE teams.
- Perform Root Cause Analysis (RCA) and validate the work quality of Tier 2 engineers.
- Develop and maintain incident response plans for security breaches and operational incidents.
Reliability Engineering :
- Define and maintain SLIs/SLOs, track error budgets, and monitor service performance against those targets.
- Participate in architecture discussions for high availability, disaster recovery, and scalability.
- Integrate resilience patterns such as circuit breakers, retries, and bulkheads.
- Use chaos engineering / fault injection practices where applicable.
Automation & Infrastructure as Code :
- Automate infrastructure and operations tasks using Terraform, CloudFormation, and AWS CDK.
- Build and maintain CI/CD pipelines with canary deployments and blue/green strategies.
- Implement automation workflows with AWS Lambda, Step Functions, and Azure Functions.
Monitoring & Observability :
- Implement observability systems using Prometheus, Grafana, OpenTelemetry, ELK, and Jaeger.
- Configure proactive monitoring and alerts using AWS CloudWatch / Azure Monitor.
- Ensure visibility into metrics, traces, and logs for troubleshooting.
Cloud Infrastructure Management :
- Provision and manage VMs, storage, networking, VPNs, and ExpressRoute/Peering.
- Manage patching, backups, encryption, decryption, and image management.
- Optimize performance and cost via rightsizing, autoscaling, and reserved instances.
- Manage identity and access controls (AWS IAM, Azure AD, RBAC).
Security & Compliance :
- Implement and enforce security best practices across multi-cloud environments.
- Ensure compliance with GDPR, HIPAA, and industry regulations.
- Conduct regular audits and compliance reporting.
Mentoring & Knowledge Sharing :
- Coach and mentor Tier 2 and junior SREs.
- Conduct reliability-focused design reviews.
- Maintain up-to-date documentation, runbooks, and SOPs.
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1585594