Posted on: 17/12/2025
Role Overview :
We are seeking a Senior Site Reliability Engineer with strong experience in building and maintaining scalable, resilient systems.
The ideal candidate will have hands-on expertise in cloud-native technologies, infrastructure as code, observability, and automation, with a focus on Google Cloud Platform (GCP).
Key Responsibilities :
- Ensure the stability and reliability of cloud-native applications deployed on GCP, containerized with Docker and orchestrated via Kubernetes.
- Define, implement, and monitor SLOs, SLAs, and SLIs to measure system performance and user experience.
- Automate infrastructure provisioning using Terraform and manage Kubernetes configurations with Kustomize and Helm.
- Develop and maintain monitoring and alerting systems using Datadog and GCP-native tools.
- Conduct incident analysis and postmortems to drive continuous improvement.
- Collaborate with development teams to integrate reliability practices into CI/CD pipelines using GitHub Actions.
- Manage and troubleshoot database systems, particularly PostgreSQL and Cassandra.
- Apply networking knowledge and Linux system administration skills to troubleshoot and optimize system connectivity and performance.
Qualifications :
Education :
Bachelor's or Master's degree in Computer Science, Software Engineering, or equivalent practical experience.
Work Experience & Skills :
- 5+ years of experience in Site Reliability Engineering.
- Proven experience designing and operating elastic, resilient systems in cloud environments.
- Strong understanding of GCP, Kubernetes, and container orchestration.
- Proficiency in infrastructure as code and configuration management tools (Terraform, Helm, Kustomize).
- Experience with monitoring and observability tools (Datadog, GCP Monitoring).
- Solid scripting skills in Bash and familiarity with automation frameworks.
- Experience with CI/CD pipelines, especially using GitHub Actions.
- Familiarity with networking fundamentals and troubleshooting.
- Strong coding skills and ability to develop reliability-focused tooling.
- Excellent communication skills in English (written and spoken).
Other Requirements :
- Strong problem-solving skills and a process-oriented mindset.
- Ability to work independently and collaboratively in a fast-paced environment.
- Passion for clean code, automation, and continuous improvement.
Nice-to-Have :
- Familiarity with additional monitoring tools (e.g., Datadog, Prometheus, GCP Monitoring).
- Experience working in Agile/Scrum teams.
Posted by
Shikhar Gupta
Senior Consultant - Global Talent Management at METRO Business Solution Center
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1592256