Posted on: 25/09/2025
About the Role:
This position requires deep expertise in Google Cloud Platform (GCP), SRE principles, and DevOps best practices. You will be responsible for designing and implementing infrastructure, improving observability, and maintaining SLAs for services running at scale.
Key Responsibilities:
- Implement SRE best practices: monitoring, alerting, SLAs, SLOs, and error budgets.
- Automate operational tasks using Infrastructure as Code (IaC) tools like Terraform.
- Improve system reliability and reduce manual interventions through automation.
- Collaborate with development teams to ensure new services are production-ready.
- Handle incident response and perform post-mortem analysis to prevent recurring issues.
- Design and implement CI/CD pipelines for rapid and safe deployments.
- Manage GCP resources, including IAM, VPC, Compute Engine, GKE, Cloud Functions, Pub/Sub, and BigQuery.
- Ensure security, compliance, and cost optimization across the cloud infrastructure.
Required Skills & Qualifications:
- Strong hands-on experience with Google Cloud Platform (GCP) services.
- Proficiency with Terraform or other IaC tools.
- Solid knowledge of Kubernetes (GKE), containerization, and microservices.
- Strong scripting skills in Python, Go, or Shell.
- Familiarity with incident response and post-mortem culture.
- Knowledge of networking, security, and cloud cost management.
Preferred Qualifications:
- Prior experience working with e-commerce or high-scale platforms.
- Familiarity with SRE practices and tooling such as chaos engineering and service mesh (Istio).
Soft Skills:
- Problem-solving mindset with a focus on reliability and automation.
- Ability to work independently in a distributed, outsourced team model.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1551612