Posted on: 10/03/2026
Important Note :
- We are considering only local candidates for this requirement.
- Candidates must be available for face-to-face interviews on short notice.
Job Overview :
We are looking for a Senior Site Reliability Engineer (SRE) with strong expertise in observability, cloud-native platforms, and Kubernetes-based systems. This is a hands-on role focused on building, operating, and improving reliable, scalable, and observable platforms in GCP (preferred) and AWS environments.
Key Responsibilities :
Reliability & Operations :
- Design and maintain highly available, resilient systems on Kubernetes
- Define and manage SLOs, SLIs, and error budgets
- Lead incident response, perform RCA, and drive blameless postmortems
- Improve platform reliability through automation and tooling
Observability (Core Focus) :
- Build and operate centralized observability platforms (metrics, logs, traces, alerts)
- Work hands-on with Prometheus, Alertmanager, and Grafana
- Implement logging and tracing using ELK / OpenSearch, Loki, and OpenTelemetry
- Implement cloud-native monitoring (GCP Cloud Monitoring & Logging preferred)
- Define actionable and noise-free alerting standards
Cloud & Platform Engineering :
- Build and manage infrastructure on GCP (preferred) or AWS
- Deploy services using Helm
- Manage containerized workloads with Docker
- Use Terraform / Ansible / Packer for infrastructure automation
Automation & Tooling :
- Apply strong Python skills to automation and reliability tooling
- Build internal tools for observability, SLO tracking, and incident workflows
- Integrate CI/CD pipelines (Jenkins) with reliability and observability checks
Collaboration & Leadership :
- Mentor junior engineers
- Influence architecture and reliability best practices
- Collaborate closely with platform, application, and cloud teams
Mandatory Skills :
- Site Reliability Engineering (SRE)
- Python (coding, not just scripting)
- ELK stack
- Kubernetes
- AWS and/or GCP
- Prometheus, Grafana
- Docker, Helm
- Terraform
- Linux
- CI/CD (Jenkins)
Nice to Have :
- Splunk, Datadog, Cribl, Vector
- OpenTelemetry
- Multi-cloud experience
- Platform security exposure
Project Highlights :
- Build and operate a centralized observability platform
- Lead production incident response
- Optimize scalability, performance, and cloud costs
- Act as a technical leader for SRE & observability initiatives
Posted by: Recruiter, HR at Saarthee Technology Pvt Ltd
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1619386