
Saarthee - Senior Site Reliability Engineer

Saarthee Technology Pvt Ltd
Multiple Locations
6 - 10 Years

Posted on: 19/01/2026

Job Description


Position Summary :

We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems.

This hands-on role focuses on designing, building, and operating reliable, observable, and scalable platforms on Kubernetes, preferably on Google Cloud Platform (GCP) or, alternatively, on AWS.

Job Responsibilities :

- Design, implement, and operate highly available and resilient Kubernetes-based systems.

- Define, monitor, and enforce SLIs, SLOs, and error budgets to ensure service reliability.

- Lead incident response, root cause analysis (RCA), and postmortems, driving continuous improvement.

- Architect and manage observability platforms for metrics, logging, tracing, and alerting.

- Work hands-on with Prometheus, Alertmanager, OpenTelemetry, Grafana, and Loki / ELK / OpenSearch.

- Implement cloud-native monitoring and logging, with preference for GCP Cloud Monitoring & Logging.

- Establish actionable alerting standards to reduce noise and improve response effectiveness.

- Build and manage cloud infrastructure on GCP (preferred) or AWS.

- Operate and scale Kubernetes clusters (GKE preferred) and deploy services using Helm.

- Manage containerized workloads using Docker.

- Develop automation and internal tooling using Python to improve reliability and observability.

- Integrate CI/CD pipelines with reliability and monitoring checks.

- Mentor junior engineers, influence architectural decisions, and collaborate across engineering teams.

Required Skills and Qualifications :

- 6+ years of experience as a DevOps Engineer or SRE, or in a related software engineering role, supporting production-grade systems.

- Strong hands-on experience with cloud infrastructure on GCP (preferred) or AWS.

- Proven expertise in operating Kubernetes-based platforms in production environments (GKE preferred).

- Solid experience designing and maintaining highly available and resilient systems using SRE best practices.

- Hands-on knowledge of SLIs, SLOs, error budgets, and reliability engineering principles.

- Strong experience with observability and monitoring tools, including Prometheus, Grafana, Alertmanager, OpenTelemetry, and log platforms such as Loki / ELK / OpenSearch.

- Demonstrated experience in incident management, on-call support, root cause analysis, and postmortems.

- Proficiency in automation and tooling using Python, with additional scripting experience in Shell or Groovy.

- Experience integrating CI/CD pipelines (Jenkins, GitHub) with deployment, monitoring, and reliability checks.

- Strong understanding of microservices architectures, distributed systems, and containerized workloads.

- Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.

- Good knowledge of cloud networking, security fundamentals, and access controls.

- Strong analytical and problem-solving skills with a proactive operational mindset.

- Excellent communication skills and the ability to collaborate effectively with cross-functional engineering teams.

