hirist

Saarthee - Senior Site Reliability Engineer

Saarthee Technology Pvt Ltd
7 - 12 Years
Bangalore

Posted on: 10/03/2026

Job Description

Important Note:

- We are considering only local candidates for this requirement.

- Candidates must be available for face-to-face interviews on short notice.

Job Overview:

We are looking for a Senior Site Reliability Engineer (SRE) with strong expertise in observability, cloud-native platforms, and Kubernetes-based systems. This is a hands-on role focused on building, operating, and improving reliable, scalable, and observable platforms in GCP (preferred) and AWS environments.

Key Responsibilities:

Reliability & Operations:

- Design and maintain highly available, resilient systems on Kubernetes

- Define and manage SLOs, SLIs, and error budgets

- Lead incident response, perform RCA, and drive blameless postmortems

- Improve platform reliability through automation and tooling

Observability (Core Focus):

- Build and operate centralized observability platforms (metrics, logs, traces, alerts)

- Work hands-on with Prometheus, Alertmanager, and Grafana

- Implement logging and tracing using ELK/OpenSearch, Loki, and OpenTelemetry

- Implement cloud-native monitoring (GCP Cloud Monitoring & Logging preferred)

- Define actionable and noise-free alerting standards

Cloud & Platform Engineering:

- Build and manage infrastructure on GCP (preferred) or AWS

- Operate Kubernetes clusters (GKE preferred, EKS acceptable)

- Deploy services using Helm

- Manage containerized workloads with Docker

- Use Terraform / Ansible / Packer for infrastructure automation

Automation & Tooling:

- Apply strong Python skills to automation and reliability tooling

- Build internal tools for observability, SLO tracking, and incident workflows

- Integrate CI/CD pipelines (Jenkins) with reliability and observability checks

Collaboration & Leadership:

- Mentor junior engineers

- Influence architecture and reliability best practices

- Collaborate closely with platform, application, and cloud teams

Mandatory Skills:

- Site Reliability Engineering (SRE)

- Python (coding, not just scripting)

- ELK stack

- Kubernetes

- AWS and/or GCP

- Prometheus, Grafana

- Docker, Helm

- Terraform

- Linux

- CI/CD (Jenkins)

Nice to Have:

- Splunk, Datadog, Cribl, Vector

- OpenTelemetry

- Multi-cloud experience

- Platform security exposure

Project Highlights:

- Build and operate a centralized observability platform

- Drive SLOs and error budgets to reduce MTTR

- Lead production incident response

- Optimize scalability, performance, and cloud costs

- Act as a technical leader for SRE & observability initiatives