Senior Site Reliability Engineer - IAC Terraform

EMBARK
Bangalore
6 - 9 Years

Posted on: 29/08/2025

Job Description

Key Responsibilities :


SRE & Application Reliability :


- Implement and tune SLOs/SLIs, build reliability dashboards, and respond to incidents using Grafana IRM, JSM, and escalation workflows.

- Monitor application performance and availability across Kubernetes clusters using Grafana, Prometheus, Loki, Mimir, and Tempo.

- Participate in on-call rotation, postmortems, and continual improvement processes.

Application Support & Troubleshooting :


- Act as the primary escalation point for production issues, whether internal or client-facing.

- Monitor logs, traces, and alerts to proactively identify and resolve incidents.

- Debug issues across the stack: Kubernetes, Helm releases, application logs, API errors, database bottlenecks.

- Coordinate with development, QA, and client teams to ensure timely and effective resolution of issues.

DevOps & Infrastructure Automation :


- Implement GitOps workflows using FluxCD and ArgoCD to manage Kubernetes deployments.

- Manage and maintain infrastructure-as-code using Terraform and Terragrunt, on Azure (preferred).

- Automate CI/CD pipelines with GitHub Actions for Docker image builds, Helm-based deployments, release tagging, etc.

Post-QA & Release Validation :


- Work closely with QA engineers to validate release branches, tag images, and verify integration across services.

- Test application functionality after deployments (sanity and product functional tests).

- Assist in defining performance benchmarks (e.g., pgbench for PostgreSQL clusters) and validate pre-production readiness.

Must-Have Qualifications :


- 6-8 years of experience in DevOps, SRE, or Production Support roles.

- Strong hands-on experience with Azure and Kubernetes (AKS preferred) and Helm/Kustomize.

- Solid knowledge of GitHub Actions, GitOps (FluxCD/ArgoCD), and Terraform/Terragrunt.

- Experience with monitoring/logging stacks: Grafana, Prometheus, Loki, Tempo, Mimir, and incident response tools.

- Experience debugging microservices written in Node.js, Go, or similar.

- Excellent troubleshooting and debugging skills across the stack.


The job is for:

Women candidates preferred