Posted on: 29/08/2025
Job Description :
Key Responsibilities :
- Implement and tune SLOs/SLIs, build reliability dashboards, and respond to incidents using Grafana IRM, JSM, and escalation workflows.
- Monitor application performance and availability across Kubernetes clusters using Grafana, Prometheus, Loki, Mimir, and Tempo.
- Participate in on-call rotation, postmortems, and continual improvement processes.
Application Support & Troubleshooting :
- Act as the primary escalation point for production issues, whether internal or client-facing.
- Monitor logs, traces, and alerts to proactively identify and resolve incidents.
- Debug issues across the stack: Kubernetes, Helm releases, application logs, API errors, database bottlenecks.
- Coordinate with development, QA, and client teams to ensure timely and effective resolution of issues.
DevOps & Infrastructure Automation :
- Manage and maintain infrastructure-as-code using Terraform and Terragrunt, with Azure as the preferred cloud platform.
- Automate CI/CD pipelines with GitHub Actions for Docker image builds, Helm-based deployments, release tagging, etc.
Post-QA & Release Validation :
- Work closely with QA engineers to validate release branches, tag images, and verify integration across services.
- Test application functionality after deployments (sanity and product functional tests).
- Assist in defining performance benchmarks (e.g., pgbench for PostgreSQL clusters) and validate pre-production readiness.
Must-Have Qualifications :
- Strong hands-on experience with Azure, Kubernetes (AKS preferred), and Helm/Kustomize.
- Solid knowledge of GitHub Actions, GitOps (FluxCD/ArgoCD), and Terraform/Terragrunt.
- Experience with monitoring/logging stacks (Grafana, Prometheus, Loki, Tempo, Mimir) and incident response tools.
- Experience debugging microservices written in Node.js, Go, or similar.
- Excellent troubleshooting and debugging skills across the stack.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1537420