Posted on: 08/04/2026
Description:
- Manage and optimize Kubernetes clusters (EKS or self-managed)
- Define and enforce SLIs, SLOs, and error budgets
- Build and maintain Infrastructure as Code (IaC) using Terraform
- Develop and manage CI/CD pipelines using GitHub Actions
- Automate infrastructure and operational workflows using Python
- Improve system reliability, latency, and performance
- Implement observability solutions (metrics, logs, traces)
- Lead incident response, root cause analysis (RCA), and postmortems
- Reduce toil through automation and continuous improvement
- Ensure security, compliance, and cost efficiency of infrastructure
Required Skills & Qualifications:
- Deep understanding of Kubernetes (cluster operations, scaling, networking)
- Experience with Terraform for infrastructure provisioning
- Proficiency in Docker and container ecosystems
- Hands-on experience with GitHub Actions (CI/CD pipelines)
- Strong scripting skills in Python for automation
- Experience with monitoring tools such as Prometheus, Grafana, and Splunk
- Solid understanding of Linux systems, networking, and distributed systems
- Experience with incident management and on-call processes
Core SRE Focus Areas:
- Monitoring, Alerting & Observability
- Incident Management & Postmortems
- Capacity Planning & Performance Tuning
- Infrastructure Automation & Self-Healing Systems
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1627008