Posted on: 03/02/2026
Description :
Key Responsibilities :
- Design, implement, and maintain highly available, scalable, and resilient systems
- Lead and execute Chaos Engineering experiments to identify system weaknesses and improve reliability
- Define and track SLIs, SLOs, and SLAs across critical services
- Automate infrastructure provisioning and operations using Infrastructure as Code (IaC)
- Build and maintain monitoring, alerting, and observability solutions
- Perform incident management, root cause analysis (RCA), and postmortems
- Collaborate with development teams to improve system reliability and production readiness
- Optimize system performance and cost, and own capacity planning
- Ensure adherence to best practices for security, compliance, and disaster recovery
Required Skills & Qualifications :
- 5+ years of experience as a Site Reliability Engineer / DevOps Engineer / Platform Engineer
- Strong hands-on experience with Chaos Engineering tools (e.g., Chaos Monkey, Gremlin, LitmusChaos, Chaos Mesh)
- Solid experience with Linux/Unix systems and networking fundamentals
- Strong programming/scripting skills in Python, Go, or Shell
- Hands-on experience with cloud platforms (AWS / Azure / GCP)
- Experience with containerization and orchestration (Docker, Kubernetes)
- Proficiency in CI/CD pipelines and automation tools
- Strong experience with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, New Relic)
- Understanding of distributed systems and microservices architecture
Good to Have :
- Experience with service mesh (Istio, Linkerd)
- Knowledge of Terraform, Ansible, or Pulumi
- Exposure to resilience patterns (circuit breakers, bulkheads, rate limiting)
- Experience with multi-region / multi-cloud architectures
- Certifications such as CKA, CKAD, or AWS/GCP/Azure credentials
Education :
- Bachelor's degree in Computer Science, Engineering, or a related field
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1609109