Posted on: 29/01/2026
Responsibilities:
- Ensure high availability, reliability, and performance of production systems across multiple geographies.
- Design, deploy, and maintain Kubernetes clusters and containerized workloads using Docker and Helm charts.
- Build and manage CI/CD pipelines using Jenkins and Bitbucket.
- Operate and monitor distributed systems using Kafka and RabbitMQ.
- Manage cloud infrastructure on AWS and GCP, including load balancers, autoscaling, and networking.
- Implement observability: monitoring, logging, alerting, and incident response (SLIs, SLOs, SLAs).
- Perform capacity planning, disaster recovery, and failover testing.
- Strengthen network and application security, including IAM, secrets management, and vulnerability remediation.
- Collaborate with backend, frontend, and mobile teams (React.js, Node.js, Android, iOS) to improve system reliability.
- Lead post-incident reviews and drive continuous reliability improvements through automation.
Requirements:
- Strong experience with Kubernetes, Docker, and Helm in production environments.
- Hands-on experience with AWS and/or GCP networking and load balancing.
- Experience operating message brokers (Kafka, RabbitMQ) at scale.
- Proficiency in CI/CD automation (Jenkins, Bitbucket Pipelines).
- Solid understanding of Linux, TCP/IP, DNS, firewalls, and zero-trust principles.
- Scripting skills in Bash, Python, or Node.js.
- Experience supporting global, latency-sensitive applications.
- Strong incident management and root cause analysis skills.
- Experience with chaos engineering and resilience testing.
Good to Have:
- Infrastructure as Code (Terraform, Puppet, or similar)
- Service mesh (Istio/Linkerd), CDN, and WAF experience
- SOC 2 / ISO 27001 exposure
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1607288