- Ensure high availability, reliability, and performance of production systems across multiple geographies.

- Design, deploy, and maintain Kubernetes clusters and containerized workloads using Docker and Helm charts.

- Build and manage CI/CD pipelines using Jenkins and Bitbucket.

- Operate and monitor distributed systems using Kafka and RabbitMQ.

- Manage cloud infrastructure on AWS and GCP, including load balancers, autoscaling, and networking.

- Implement observability: monitoring, logging, alerting, and incident response (SLIs, SLOs, SLAs).

- Perform capacity planning, disaster recovery, and failover testing.

- Strengthen network and application security, including IAM, secrets management, and vulnerability remediation.

- Collaborate with backend, frontend, and mobile teams (React.js, Node.js, Android, iOS) to improve system reliability.

- Lead post-incident reviews and drive continuous reliability improvements through automation.

Requirements:

- Strong experience with Kubernetes, Docker, and Helm in production environments.

- Hands-on experience with AWS and/or GCP networking and load balancing.

- Experience operating message brokers (Kafka, RabbitMQ) at scale.

- Proficiency in CI/CD automation (Jenkins, Bitbucket Pipelines).

- Solid understanding of Linux, TCP/IP, DNS, firewalls, and zero-trust principles.

- Scripting skills in Bash, Python, or Node.js.

- Experience supporting global, latency-sensitive applications.

- Strong incident management and root cause analysis skills.

- Experience with chaos engineering and resilience testing.

Good to Have:

- Infrastructure as Code (Terraform, Puppet, or similar).

- Service mesh (Istio/Linkerd), CDN, and WAF experience.

- SOC 2 / ISO 27001 exposure.