Job Title : Senior Site Reliability Engineer

Experience : 4-7 Years

Employment : Full-time

Location : Remote

Job Description :

We are looking for a Senior SRE to join our fully remote team and own the reliability, scalability, and performance of our global infrastructure. You won't just be "managing" clusters; you will be architecting multi-region resilience, building sophisticated CI/CD pipelines, and writing the Python automation that keeps our systems self-healing.

Technical Requirements :

- Kubernetes Mastery : Deep expertise in cluster architecture, RBAC, networking, and workload isolation. Experience with Helm, Operators, and scaling (HPA/Cluster Autoscaler) is essential.

- Programming & Scripting : Strong hands-on proficiency in Python. You should be comfortable writing scripts to interact with APIs and automate infrastructure.

- CI/CD & GitOps : Proven experience building and optimizing deployment pipelines using GitHub Actions.

- Observability Stack : Experience implementing monitoring and tracing for complex distributed systems (specifically K8s and Kafka).

Key Responsibilities :

- Architect Resilience : Design and operate multi-AZ and multi-region Kubernetes deployments, ensuring DR (Disaster Recovery) readiness and seamless cluster switchovers.

- Engineer for Reliability : Define SLIs, SLOs, and error budgets. You'll be the champion of "system health," building end-to-end observability (metrics, logs, traces) across K8s and Kafka.

- Automate Everything : This is not a "manual click" role. You will use Python and GitHub Actions to build robust CI/CD workflows and automate complex operational tasks.

- Lead through Incidents : Drive incident response and conduct blameless postmortems that result in long-term engineering fixes, not just temporary patches.

Why Join Us?

- 100% Remote : Work from wherever you are most productive.

- High Impact : You'll have a direct hand in defining the best practices and reliability standards for our entire engineering org.

- Complex Challenges : Solve high-scale problems involving Kafka, multi-region architecture, and high-availability systems.