Role Overview :

We are seeking a Senior Site Reliability Engineer with strong experience in building and maintaining scalable, resilient systems.

The ideal candidate will have hands-on expertise in cloud-native technologies, infrastructure as code, observability, and automation, with a focus on Google Cloud Platform (GCP).

Key Responsibilities :

- Ensure the stability and reliability of cloud-native applications deployed on GCP, containerized with Docker and orchestrated via Kubernetes.

- Define, implement, and monitor SLOs, SLAs, and SLIs to measure system performance and user experience.

- Automate infrastructure provisioning using Terraform and manage Kubernetes configurations with Kustomize and Helm.

- Develop and maintain monitoring and alerting systems using Datadog and GCP-native tools.

- Conduct incident analysis and postmortems to drive continuous improvement.

- Collaborate with development teams to integrate reliability practices into CI/CD pipelines using GitHub Actions.

- Manage and troubleshoot database systems, particularly PostgreSQL and Cassandra.

- Apply networking knowledge and Linux system administration skills to troubleshoot and optimize system connectivity and performance.

Qualifications :

Education :

Bachelor's or Master's degree in Computer Science or Software Engineering, or equivalent practical experience.

Work Experience & Skills :

- 5+ years of experience in Site Reliability Engineering.

- Proven experience designing and operating elastic, resilient systems in cloud environments.

- Strong understanding of GCP, Kubernetes, and container orchestration.

- Proficiency in infrastructure as code and configuration management tools (Terraform, Helm, Kustomize).

- Experience with monitoring and observability tools (Datadog, GCP Monitoring).

- Solid scripting skills in Bash and familiarity with automation frameworks.

- Experience with CI/CD pipelines, especially using GitHub Actions.

- Familiarity with networking fundamentals and troubleshooting.

- Strong coding skills and ability to develop reliability-focused tooling.

- Excellent communication skills in English (written and spoken).

Other Requirements :

- Strong problem-solving skills and a process-oriented mindset.

- Ability to work independently and collaboratively in a fast-paced environment.

- Passion for clean code, automation, and continuous improvement.

Nice-to-Have :

- Familiarity with monitoring tools (e.g., Datadog, Prometheus, GCP Monitoring).

- Experience working in Agile/Scrum teams.