About The Role :

We are looking for a highly experienced Staff Site Reliability Engineer (SRE) to drive the reliability, performance, and operational excellence of our core production systems.

This is a senior, hands-on role that requires deep expertise in large-scale distributed systems, complex incident management, and building world-class observability platforms.

Key Responsibilities :

Reliability Engineering :

- Define, measure, and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical platform services.

- Drive down toil by promoting self-service and automation.

Observability Platform :

- Lead the design and implementation of our global observability stack, including metric collection (Prometheus/M3DB), distributed tracing (Jaeger/OpenTelemetry), and logging (Loki/Elasticsearch).

Incident Management :

- Act as a technical leader during high-severity incidents, perform in-depth Root Cause Analysis (RCA), and implement long-term preventative measures.

Performance Tuning :

- Conduct performance analysis and capacity planning for the entire platform, identifying and eliminating infrastructure and application bottlenecks.

Security & Compliance :

- Partner with the security team to enforce security controls and best practices across the infrastructure layer.

Mentorship & Evangelism :

- Mentor SRE and DevOps teams, and evangelize reliability best practices and engineering excellence across all product development teams.

Technical Skills (Must-Have) :

Distributed Systems :

- Proven experience designing, running, and debugging large-scale distributed systems and microservices in a high-traffic environment.

Cloud & Kubernetes :

- Expert proficiency in managing highly available Kubernetes clusters (e.g., K8s on GCP/AWS/Azure) and their underlying cloud resources.

Observability Stack :

- Deep, hands-on experience with modern observability tools (Prometheus, Grafana, Jaeger/OpenTelemetry).

Programming/Scripting :

- Expert in at least one modern programming language (Go/Python) for writing operators, automation tooling, and extending monitoring systems.

Infrastructure as Code (IaC) :

- Advanced knowledge of Terraform for managing multi-cloud infrastructure.

Networking :

- Advanced understanding of networking concepts in a cloud/container environment (service mesh, network policies, load balancing).

Qualifications :

- Bachelor's or Master's degree in Computer Science or a related technical field.

- 8+ years of professional experience in SRE, DevOps, or Infrastructure Engineering roles.

- A track record of implementing reliability improvements that produced measurable gains in SLO adherence.