We are seeking a Site Reliability Engineer (SRE) with 5-9 years of experience to join our Platform Engineering team. The role is central to the availability, performance, and scalability of our AI-powered code review platform. You will work at the intersection of software engineering and systems operations, building the platforms and automation that let our engineering teams deploy, monitor, and scale services reliably.

You will improve the reliability of critical services that process millions of code reviews, build automation platforms, and own the infrastructure behind our AI-driven analysis engine. The work involves large language models, real-time processing systems, and distributed architectures.

Key Responsibilities :

Infrastructure and Platform Ownership :

- Design, implement, and maintain scalable infrastructure on Google Cloud Platform (GCP).

- Own and operate critical platform services, and build and maintain Infrastructure as Code (IaC) with Terraform to ensure consistent, reproducible deployments.

Reliability and Performance Engineering :

- Implement and maintain SLI/SLO frameworks to meet reliability commitments.

- Deploy comprehensive monitoring, alerting, and observability solutions using Datadog and custom instrumentation.

- Conduct incident response, root cause analysis, and post-mortems to continuously improve system reliability.

- Optimize application and infrastructure performance, and design and implement chaos engineering practices to proactively identify system weaknesses.

Automation and Developer Experience :

- Develop self-service platforms and tooling that empower engineering teams to deploy, monitor, and troubleshoot their services independently.

- Automate operational tasks such as scaling, backup/recovery, and security patching.

- Create and maintain infrastructure APIs and abstractions that simplify complex operations for development teams.

Security and Compliance :

- Integrate security best practices into all infrastructure and platform services, including security monitoring, vulnerability scanning, and compliance reporting.

- Design secure network architectures and establish disaster recovery and business continuity plans.

Required Skills & Qualifications :

Core Experience :

- 5+ years of hands-on experience in Site Reliability Engineering, Platform Engineering, or DevOps roles.

- A proven track record of managing production systems at scale in high-growth technology companies.

Technical Proficiency :

- Programming Languages : Proficiency in Node.js and TypeScript for building automation tools.

- Infrastructure as Code : Advanced experience with Terraform.

- Monitoring & Observability : Hands-on experience with Datadog or similar platforms such as Prometheus, Grafana, or the ELK stack.

- Cloud Platforms : Comprehensive experience with GCP services, including Compute Engine, GKE, Cloud Run, Cloud SQL, and Cloud Storage.

- Strong Linux/Unix systems skills.

- Experience with Kubernetes and Docker.

- Understanding of microservices architecture and distributed systems principles.

Preferred Skills :

- Experience with AI/ML infrastructure and tools.

- Background in managing high-traffic web applications and API services.

- Experience with disaster recovery planning and execution.

- Knowledge of FinOps practices and cost optimization.

- Experience with performance testing and capacity planning methodologies.

- Contributions to open-source SRE or infrastructure tooling projects.