Posted on: 07/09/2025
Position : Site Reliability Engineer
Experience : 5 - 9 Years
Location : Bangalore, India
Job Summary :
We are seeking a Site Reliability Engineer (SRE) with 5-9 years of experience to join our Platform Engineering team. This role is crucial for ensuring the high availability, performance, and scalability of our AI-powered code review platform. You will operate at the intersection of software engineering and systems operations, building the foundational platforms and automation that enable our engineering teams to deploy, monitor, and scale our services reliably.
You will be instrumental in enhancing the reliability of critical services that process millions of code reviews, building sophisticated automation platforms, and owning the infrastructure that powers our AI-driven analysis engine. This role involves working with cutting-edge technologies, including large language models, real-time processing systems, and distributed architectures.
Key Responsibilities :
Infrastructure and Platform Ownership :
- Design, implement, and maintain scalable infrastructure on Google Cloud Platform (GCP).
- Own and operate critical platform services.
- Build and maintain Infrastructure as Code (IaC) using Terraform to ensure consistent and reproducible deployments.
Reliability and Performance Engineering :
- Implement and maintain SLI/SLO frameworks to meet reliability commitments.
- Deploy comprehensive monitoring, alerting, and observability solutions using Datadog and custom instrumentation.
- Conduct thorough incident response, root cause analysis, and post-mortems to continuously improve system reliability.
- Optimize application and infrastructure performance.
- Design and implement chaos engineering practices to proactively identify system weaknesses.
Automation and Developer Experience :
- Develop self-service platforms and tooling that empower engineering teams to deploy, monitor, and troubleshoot their services independently.
- Automate operational tasks such as scaling, backup/recovery, and security patching.
- Create and maintain infrastructure APIs and abstractions that simplify complex operations for development teams.
Security and Compliance :
- Integrate security best practices into all infrastructure and platform services, including security monitoring, vulnerability scanning, and compliance reporting.
- Design secure network architectures and establish disaster recovery and business continuity plans.
Required Skills & Qualifications :
Core Experience :
- Experience with AI/ML infrastructure and tools.
- Background in managing high-traffic web applications and API services.
- Experience with disaster recovery planning and execution.
- Knowledge of FinOps practices and cost optimization.
- Experience with performance testing and capacity planning methodologies.
- Contributions to open-source SRE or infrastructure tooling projects.
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1542140