SolarWinds - Senior Manager - Site Reliability Engineering

Solarwinds India Pvt Ltd
5 - 7 Years
Bangalore

Posted on: 23/03/2026

Job Description

Role Overview:


SolarWinds is looking for a Senior Manager, Site Reliability Engineering (SRE) to lead reliability, scalability, and operational excellence for large-scale, cloud-native, data-intensive SaaS platforms.

This role combines people leadership, technical depth, and operational ownership.

You will manage and grow SRE teams responsible for production systems while remaining close to platform architecture, reliability engineering, incident response, and automation strategy.

The ideal candidate has operated distributed systems in production environments and is comfortable guiding teams through complex troubleshooting, reliability improvements, and architectural decisions.

This role requires balancing availability, performance, operational efficiency, and engineering velocity across large-scale SaaS services.

Responsibilities:


- Lead and mentor SRE teams responsible for the reliability, availability, and performance of production SaaS platforms.

- Own and drive production reliability outcomes, including uptime, latency, scalability, capacity planning, and operational readiness.

- Oversee data-intensive distributed systems built on technologies such as ClickHouse, Kafka, ZooKeeper, MySQL, Redis, and Flink.

- Guide and review Kubernetes platform operations at scale, including cluster lifecycle management, upgrades, troubleshooting, and capacity planning.

- Establish and evolve SRE practices, including SLIs/SLOs, alerting strategies, incident management, and post-incident reviews.

- Lead and participate in production incident response, guiding teams through debugging, root cause analysis, and long-term remediation.

- Promote and enforce an automation-first approach, reducing manual operational work through scripting, tooling, and platform improvements.

- Partner with Engineering, Platform, Product, and Security teams to embed reliability into system design and delivery.

- Drive adoption of GitOps, service mesh, and observability practices across teams.

- Lead cloud infrastructure operations across AWS and Azure, ensuring secure, resilient, and cost-effective platform operations.

- Provide technical mentorship and guidance, helping engineers diagnose complex production issues and improve system reliability.

Must-Have Qualifications:



- Proven experience leading SRE, Platform, or Infrastructure teams supporting production, customer-facing SaaS systems.

- Strong hands-on experience operating Kubernetes clusters in production environments, including:

  - Cluster lifecycle management and upgrades.

  - Troubleshooting platform and workload issues.

  - Autoscaling and resilience mechanisms (HPA, VPA, KEDA, Cluster Autoscaler, Pod Disruption Budgets).

  - Observability and monitoring (Prometheus, Grafana).

- Experience operating distributed data platforms in production environments, such as ClickHouse, Kafka, ZooKeeper, MySQL, Redis, or Flink.

- Practical experience with GitOps and service mesh technologies (e.g., Flux, Kustomize, Istio).

- Strong automation mindset with hands-on experience using Python and/or Go to reduce operational overhead and improve reliability.

- Extensive experience working with AWS and Azure managed services, including EKS/AKS, Aurora, ElastiCache, storage services, load balancers, VPC, and KMS.

- Demonstrated ownership of incident response, root cause analysis, and long-term reliability improvements.

- Ability to collaborate effectively with engineering leadership and cross-functional teams.

SolarWinds is an Equal Employment Opportunity Employer.

SolarWinds will consider all qualified applicants for employment without regard to race, color, religion, sex, age, national origin, sexual orientation, gender identity, marital status, disability, veteran status or any other characteristic protected by law.

