Posted on: 23/03/2026
Role Overview:
SolarWinds is looking for a Senior Manager, Site Reliability Engineering (SRE) to lead reliability, scalability, and operational excellence for large-scale, cloud-native, data-intensive SaaS platforms.
This role combines people leadership, technical depth, and operational ownership.
You will manage and grow SRE teams responsible for production systems while remaining close to platform architecture, reliability engineering, incident response, and automation strategy.
The ideal candidate has operated distributed systems in production environments and is comfortable guiding teams through complex troubleshooting, reliability improvements, and architectural decisions.
This role requires balancing availability, performance, operational efficiency, and engineering velocity across large-scale SaaS services.
Responsibilities:
- Lead and mentor SRE teams responsible for the reliability, availability, and performance of production SaaS platforms.
- Own and drive production reliability outcomes, including uptime, latency, scalability, capacity planning, and operational readiness.
- Oversee data-intensive distributed systems such as ClickHouse, Kafka, ZooKeeper, MySQL, Redis, and Flink.
- Guide and review Kubernetes platform operations at scale, including cluster lifecycle management, upgrades, troubleshooting, and capacity planning.
- Establish and evolve SRE practices, including SLIs/SLOs, alerting strategies, incident management, and post-incident reviews.
- Lead and participate in production incident response, guiding teams through debugging, root cause analysis, and long-term remediation.
- Promote and enforce an automation-first approach, reducing manual operational work through scripting, tooling, and platform improvements.
- Partner with Engineering, Platform, Product, and Security teams to embed reliability into system design and delivery.
- Drive adoption of GitOps, service mesh, and observability practices across teams.
- Lead cloud infrastructure operations across AWS and Azure, ensuring secure, resilient, and cost-effective platform operations.
- Provide technical mentorship and guidance, helping engineers diagnose complex production issues and improve system reliability.
Must-Have Qualifications:
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1622768