Posted on: 27/01/2026

Description :
Role Summary :
We are seeking an experienced Manager, Site Reliability Engineering (SRE), to lead a high-impact team responsible for the stability, availability, reliability, security, and performance of large-scale distributed systems. This role combines technical leadership, people management, and operational excellence, driving automation, resilience, and engineering productivity across platforms and services.
You will lead initiatives across infrastructure reliability, application operations, service management, security posture, and engineering enablement, while partnering with product, engineering, and vendor teams to deliver highly reliable, scalable, and secure services.
Key Responsibilities :
SRE Leadership & Team Management :
- Build and lead high-performing SRE teams through strong organizational design, mentoring, and coaching
- Foster a culture of reliability engineering, automation-first thinking, and continuous improvement
- Provide technical guidance and subject matter expertise to team members
- Drive skills development, training programs, and career growth for SRE engineers
Reliability Engineering & Operations :
- Lead the design, implementation, and operation of reliable, scalable distributed systems
- Define and manage SLOs, SLIs, SLAs, error budgets, and reliability metrics
- Drive incident management, root cause analysis (RCA), and post-incident reviews
- Improve platform reliability, performance, resilience, and service availability
- Lead capacity planning, performance optimization, and service efficiency initiatives
Automation, Tooling & Platform Engineering :
- Drive self-service tooling, internal platforms, and engineering productivity frameworks
- Implement automation for infrastructure provisioning, monitoring, deployments, and operations
- Standardize tooling, processes, and operational frameworks across teams
- Improve CI/CD, observability, and operational workflows
Security & Risk Management :
- Lead cybersecurity initiatives including :
  - Vulnerability management and remediation
  - Secure configuration and hardening
  - Security testing and validation
  - Incident response and remediation
- Partner with security teams to implement technical controls and compliance standards
Cross-Functional Collaboration :
- Partner with engineering, product, architecture, and vendor teams to :
  - Ensure reliable software releases
  - Improve production quality and engineering productivity
  - Drive operational excellence and service maturity
- Support release management, deployment strategies, and production readiness
Governance & Operational Excellence :
- Drive change management, risk management, and service governance programs
- Develop and optimize processes, procedures, and operational frameworks
- Support departmental planning, budgeting, and operational efficiency initiatives
Required Skills & Experience :
- 10+ years of experience in SRE, DevOps, Infrastructure Engineering, or Platform Engineering
- Strong expertise in :
  - Distributed systems architecture
  - Cloud platforms (AWS/Azure/GCP or hybrid environments)
  - Linux/Unix systems
  - Networking fundamentals
- Deep understanding of :
  - Observability (monitoring, logging, tracing, alerting)
  - Incident management & reliability engineering practices
  - CI/CD pipelines and automation
  - Infrastructure as Code (IaC)
- Strong background in systems design, performance tuning, and scalability
- Experience managing production systems at scale
- Proven people leadership and team management experience
Technical Stack (Typical Environment) :
- Cloud : AWS / Azure / GCP
- Containers & Orchestration : Docker, Kubernetes
- CI/CD : Jenkins, GitHub Actions, GitLab CI
- IaC : Terraform, CloudFormation, Ansible
- Observability : Prometheus, Grafana, ELK, Datadog, Splunk
- Scripting/Programming : Python, Go, Bash
- Version Control : Git
Leadership & Behavioral Competencies :
- Strong decision-making and judgment
- Excellent stakeholder management and communication
- Ability to operate in high-pressure, production-critical environments
- Strategic mindset with hands-on technical depth
- Strong collaboration and influence skills
Work Requirements :
- Participation in on-call leadership rotations
- Ability to support 24/7 production environments
- Flexible scheduling, including nights/weekends when required for critical operations
Why This Role Matters :
This role is critical to ensuring platform reliability, service resilience, engineering velocity, and customer trust. You will shape the reliability strategy, define operational excellence standards, and build the foundation for scalable, secure, and high-performing digital platforms.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1606164