Posted on: 20/04/2026
Role Overview :
We are seeking an experienced SRE Manager to lead and scale our Site Reliability Engineering function. You will be responsible for system reliability, scalability, and performance, and will drive automation and operational excellence across the platform.
The ideal candidate will have a strong background in DevOps, cloud infrastructure, and distributed systems, along with proven experience in team leadership and incident management.
Key Responsibilities :
Team Leadership & Management :
- Lead, mentor, and manage a team of SRE/DevOps engineers.
- Foster a culture of ownership, reliability, and continuous improvement.
- Drive team performance, skill development, and the adoption of best practices.
Reliability & Performance Engineering :
- Define and implement SRE frameworks including SLIs, SLOs, and error budgets.
- Ensure high levels of system availability, scalability, and performance.
- Continuously monitor and optimize system health and uptime.
Automation & DevOps Practices :
- Drive automation initiatives across infrastructure, deployment, and operations.
- Own and optimize CI/CD pipelines and release management processes.
- Promote infrastructure as code (IaC) and DevOps best practices.
Incident Management & RCA :
- Lead incident response processes, ensuring timely resolution of production issues.
- Conduct root cause analysis (RCA) and implement preventive measures.
- Establish processes to minimize downtime and improve resilience.
Observability & Monitoring :
- Design and implement monitoring, logging, and observability frameworks.
- Use monitoring and alerting tools to detect and resolve system anomalies proactively.
Cloud & Infrastructure Management :
- Manage and optimize cloud infrastructure across AWS, Azure, or GCP.
- Ensure efficient resource utilization, cost optimization, and scalability.
- Implement disaster recovery and business continuity plans.
Cross-Functional Collaboration :
- Collaborate with engineering, product, and operations teams to improve system reliability.
- Align infrastructure and reliability strategies with business goals.
Required Skills & Experience :
- 7+ years of experience in SRE/DevOps roles.
- 3+ years of experience in team management or leadership roles.
- Hands-on experience with cloud platforms (AWS, Azure, or GCP).
- Strong experience with CI/CD tools such as Jenkins, GitLab CI, or similar.
- Expertise in containerization and orchestration (Docker, Kubernetes).
- Proficiency in scripting languages such as Python or Bash.
- Experience with Infrastructure as Code (Terraform, CloudFormation).
- Strong knowledge of monitoring and observability tools (Prometheus, Grafana, ELK).
Preferred Qualifications :
- Experience working with microservices architecture.
- Relevant cloud certifications.
- Exposure to high-scale or e-commerce platforms.
- Knowledge of chaos engineering practices.
Key Competencies :
- Leadership and team management
- Problem-solving and analytical thinking
- Ownership and accountability
- Stakeholder management and communication
- Continuous improvement mindset
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1629756