Posted on: 28/01/2026
Role : SRE Lead
Andersen is hiring a Site Reliability Engineering Lead to drive reliability and performance for large-scale digital insurance platforms, enhancing integrations, optimizing cloud systems, and ensuring stable, high-quality service delivery.
The customer is a well-established global organization providing financial protection and risk-management services across various markets. With a diverse portfolio and teams operating in multiple regions, the company supports businesses and individuals through reliable, scalable solutions.
The project focuses on enhancing large-scale digital platforms, improving cloud performance, optimizing integrations, and modernizing systems to support efficient service delivery and ongoing expansion.
Responsibilities :
- Piloting SRE adoption, assessing current Digital Applications architecture, and implementing highly reliable, fault-tolerant design patterns.
- Defining Critical User Journeys (CUJs), SLOs/SLIs, and error budgets, ensuring alignment with business needs and user experience.
- Maintaining a prioritized toil backlog to drive automation and operational efficiency.
- Coaching production support teams on SRE principles and practices.
- Working with Regional and Global Digital teams to support SRE rollout and adoption.
- Preparing and delivering training sessions and materials, fostering continuous improvement.
- Providing recommendations on system architecture, fault tolerance, and disaster recovery.
- Delivering uptime, performance, and availability targets through SLIs, SLOs, SLAs, and error budgets.
- Monitoring risks to ensure service reliability and minimize disruptions.
- Embedding CUJ-level metrics and telemetry into all relevant services.
- Implementing observability platforms, ensuring full monitoring coverage.
- Building actionable dashboards, alerts, and reports using standard observability tools (including OpenTelemetry).
- Automating deployments, failover, scaling, and remediation processes.
- Eliminating manual work by promoting automation, improved tooling, and optimized workflows.
- Leading incident response during outages and conducting root cause analysis.
- Developing automated remediation for common failure scenarios.
- Participating in on-call rotations and conducting blameless post-mortems with corrective actions.
Must-haves :
- 15+ years of experience in infrastructure teams.
- Strong technical background with solid knowledge of software development principles, application production support, SDLC best practices, and Agile methodology.
- Hands-on SRE experience with a strong understanding of SLOs, SLIs, error budgets, incident management, and blameless post-mortems.
- Strong ability to analyze and understand application architectures and identify areas for improvement.
- Experience working with monitoring, logging, and observability tools to assess and improve application performance.
- Proficiency in scripting and automation tools, including Python, Bash, and Terraform, to reduce toil and enhance operational efficiency.
- Strong incident response and troubleshooting skills with the ability to perform effective root cause analysis.
- Excellent communication and collaboration skills, enabling effective interaction with cross-functional teams and clear explanation of technical concepts.
- Ability to coach and mentor team members in SRE practices and support the development of a reliability-focused culture.
- Practical experience working in Agile teams and applying Agile development practices.
- Proactive mindset focused on continuous improvement to increase system reliability and performance.
- English level: Intermediate+ or higher.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1606827