Posted on: 28/01/2026
Job Role : SRE App Focus
Experience : 6-10 Years
Location : Bengaluru, Hyderabad (Hybrid)
Andersen is hiring a Site Reliability Engineer to drive reliability and performance for large-scale digital insurance platforms, enhancing integrations, optimizing cloud systems, and ensuring stable, high-quality service delivery.
The customer is a well-established global organization providing financial protection and risk-management services across various markets.
With a diverse portfolio and teams operating in multiple regions, the company supports businesses and individuals through reliable, scalable solutions.
The project focuses on enhancing large-scale digital platforms, improving cloud performance, optimizing integrations, and modernizing systems to support efficient service delivery and ongoing expansion.
Responsibilities :
- Ensuring high availability, performance, scalability, and overall reliability of application infrastructure through proactive monitoring, automation, and continuous improvement.
- Developing and implementing performance optimization strategies, including code optimization, memory management, load testing, and capacity planning.
- Implementing and maintaining end-to-end observability, including real-time telemetry, metrics at the Critical User Journey (CUJ) level, dashboards, alerts, and actionable reporting.
- Monitoring CUJs together with product and business teams to improve end-to-end user experience and service reliability.
- Managing SLIs, SLOs, SLAs, and error budgets across critical services while ensuring uptime and availability targets are consistently met.
- Implementing next-generation architectural patterns and SRE recommendations to enhance fault tolerance, resilience, and disaster recovery capabilities.
- Identifying and mitigating reliability risks, proactively addressing issues that may impact availability and minimizing service disruptions.
- Automating key operational tasks such as deployments, scaling, failover, and remediation, and reducing manual toil through tools and process improvements.
- Leading incident response efforts, participating in on-call rotations, and driving automated remediation for common failure scenarios.
- Performing root-cause analysis, conducting blameless post-mortems, and implementing corrective actions to prevent recurring incidents.
- Creating and maintaining comprehensive runbooks, operational documentation, and guidelines for incident response and system reliability.
- Collaborating with global and regional digital teams on reliability best practices, mentoring junior SREs, and contributing to the hiring and onboarding of new SRE candidates.
Must-haves :
- 6+ years of experience in application support and reliability engineering environments.
- Strong technical background with proficiency in software development principles, application production support, SDLC best practices, and Agile methodology.
- Hands-on SRE skills, including familiarity with SLOs, SLIs, error budgets, incident management, and conducting blameless post-mortems.
- Solid understanding of application architectures with the ability to analyze systems and identify areas for improvement.
- Experience working with monitoring, logging, and observability tools to track and optimize application performance.
- Proficiency in scripting and automation tools (e.g., Python, Bash, Terraform) to reduce toil and improve operational efficiency.
- Strong incident response and troubleshooting skills with the ability to perform effective root cause analysis.
- Excellent collaboration and communication skills for working with cross-functional teams and clearly explaining technical concepts.
- Ability to coach and mentor team members in SRE practices and foster a culture of reliability.
- Proactive mindset with a focus on continuous improvement to enhance application reliability and performance.
- English level: Intermediate+ or above.
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1606836