Job Role : SRE App Focus

Experience : 6-10 Years

Location : Bengaluru, Hyderabad (Hybrid)

Andersen is hiring a Site Reliability Engineer to drive reliability and performance for large-scale digital insurance platforms, enhancing integrations, optimizing cloud systems, and ensuring stable, high-quality service delivery.

The customer is a well-established global organization providing financial protection and risk-management services across various markets.

With a diverse portfolio and teams operating in multiple regions, the company supports businesses and individuals through reliable, scalable solutions.

The project focuses on enhancing large-scale digital platforms, improving cloud performance, optimizing integrations, and modernizing systems to support efficient service delivery and ongoing expansion.

Responsibilities :

- Ensuring high availability, performance, scalability, and overall reliability of application infrastructure through proactive monitoring, automation, and continuous improvement.

- Developing and implementing performance optimization strategies, including code optimization, memory management, load testing, and capacity planning.

- Implementing and maintaining end-to-end observability, including real-time telemetry, metrics at the Critical User Journey (CUJ) level, dashboards, alerts, and actionable reporting.

- Monitoring CUJs with product and business teams to improve end-to-end user experience and service reliability.

- Managing SLIs, SLOs, SLAs, and error budgets across critical services while ensuring uptime and availability targets are consistently met.

- Implementing next-generation architectural patterns and SRE recommendations to enhance fault tolerance, resilience, and disaster recovery capabilities.

- Identifying and mitigating reliability risks, proactively addressing issues that may impact availability and minimizing service disruptions.

- Automating key operational tasks such as deployments, scaling, failover, and remediation, and reducing manual toil through tools and process improvements.

- Leading incident response efforts, participating in on-call rotations, and driving automated remediation for common failure scenarios.

- Performing root-cause analysis, conducting blameless post-mortems, and implementing corrective actions to prevent recurring incidents.

- Creating and maintaining comprehensive runbooks, operational documentation, and guidelines for incident response and system reliability.

- Collaborating with global and regional digital teams on reliability best practices, mentoring junior SREs, and contributing to the hiring and onboarding of new SRE candidates.

Must-haves :

- 6+ years of experience in application support and reliability engineering environments.

- Strong technical background with proficiency in software development principles, application production support, SDLC best practices, and Agile methodology.

- Hands-on SRE skills, including familiarity with SLOs, SLIs, error budgets, incident management, and conducting blameless post-mortems.

- Solid understanding of application architectures with the ability to analyze systems and identify areas for improvement.

- Experience working with monitoring, logging, and observability tools to track and optimize application performance.

- Proficiency in scripting and automation tools (e.g., Python, Bash, Terraform) to reduce toil and improve operational efficiency.

- Strong incident response and troubleshooting skills with the ability to perform effective root cause analysis.

- Excellent collaboration and communication skills for working with cross-functional teams and clearly explaining technical concepts.

- Ability to coach and mentor team members in SRE practices and foster a culture of reliability.

- Proactive mindset with a focus on continuous improvement to enhance application reliability and performance.

- Level of English - Intermediate+ or above.