Posted on: 14/12/2025
Description :
Location: Pan India (Except Mumbai)
About the Role :
We are seeking an experienced Site Reliability Engineer (SRE) to ensure the dependability, scalability, and performance of enterprise-scale applications.
The ideal candidate will work closely with DevOps, infrastructure, and application teams to drive operational excellence, implement automation, enhance system reliability, and improve observability.
This role requires a strong understanding of modern monitoring frameworks, distributed systems, and performance engineering, along with hands-on experience in building reliable, resilient, and scalable platforms.
Key Responsibilities :
Reliability & Performance :
- Collaborate with DevOps teams, sharing responsibility for application reliability, deployment stability, and production health.
- Automate repetitive tasks, optimize workflows, and implement best practices to reduce operational overhead.
- Drive automation initiatives that enhance deployment pipelines, reduce manual interventions, and improve environment consistency.
Monitoring & Observability :
- Design and implement robust monitoring, logging, and alerting solutions to track application performance, SLO/SLA adherence, and operational health.
- Build dashboards, service maps, and distributed tracing views to ensure full system observability.
- Respond to alerts, system events, and production incidents as part of on-call responsibilities.
Collaboration & Process Improvement :
- Work closely with infrastructure, platform, and application SMEs to promote a reliability-focused engineering culture.
- Support continuous improvement programs aimed at reducing downtime, boosting resiliency, and strengthening incident response.
- Share best practices, contribute to technical documentation, and mentor junior engineers where necessary.
Technical Skills Required :
Monitoring & APM (Application Performance Monitoring) :
Strong experience with New Relic :
- Good understanding of Spring Boot applications and their performance tuning.
- Expertise in JVM parameters, thread management, memory tuning, and application configuration optimization.
Logging & Analytics :
Strong knowledge of Splunk :
- Writing advanced queries
Hands-on experience with :
- Bitbucket, CloudBees, AWS Cloud, and CI/CD pipeline setup
- Automated build, test, and deployment pipelines
Distributed Tracing :
- Familiarity with tracing frameworks such as Jaeger, OpenTelemetry, or similar tools.
What We Are Looking For :
- Strong problem-solving skills with the ability to handle complex, distributed systems.
- Excellent communication and collaboration skills to work cross-functionally.
- A mindset focused on automation, system robustness, and long-term reliability.
- Ability to work in fast-paced environments with on-call responsibilities.
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1589622