We are seeking an experienced Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of enterprise-scale applications.

The ideal candidate will work closely with DevOps, infrastructure, and application teams to drive operational excellence, implement automation, enhance system reliability, and improve observability.

This role requires a strong understanding of modern monitoring frameworks, distributed systems, and performance engineering, along with hands-on experience building reliable, resilient, and scalable platforms.

Key Responsibilities :

Reliability & Performance :

- Ensure the availability, stability, security, and scalability of large-scale applications and platforms.

- Develop and implement processes to maintain high system uptime through performance tuning, capacity planning, system validations, and proactive maintenance.

- Lead and support incident management, including root cause analysis, corrective measures, and long-term reliability improvements.

DevOps & Automation :

- Collaborate with DevOps teams, sharing responsibility for application reliability, deployment stability, and production health.

- Automate repetitive tasks, optimize workflows, and implement best practices to reduce operational overhead.

- Drive automation initiatives that enhance deployment pipelines, reduce manual interventions, and improve environment consistency.

Monitoring & Observability :

- Design and implement robust monitoring, logging, and alerting solutions to track application performance, SLO/SLA adherence, and operational health.

- Build dashboards, service maps, and distributed tracing views to ensure full system observability.

- Respond to alerts, system events, and production incidents as part of on-call responsibilities.

Collaboration & Process Improvement :

- Work closely with infrastructure, platform, and application SMEs to promote a reliability-focused engineering culture.

- Support continuous improvement programs aimed at reducing downtime, boosting resiliency, and strengthening incident response.

- Share best practices, contribute to technical documentation, and mentor junior engineers where necessary.

Technical Skills Required :

Monitoring & APM (Application Performance Monitoring) :

Strong experience with New Relic :

- Service maps, distributed tracing, dashboards, custom events, NRQL queries

- JVM performance monitoring (heap, thread pools, garbage collection)

- Setting up alerts and performance indicators

Application Performance & JVM Tuning :

- Good understanding of Spring Boot applications and how to tune their performance.

- Expertise in JVM parameters, thread management, memory tuning, and application configuration optimization.

Logging & Analytics :

Strong knowledge of Splunk :

- Writing advanced queries

- Creating dashboards

- Log analytics and troubleshooting

DevOps & CI/CD :

Hands-on experience with :

- Bitbucket, CloudBees, AWS, and CI/CD pipeline setup

- Automated build, test, and deployment pipelines

Distributed Tracing :

- Familiarity with tracing frameworks such as Jaeger, OpenTelemetry, or similar tools.

What We Are Looking For :

- Strong problem-solving skills with the ability to troubleshoot complex distributed systems.

- Excellent communication and collaboration skills to work cross-functionally.

- A mindset focused on automation, system robustness, and long-term reliability.

- Ability to work in fast-paced environments with on-call responsibilities.