Description :

The Site Reliability Engineer (SRE) is a critical operations role, requiring 6+ years of experience, responsible for managing and maintaining the uptime, reliability, and readiness of NVIDIA's on-prem engineering cloud infrastructure spread across multiple data centers.

This is a full-time position available in Hyderabad, Pune, Chennai, and Bangalore, with a preference for immediate joiners.

The incumbent will focus heavily on guarding Service Level Agreements (SLAs) and implementing robust monitoring, automation, and incident response.

Job Summary :

We are seeking a Senior SRE (6+ years of experience) with mandatory expertise in on-prem infrastructure management and deep knowledge of the observability stack, including Prometheus, Grafana, and the ELK Stack. The ideal candidate will be responsible for maintaining 24/7 uptime for critical engineering services, implementing comprehensive monitoring and alerting systems, and performing Root Cause Analysis (RCA). Core duties include automation using Jenkins, Python, Go, and Bash, managing bare-metal data center machines (via IPMI and Redfish), and actively participating in war rooms during critical incidents.

Key Responsibilities and Technical Deliverables :

Reliability and Service Level Management :

- Guard service level agreements (SLAs) for critical engineering services, implementing operational procedures to ensure adherence to defined performance targets.

- Implement monitoring, alerting, and incident response procedures to detect, diagnose, and mitigate service degradation or failure quickly.

- Perform root cause analysis (RCA) and post-mortems for all incidents involving threshold breaches, driving corrective actions and preventative measures.

- Actively participate in war rooms for critical issues, providing real-time technical expertise and resolution steps.

Observability and Pipeline Management :

- Set up and manage monitoring and logging tools such as Prometheus, Grafana, and the ELK Stack (Elasticsearch, Logstash, Kibana) to oversee system health and performance.

- Maintain Key Performance Indicator (KPI) pipelines using tools such as Jenkins, Python, and ELK, ensuring accurate and timely reporting of service metrics.

- Improve monitoring systems by adding custom alerts based on specific business needs and observed patterns, reducing noise and increasing signal fidelity.

Infrastructure Management, Automation, and DevOps :

- Manage NVIDIA's on-prem infrastructure, maintaining uptime and reliability across multiple data centers.

- Manage bare-metal data center machines using tools such as IPMI, Redfish, and KVM for hardware lifecycle management, provisioning, and remote access.

- Drive automation efforts using infrastructure tools such as Kubernetes for orchestration and Jenkins for pipeline automation, along with languages such as Python, Go, and Bash, to eliminate repetitive manual tasks.

- Assist with capacity planning, optimization, and utilization improvements for the engineering cloud resources.

- Support user-reported issues, actively monitor alerts, and take necessary action during shifts.

- Create and maintain detailed documentation for operational procedures, configurations, and troubleshooting guides.

Mandatory Skills & Qualifications :

- Experience : 6+ years of experience in SRE, DevOps, or highly technical infrastructure roles.

- Infrastructure Focus : Mandatory experience in on-prem infrastructure management and data center operations.

- Observability : Deep expertise in Prometheus, Grafana, and the ELK Stack (Elasticsearch, Logstash, Kibana).

- Automation/Scripting : Proficiency in Jenkins, Python, Go, and Bash.

- Core Tools : Strong skills in Kubernetes and MySQL.

- Bare-metal : Experience with bare-metal data center machine management tools such as IPMI, Redfish, and KVM.

- Process : Experience defining and maintaining SLAs and performing RCA and post-mortems.

Preferred Skills :

- Experience managing high-performance computing (HPC) environments.

- Certification in Kubernetes (CKA) or cloud platforms.