Description :

About the Role :

We are seeking a Datacenter Engineer - Incident Management to ensure the stability, availability, and rapid recovery of mission-critical datacenter infrastructure.

This role focuses on real-time incident detection, troubleshooting, coordination, and resolution across compute, network, storage, power, and cooling systems.

The ideal candidate is highly operational, calm under pressure, and experienced in managing incidents in 24/7 datacenter environments, ensuring minimal downtime and strong SLA adherence.

Key Responsibilities :

- Act as the first point of contact for datacenter incidents and service disruptions.

- Monitor infrastructure alarms, alerts, and system dashboards to proactively identify issues.

- Perform rapid triage, root cause identification, and immediate remediation of incidents.

- Lead incident bridges and coordinate with internal teams and external vendors during outages.

- Escalate incidents promptly according to incident severity and the escalation matrix.

- Drive incident resolution within defined SLAs and OLAs.

- Support day-to-day datacenter operations including servers, networking, storage, and virtualization platforms.

- Perform hands-on troubleshooting of hardware failures (servers, switches, firewalls, storage devices).

- Coordinate break-fix activities, component replacements, and vendor RMA processes.

- Support power, cooling, and environmental systems in collaboration with facilities teams.

- Conduct detailed root cause analysis (RCA) for major incidents.

- Prepare and publish incident reports, timelines, and corrective action plans.

- Identify recurring issues and recommend preventive measures.

- Drive implementation of permanent fixes and process improvements.

- Use monitoring tools (e.g., Nagios, Zabbix, SolarWinds, Datadog, or similar) to detect and analyze incidents.

- Improve alert quality by reducing noise and ensuring actionable notifications.

- Participate in automation initiatives to reduce manual intervention and incident frequency.

- Maintain and update SOPs, runbooks, and incident response playbooks.

- Support change management activities to minimize operational risk.

- Review changes for potential impact on datacenter availability.

- Identify infrastructure risks and proactively suggest mitigation plans.

- Collaborate with problem management teams to resolve long-term issues.

- Ensure adherence to operational standards, safety policies, and compliance requirements.

- Maintain accurate documentation for incidents, resolutions, and infrastructure changes.

- Support audits related to datacenter operations and availability.

- Work closely with NOC, SOC, Network, Server, Cloud, and Facilities teams.

- Communicate incident status, impact, and resolution clearly to stakeholders.

- Participate in on-call rotations and provide 24/7 operational support as required.

Requirements :

Experience & Technical Skills :

- 3 to 6 or more years of experience in datacenter operations, NOC, or incident management roles.

- Strong understanding of datacenter infrastructure :

1. Servers (Dell, HPE, Lenovo, etc.)

2. Networking (routers, switches, firewalls)

3. Storage (SAN, NAS)

4. Virtualization (VMware, Hyper-V, KVM)

- Hands-on experience managing and resolving P1/P2 incidents.

- Familiarity with ITIL-based incident, problem, and change management processes.

Monitoring & Tools :

- Experience with infrastructure monitoring and alerting tools.

- Working knowledge of ticketing systems (ServiceNow, Remedy, Jira Service Management).

- Ability to analyze logs, metrics, and alerts to troubleshoot issues quickly.