
Job Description

We are seeking a strategic and operationally strong Command Center / Site Reliability Manager to lead our global incident response and network operations functions.


This leadership role is responsible for driving operational excellence, leading a high-performing team, and ensuring the resilience and reliability of our production systems and services.


You will lead the team responsible for 24x7 incident detection, escalation, communication, and resolution of critical service outages while overseeing real-time monitoring and triage of infrastructure and application health.


Responsibilities:


- Lead end-to-end management of Critical Service Outages (P0/P1 incidents), driving timely resolution through coordinated incident response, effective communication with stakeholders, and robust post-incident reviews with actionable remediation.

- Oversee a 24x7 Network Operations Center (NOC), implementing scalable observability, alerting, and monitoring strategies to ensure infrastructure, application, and network reliability.


- Continuously optimize alert triage, diagnostics, and noise reduction to boost efficiency.

- Build and develop a high-performing team of incident managers, NOC engineers, and shift leads.


- Foster operational maturity through training, performance management, and close collaboration with Engineering, SRE, DevOps, and Product teams.

- Define and uphold standards for incident SLAs, escalation processes, runbooks, and playbooks, while ensuring continuous shift coverage, smooth handoffs, and comprehensive KPI reporting on system health and incident trends.


Requirements:


- 6+ years of experience in Technical Operations, Site Reliability, NOC, or Incident Management roles.

- 2+ years in a people management or team leadership role.

- Deep knowledge of major incident management, escalation practices, and real-time service recovery strategies.

- Strong technical understanding of cloud-native architectures (AWS, Azure, GCP), infrastructure monitoring, and DevOps practices.

- Proven experience working with observability tools (e.g., Datadog, Splunk, Grafana, Prometheus), incident tools (e.g., PagerDuty), and ITSM platforms (e.g., ServiceNow, Jira).

- Prior experience supporting high-availability SaaS or telecommunications systems is a strong plus.

- Experience with customer-facing incident communication practices.

