Job Description

Role Summary :

We are seeking an experienced Manager, Site Reliability Engineering (SRE), to lead a high-impact team responsible for the stability, availability, reliability, security, and performance of large-scale distributed systems. This role combines technical leadership, people management, and operational excellence, and drives automation, resilience, and engineering productivity across platforms and services.

You will lead initiatives across infrastructure reliability, application operations, service management, security posture, and engineering enablement, while partnering with product, engineering, and vendor teams to deliver highly reliable, scalable, and secure services.

Key Responsibilities :

SRE Leadership & Team Management :

- Build and lead high-performing SRE teams through strong organizational design, mentoring, and coaching

- Foster a culture of reliability engineering, automation-first thinking, and continuous improvement

- Provide technical guidance and subject matter expertise to team members

- Drive skills development, training programs, and career growth for SRE engineers

Reliability Engineering & Operations :

- Lead the design, implementation, and operation of reliable, scalable distributed systems

- Define and manage SLOs, SLIs, SLAs, error budgets, and reliability metrics

- Drive incident management, root cause analysis (RCA), and post-incident reviews

- Improve platform reliability, performance, resilience, and service availability

- Lead capacity planning, performance optimization, and service efficiency initiatives

Automation, Tooling & Platform Engineering :

- Drive self-service tooling, internal platforms, and engineering productivity frameworks

- Implement automation for infrastructure provisioning, monitoring, deployments, and operations

- Standardize tooling, processes, and operational frameworks across teams

- Improve CI/CD, observability, and operational workflows

Security & Risk Management :

- Lead cybersecurity initiatives, including :

  - Vulnerability management and remediation

  - Secure configuration and hardening

  - Security testing and validation

  - Incident response and remediation

- Partner with security teams to implement technical controls and compliance standards

Cross-Functional Collaboration :

- Partner with engineering, product, architecture, and vendor teams to :

  - Ensure reliable software releases

  - Improve production quality and engineering productivity

  - Drive operational excellence and service maturity

- Support release management, deployment strategies, and production readiness

Governance & Operational Excellence :

- Drive change management, risk management, and service governance programs

- Develop and optimize processes, procedures, and operational frameworks

- Support departmental planning, budgeting, and operational efficiency initiatives

Required Skills & Experience :

- 10+ years of experience in SRE, DevOps, Infrastructure Engineering, or Platform Engineering

- Strong expertise in :

  - Distributed systems architecture

  - Cloud platforms (AWS/Azure/GCP or hybrid environments)

  - Linux/Unix systems

  - Networking fundamentals

- Deep understanding of :

  - Observability (monitoring, logging, tracing, alerting)

  - Incident management and reliability engineering practices

  - CI/CD pipelines and automation

  - Infrastructure as Code (IaC)

- Strong background in systems design, performance tuning, and scalability

- Experience managing production systems at scale

- Proven people leadership and team management experience

Technical Stack (Typical Environment) :

- Cloud : AWS / Azure / GCP

- Containers & Orchestration : Docker, Kubernetes

- CI/CD : Jenkins, GitHub Actions, GitLab CI

- IaC : Terraform, CloudFormation, Ansible

- Observability : Prometheus, Grafana, ELK, Datadog, Splunk

- Scripting/Programming : Python, Go, Bash

- Version Control : Git

Leadership & Behavioral Competencies :

- Strong decision-making and judgment

- Excellent stakeholder management and communication

- Ability to operate in high-pressure, production-critical environments

- Strategic mindset with hands-on technical depth

- Strong collaboration and influence skills

Work Requirements :

- Participation in on-call leadership rotations

- Ability to support 24/7 production environments

- Flexible scheduling, including nights/weekends when required for critical operations

Why This Role Matters :

This role is critical to platform reliability, service resilience, engineering velocity, and customer trust. You will shape the reliability strategy, define operational excellence standards, and build the foundation for scalable, secure, and high-performing digital platforms.
