- Primarily accountable for managing Azure environments.

- Design, implement and maintain highly available, scalable, and resilient infrastructure.

- Identify and eliminate performance bottlenecks and proactively remediate security concerns through monitoring, profiling, and tuning.

- Establish and improve SLOs, SLIs, and error budgets to drive system reliability.

- Collaborate with stakeholders, including application developers, to improve application observability and optimize performance.

- Lead and mentor a team of engineers working to reduce toil across the team's overall workload and to implement security features, roles, user access, and privileges according to best practices.

- Proactively identify, design, and implement process and architectural improvements.

- Stay informed on the latest features and best practices across the Azure Public Cloud and the WBD Azure environment.

- Work with the peer group of public cloud leads (AWS/GCP) to promote consistency in how WBD manages resources wherever possible.

Methodology:

- Automate deployment, monitoring, and self-healing capabilities to improve operational efficiency.

- Develop and manage infrastructure using Terraform and other IaC tools.

- Drive incident response efforts, conduct root cause analyses (RCA), and implement preventative measures to minimize downtime.

- Build and enhance monitoring, alerting, and observability systems to proactively resolve incidents before they impact users.

- Evangelize telemetry and metrics-driven application development.

- Improve on-call processes and reduce toil by automating repetitive tasks.

- Contribute to the software development of cloud management tooling and support applications.

- Develop detailed technical documentation, including runbooks, troubleshooting guides, and system diagrams.

Continuous Improvement:

- Work with stakeholders to ensure systems meet security baselines, best practices, compliance requirements and resiliency standards.

- Implement effective backup strategies and conduct regular disaster recovery testing.

- Implement robust access controls, secrets management, and security monitoring solutions.

- Collaborate with security teams to manage vulnerabilities and respond to threats.

- Engage with our FinOps/CostOps team to optimize cloud costs by implementing efficient resource utilization and right-sizing strategies.

- Work closely with development, infrastructure, and security teams to drive best practices and improvements.

- Mentor junior engineers and contribute to a culture of continuous learning and improvement.

- Participate in architectural discussions and provide guidance on reliability and scalability considerations.

Qualifications & Experience:

- 8+ years of experience in Site Reliability Engineering, DevOps, Cloud Infrastructure, or a related field.

- Expert in Microsoft Azure cloud.

- 5+ years of hands-on experience architecting, building, and managing Azure tenants, management groups, and the overall Azure control plane and its contents.

- Demonstrable experience in Linux/Unix and Windows Server administration, networking, and distributed systems.

- Fluency in two or more programming languages (PowerShell, Python, Golang, JavaScript, etc.).

- Extensive hands-on experience with container and orchestration technologies such as AKS, Kubernetes, and Docker.

- Deep knowledge of monitoring, logging, and observability tools (Prometheus, Grafana, ELK, Splunk, etc.).

- Hands-on experience with Infrastructure-as-Code (IaC) using Terraform and ARM templates.

- Strong background in CI/CD pipelines, GitOps, and infrastructure automation (Terraform, Helm, Ansible or Chef).

Soft Skills:

- Strong problem-solving, troubleshooting, and debugging skills.

- Excellent written and verbal communication and collaboration abilities.

- English language fluency required.

- Ability to handle multiple assignments concurrently.

- Passion for automation, reliability, and continuous improvement.

- Move quickly and intelligently, treating technical debt as your nemesis.

- Ability to solve problems independently while knowing when to request assistance.

Not Required but Preferred Experience:

- Experience with other cloud providers such as AWS, Google Cloud Platform (GCP), and Oracle.

- Knowledge of and passion for the media, entertainment, and technology industries (including key players, growth trends and drivers, new media models, industry structure, etc.).

- Familiarity with streaming and similar products/services.

- Experience working in a national or global company.

- Comfortable working in a highly iterative and somewhat unstructured environment.