Description :
Key Responsibilities :
- Primarily accountable for managing Azure environments.
- Design, implement and maintain highly available, scalable, and resilient infrastructure.
- Identify and eliminate performance bottlenecks, and proactively remediate security concerns, through monitoring, profiling, and tuning.
- Establish and improve SLOs, SLIs, and error budgets to drive system reliability.
- Collaborate with stakeholders, including application developers, to improve application observability and optimize performance.
- Lead and mentor a team of engineers working to reduce toil across the team's workload and to implement security features, roles, user access, and privileges according to best practices.
- Proactively identify, design, and implement process and architectural improvements.
- Stay informed on the latest features and best practices across the Azure Public Cloud and the WBD Azure environment.
- Work with the peer group of public cloud leads (AWS/GCP) to keep WBD's management of resources consistent across clouds wherever possible.
Methodology :
- Automate deployment, monitoring, and self-healing capabilities to improve operational efficiency.
- Develop and manage infrastructure using Terraform and other IaC tools.
- Drive incident response efforts, conduct root cause analyses (RCA), and implement preventative measures to minimize downtime.
- Build and enhance monitoring, alerting, and observability systems to proactively resolve incidents before they impact users.
- Evangelize telemetry and metrics-driven application development.
- Improve on-call processes and reduce toil by automating repetitive tasks.
- Contribute to the software development of cloud management tooling and support applications.
- Develop detailed technical documentation, including runbooks, troubleshooting guides, and system diagrams.
Continuous Improvement :
- Work with stakeholders to ensure systems meet security baselines, best practices, compliance requirements and resiliency standards.
- Implement effective backup strategies and conduct regular disaster recovery testing.
- Implement robust access controls, secrets management, and security monitoring solutions.
- Collaborate with security teams to manage vulnerabilities and respond to threats.
- Engage with our FinOps/CostOps team to optimize cloud costs by implementing efficient resource utilization and right-sizing strategies.
- Work closely with development, infrastructure, and security teams to drive best practices and improvements.
- Mentor junior engineers and contribute to a culture of continuous learning and improvement.
- Participate in architectural discussions and provide guidance on reliability and scalability considerations.
Qualifications & Experiences :
- 8+ years of experience in Site Reliability Engineering, DevOps, Cloud Infrastructure, or a related field.
- Expert-level knowledge of Microsoft Azure.
- At least 5 years of hands-on experience architecting, building, and managing Azure tenants, management groups, and the overall Azure control plane and its contents.
- Demonstrable experience in Linux/Unix and Windows Server administration, networking, and distributed systems.
- Fluency in two or more programming languages (PowerShell, Python, Golang, JavaScript, etc.).
- Extensive hands-on experience with container and orchestration technologies such as Docker, Kubernetes, and AKS.
- Deep knowledge of monitoring, logging, and observability tools (Prometheus, Grafana, ELK, Splunk, etc.).
- Hands-on experience with Infrastructure-as-Code (IaC) using Terraform and ARM templates.
- Strong background in CI/CD pipelines, GitOps, and infrastructure automation (Terraform, Helm, Ansible or Chef).
Soft Skills :
- Strong problem-solving, troubleshooting, and debugging skills.
- Excellent written and verbal communication and collaboration abilities.
- English language fluency required.
- Ability to handle multiple assignments concurrently.
- Passion for automation, reliability, and continuous improvement.
- Ability to move quickly and intelligently, treating technical debt as your nemesis.
- Ability to solve problems independently while knowing when to request assistance.
Not Required but Preferred Experience :
- Experience with other cloud providers such as AWS, Google Cloud Platform (GCP), or Oracle.
- Knowledge of and passion for the media, entertainment, and technology industries (including key players, growth trends and drivers, new media models, industry structure, etc.).
- Familiarity with streaming and similar products/services.
- Experience working in a national or global company.
- Comfortable working in a highly iterative and somewhat unstructured environment.