About Roku :

The #1 platform for streaming television, Roku wants to revolutionize the way the world watches TV.

Our Roku-branded TVs, Roku TV models, Smart Home system, streaming players, audio equipment, and the purpose-built operating system that powers it all can turn any home into a home theater, with seamless integration of hardware and software.

Our commitment to our users extends to our brand studio, which creates innovative Roku Originals exclusively for The Roku Channel, a free channel that reaches approximately 80 million households in the U.S. and Mexico.

Join us, and you'll have the chance to delight millions of TV streamers around the world while gaining meaningful experience across a variety of disciplines.

Job Description :

We are seeking a talented and experienced DevOps/SRE (Site Reliability Engineering) Team Lead to join our dynamic team.

The ideal candidate will have a strong background in DevOps practices, cloud infrastructure management, and automation, along with proven team leadership skills.

If you have a consistent track record of architecting and building large-scale systems and enjoy solving intriguing system challenges at internet scale, and if you are innovative at heart, strike a good balance between learning, organizing, and building, and enjoy making an impact, this role might be a great fit for you!

What you will be doing :

- Provide leadership and guidance to a team of DevOps/SRE engineers, fostering a collaborative and high-performing work environment.

- Mentor team members in best practices, technologies, and methodologies.

- Oversee the design, implementation, and maintenance of scalable and resilient cloud infrastructure on platforms spanning AWS and GCP.

- Ensure high availability, reliability, and performance of critical systems.

- Share responsibility with your peers for the entire software lifecycle, seek out the right problems to solve, and strive for excellence.

- Manage individual project priorities, deadlines, and deliverables related to your technical expertise and assigned domains.

- Lead incident response efforts, working closely with cross-functional teams to resolve issues quickly and minimize downtime.

- Implement effective incident management processes and post-incident reviews.

- Collaborate with security teams to ensure the integrity and security of infrastructure and applications.

- Implement security best practices and compliance standards.

- Identify performance bottlenecks and optimize system resources for maximum efficiency.

- Conduct regular performance tuning and capacity planning exercises.

- Drive continuous improvement initiatives within the team and across the organization.

- Proactively identify areas for enhancement and implement solutions to address them.

- Maintain comprehensive documentation of systems, processes, and procedures.

- Foster a culture of knowledge sharing and contribute to the collective learning of the team.

- Participate in a 24x7 on-call rotation, and be available to work with global teams in the event of critical outages.

We're excited if you have :

- Experience with several of the following: ECS, Docker, Kubernetes, Envoy, Istio.

- Experience with infrastructure as code (IaC) tools such as Terraform, Ansible, or CloudFormation.

- Strong understanding of distributed systems, microservices architecture, and cloud-native technologies.

- The drive and self-motivation to understand the intricate details of a complex infrastructure environment.

- 10+ years of experience in DevOps/SRE roles, with at least 2 years in a leadership capacity.

- Strong proficiency in cloud platforms such as AWS, Azure, or GCP.

- Solid understanding of networking, security, and compliance principles.

- Proven track record of driving results and delivering high-quality solutions in a fast-paced environment.

- Demonstrated ability to communicate clearly with both technical and non-technical project stakeholders, with the ability to work effectively in a cross-functional team environment.

- BS degree in Computer Science or equivalent.

- Certifications in relevant technologies such as Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer, or Certified Information Systems Security Professional (CISSP).

- Certified Scrum Master certification is a plus.