Posted on: 07/10/2025
About Roku :
The #1 platform for streaming television, Roku wants to revolutionize the way the world watches TV.
Our Roku-branded TVs, Roku TV models, Smart Home system, streaming players, audio equipment, and the purpose-built operating system that powers it all can turn any home into a home theater, with seamless integration of hardware and software.
Our commitment to our users extends to our brand studio, which creates innovative Roku Originals exclusively for The Roku Channel, a free channel that reaches approximately 80 million households in the U.S. and Mexico.
Join us, and you'll have the chance to delight millions of TV streamers around the world while gaining meaningful experience across a variety of disciplines.
Job Description :
We are seeking a talented and experienced DevOps/SRE (Site Reliability Engineering) Team Lead to join our dynamic team.
The ideal candidate will have a strong background in DevOps practices, cloud infrastructure management, automation, and team leadership.
If you have a consistent track record of architecting and building large-scale systems and enjoy solving intriguing system challenges at internet scale, and if you are innovative at heart, balance learning, organizing, and building, and enjoy making an impact, this role might be a great fit for you!
What you will be doing :
- Provide leadership and guidance to a team of DevOps/SRE engineers, fostering a collaborative and high-performing work environment.
- Mentor team members in best practices, technologies, and methodologies.
- Oversee the design, implementation, and maintenance of scalable and resilient cloud infrastructure on platforms spanning AWS and GCP.
- Ensure high availability, reliability, and performance of critical systems.
- Share responsibility with your peers for the entire software lifecycle, seek out the right problems to solve, and strive for excellence.
- Manage individual project priorities, deadlines, and deliverables related to your technical expertise and assigned domains.
- Lead incident response efforts, working closely with cross-functional teams to resolve issues quickly and minimize downtime.
- Implement effective incident management processes and post-incident reviews.
- Collaborate with security teams to ensure the integrity and security of infrastructure and applications.
- Implement security best practices and compliance standards.
- Identify performance bottlenecks and optimize system resources for maximum efficiency.
- Conduct regular performance tuning and capacity planning exercises.
- Drive continuous improvement initiatives within the team and across the organization.
- Proactively identify areas for enhancement and implement solutions to address them.
- Maintain comprehensive documentation of systems, processes, and procedures.
- Foster a culture of knowledge sharing and contribute to the collective learning of the team.
- Participate in a 24x7 on-call rotation, and be available to work with global teams in the event of critical outages.
We're excited if you have :
- Experience with several of the following: ECS, Docker, Kubernetes, Envoy, Istio.
- Experience with infrastructure as code (IaC) tools such as Terraform, Ansible, or CloudFormation.
- Strong understanding of distributed systems, microservices architecture, and cloud-native technologies.
- The drive and self-motivation to understand the intricate details of a complex infrastructure environment.
- 10+ years of experience in DevOps/SRE roles, with at least 2 years in a leadership capacity.
- Strong proficiency in cloud platforms such as AWS, Azure, or GCP.
- Solid understanding of networking, security, and compliance principles.
- Proven track record of driving results and delivering high-quality solutions in a fast-paced environment.
- Demonstrated ability to communicate clearly with both technical and non-technical project stakeholders, with the ability to work effectively in a cross-functional team environment.
- BS degree in Computer Science or equivalent.
- Certifications in relevant technologies such as Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer, or Certified Information Systems Security Professional (CISSP).
- Certified Scrum Master is a plus.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1556722