Posted on: 03/02/2026
Description :
About the Role :
The Senior Site Reliability Engineer for Cloud Platforms at T-Mobile is a hands-on technical engineer responsible for ensuring the scalability, reliability, performance, and security of enterprise cloud environments across AWS, Azure, and GCP. They blend deep technical expertise in cloud infrastructure, automation, and observability with strong reliability engineering principles and a security-first mindset.
They also utilize their strong problem-solving and analytical skills to automate processes, reducing manual effort and preventing operational incidents. They design and implement resilient, automated cloud systems that support business-critical applications and services, driving excellence in uptime, performance, and operational efficiency.
By continuously learning new skills and technologies, they adapt to changing circumstances and drive innovation. Their work and expertise contribute significantly to the stability and performance of T-Mobile's digital infrastructure. They are also responsible for diagnosing and resolving complex issues across networking, storage, and compute layers, driving continuous improvement through data-driven insights and DevOps best practices.
This engineer is also responsible for contributing to the overall architecture and strategy of technical systems, mentoring junior engineers, and ensuring solutions are aligned with T-Mobile's business and technical goals.
What You'll Do :
- Design, build, and maintain highly available and scalable public cloud environments (AWS, Azure, GCP).
- Design, implement, and manage automated CI/CD deployment pipelines and infrastructure provisioning using Terraform, Ansible, or CloudFormation across AWS, Azure, and GCP.
- Ensure infrastructure and applications are optimized for cost, performance, security, governance, and compliance across cloud environments.
- Lead incident response, root cause analysis (RCA), and post-mortems, implementing automation to prevent recurrence.
- Partner with Cloud Security teams to deliver compliant, resilient architectures aligned with FCC and enterprise standards. Also manage vulnerability scanning, patching, and policy enforcement.
- Implement and manage Identity and Access Management (IAM) policies, roles, and federations for secure access control. Integrate security-by-design into all platform layers: IAM, networking, and secrets management.
- Collaborate with Cloud Network engineers to implement and support VPC/VNet peering, transit gateways, and service endpoints. Additionally, troubleshoot connectivity between cloud, on-prem, and partner networks.
- Monitor, analyze, and enhance system performance, proactively identifying bottlenecks and ensuring reliability through data-driven observability and capacity planning.
- Resolve platform-related customer tickets by diagnosing and addressing infrastructure, deployment, and performance issues to ensure reliability and seamless user experience.
- Troubleshoot complex issues across the full stack - network, storage, compute, and application layers.
- Drive a culture of automation, resilience, and continuous improvement, contributing to the evolution of T-Mobile's platform engineering and cloud infrastructure strategies.
- Drive innovation by recommending new technologies, frameworks, and tools.
- Perform additional duties and strategic projects as assigned.
What You'll Bring :
- Bachelor's degree in Computer Science, Software Engineering, or a related field.
- 5-8 years of proven expertise in multi-cloud environments (AWS, Azure, GCP) and SRE roles.
- Hands-on experience with observability tools for monitoring, alerting, logging, and tracing (e.g., Prometheus, Grafana, ELK, Datadog, Splunk, OpenTelemetry) to maintain reliability and operational excellence.
- Familiarity with CNAPP platforms (e.g., Wiz, Prisma Cloud, Orca) and CSPM/CWPP principles.
- Proven experience implementing IaC using Terraform, CloudFormation, or ARM templates.
- Solid understanding of networking (TCP/IP, DNS, load balancing, VPNs, peering) and cloud security principles.
- Strong analytical thinking and collaborative problem-solving skills.
- Excellent communication and documentation abilities.
Must Have Skills :
- Infrastructure as Code (IaC) & Automation using Terraform, Ansible, CloudFormation, and ARM templates
- Strong hands-on experience in cloud platforms (AWS, Azure, GCP) and deep expertise in at least one major cloud provider.
- Terraform Or Ansible Or ARM Templates
- AWS Or Azure Or GCP
- Prometheus Or Grafana Or Splunk
- Site Reliability Or SRE
- Networking Or TCP/IP Or Load Balancing Or VPN Or Peering Or DNS
- IAM Or CNAPP Or Wiz Or Prisma Or Orca Or CSPM Or CWPP
- Python Or Bash
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1609298