Posted on: 21/04/2026
Lead Cloud Reliability Engineer
Job Responsibilities:
- Lead and manage the Cloud Reliability teams to provide strong Managed Services support to end-customers.
- Isolate, troubleshoot and resolve issues reported by CMS clients in their cloud environment
- Drive communication with the customer, providing details about the issue, current steps, next plan of action, and ETA
- Gather clients' requirements related to the use of specific cloud services and provide assistance in setting them up and resolving issues
- Create SOPs and knowledge articles for use by the L1 teams to resolve common issues
- Identify recurring issues, perform root cause analysis and propose/implement preventive actions
- Follow change management procedure to identify, record and implement changes
- Plan and deploy OS and security patches in Windows/Linux environments and upgrade Kubernetes (k8s) clusters
- Identify the recurring manual activities and contribute to automation
- Provide technical guidance and educate team members on development and operations. Monitor metrics and develop ways to improve.
- Troubleshoot systems and solve problems across platform and application domains, using a wide variety of open-source technologies and cloud services.
- Build, maintain, and monitor configuration standards.
- Ensure critical system security by using best-in-class cloud security solutions.
Qualifications:
- 4-7 years of experience in Cloud Infrastructure and Operations domains, with IT operational experience preferably in a global enterprise environment.
- Specialize in one or two cloud deployment platforms: AWS, GCP
- Hands-on experience with AWS/GCP services (EKS, ECS, EC2, VPC, RDS, Lambda, GKE, Compute Engine)
- Understanding of one or more programming languages (Python, JavaScript, Ruby, Java, .NET)
- Experience with logging and monitoring tools (ELK, Stackdriver, CloudWatch)
- Knowledge of configuration management tools such as Ansible, Terraform, Puppet, Chef
- Experience working with deployment and orchestration technologies (such as Docker, Kubernetes, Mesos)
- Good analytical, communication, problem-solving, and learning skills.
- Knowledge of programming against cloud platforms such as Google Cloud Platform, and of lean development methodologies.
- Strong service attitude and a commitment to quality.
- Willingness to work in shifts.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1629898