Lead/Senior Site Reliability Engineer - Observability Services

Searce
5 - 12 Years
Multiple Locations

Posted on: 21/04/2026

Job Description

Lead Cloud Reliability Engineer

Job Responsibilities:

- Lead and manage the Cloud Reliability teams to provide strong Managed Services support to end-customers.

- Isolate, troubleshoot and resolve issues reported by CMS clients in their cloud environment

- Drive communication with the customer, providing details about the issue, current steps, next plan of action, and ETA

- Gather client's requirements related to use of specific cloud services and provide assistance in setting them up and resolving issues

- Create SOPs and knowledge articles for use by the L1 teams to resolve common issues

- Identify recurring issues, perform root cause analysis and propose/implement preventive actions

- Follow change management procedure to identify, record and implement changes

- Plan and deploy OS and security patches in Windows/Linux environments and upgrade k8s clusters

- Identify the recurring manual activities and contribute to automation

- Provide technical guidance and educate team members on development and operations. Monitor metrics and develop ways to improve.

- System troubleshooting and problem-solving across platform and application domains. Ability to use a wide variety of open-source technologies and cloud services.

- Build, maintain, and monitor configuration standards.

- Ensure critical system security by using best-in-class cloud security solutions.

Qualifications:

- 4-7 years' experience in Cloud Infrastructure and Operations domains, and IT operational experience, preferably in a global enterprise environment.

- Specialize in one or two cloud deployment platforms: AWS, GCP

- Hands-on experience with AWS/GCP services (EKS, ECS, EC2, VPC, RDS, Lambda, GKE, Compute Engine)

- Understanding of one or more programming languages (Python, JavaScript, Ruby, Java, .Net)

- Experience with logging and monitoring tools (ELK, Stackdriver, CloudWatch)

- Knowledge of Configuration Management tools such as Ansible, Terraform, Puppet, Chef

- Experience working with deployment and orchestration technologies (such as Docker, Kubernetes, Mesos)

- Good analytical, communication, problem solving, and learning skills.

- Knowledge of programming against cloud platforms such as Google Cloud Platform, and of lean development methodologies.

- Strong service attitude and a commitment to quality.

- Willingness to work in shifts.
