
Job Description

Position Overview :


We are seeking a highly skilled Site Reliability Engineer (SRE) with at least 7 years of experience to join our team. The ideal candidate will have extensive expertise in production support, Python/Shell scripting, Kubernetes, Docker, and SRE monitoring tools such as Datadog, Prometheus, and Dynatrace. This role focuses on ensuring the reliability, scalability, and performance of our systems while supporting critical production environments.

Key Responsibilities :


- Provide production support for mission-critical applications, ensuring high availability and rapid issue resolution.


- Develop and maintain automation scripts using Python and Shell scripting to streamline operations and improve system efficiency.

- Manage and deploy containerized applications using Kubernetes and Docker, ensuring seamless orchestration and scalability.

- Implement and manage SRE monitoring tools (Datadog, Prometheus, Dynatrace) to proactively track system health and performance and to detect incidents early.

- Collaborate with development and operations teams to design and implement reliable, scalable infrastructure.

- Perform root cause analysis (RCA) for production incidents and implement preventive measures.

- Optimize system performance, reduce latency, and improve fault tolerance.

- Participate in the on-call rotation for 24/7 production support.

Required Skills :

- Experience : Minimum 7 years of relevant experience in Site Reliability Engineering, DevOps, or production support roles.

- Proven expertise in production support, including incident management, troubleshooting, and resolution in high-availability environments.

- Strong programming skills in Python and Shell scripting for automation and tooling.

- Hands-on experience with Kubernetes for container orchestration and Docker for containerization.

- Proficiency in SRE monitoring tools such as Datadog, Prometheus, and Dynatrace for observability and performance monitoring.

- Solid understanding of cloud infrastructure (AWS, Azure, or GCP) and CI/CD pipelines.

- Excellent problem-solving skills and ability to work under pressure in fast-paced environments.

- Strong communication skills and ability to collaborate with cross-functional teams.

Preferred Qualifications :


- Experience with Infrastructure as Code (IaC) tools like Terraform or Ansible.

- Familiarity with additional monitoring tools or log management platforms (e.g., ELK Stack, Splunk).

- Certifications in Kubernetes (CKA/CKAD), cloud platforms, or SRE practices.

