Posted on: 27/11/2025
Position Overview:
- Develop and maintain automation scripts using Python and Shell scripting to streamline operations and improve system efficiency.
- Manage and deploy containerized applications using Kubernetes and Docker, ensuring seamless orchestration and scalability.
- Implement and manage SRE monitoring tools (Datadog, Prometheus, Dynatrace) to proactively monitor system health, performance, and incidents.
- Collaborate with development and operations teams to design and implement reliable, scalable infrastructure.
- Perform root cause analysis (RCA) for production incidents and implement preventive measures.
- Optimize system performance, reduce latency, and improve fault tolerance.
- Participate in the on-call rotation for 24/7 production support.
Required Skills:
- Experience: Minimum 7 years of relevant experience in Site Reliability Engineering, DevOps, or production support roles.
- Proven expertise in production support, including incident management, troubleshooting, and resolution in high-availability environments.
- Strong programming skills in Python and Shell scripting for automation and tooling.
- Hands-on experience with Kubernetes for container orchestration and Docker for containerization.
- Proficiency in SRE monitoring tools such as Datadog, Prometheus, and Dynatrace for observability and performance monitoring.
- Solid understanding of cloud infrastructure (AWS, Azure, or GCP) and CI/CD pipelines.
- Excellent problem-solving skills and ability to work under pressure in fast-paced environments.
- Strong communication skills and ability to collaborate with cross-functional teams.
Preferred Qualifications:
- Familiarity with additional monitoring tools or log management platforms (e.g., ELK Stack, Splunk).
- Certifications in Kubernetes (CKA/CKAD), cloud platforms, or SRE practices.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1581631