Posted on: 09/01/2026
Key Responsibilities:
- Manage and maintain Kubernetes deployments, including debugging, troubleshooting, and scaling applications.
- Monitor systems and applications using Prometheus and Grafana, creating dashboards and alerts for proactive issue detection.
- Handle production incidents with minimal supervision and ensure high system reliability and availability.
- Write automation scripts using Shell, Python, or Ansible to improve operational efficiency.
- Collaborate with development and operations teams to optimize CI/CD pipelines and implement DevOps best practices.
- Provide support for cloud-native environments, with exposure to AWS, Azure, or OpenStack.
- Participate in the setup, maintenance, and troubleshooting of Argo / Argo Workflows.
- Ensure adherence to SRE best practices, including monitoring, alerting, and capacity planning.
- Document system configurations, incident reports, and operational procedures.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
- 3-5 years of experience in a DevOps, SRE, or similar support engineering role.
- Strong expertise in Linux system administration.
- Hands-on experience with Kubernetes deployments and troubleshooting.
- Proficiency in Prometheus and Grafana for monitoring and dashboards.
- Strong scripting and automation skills (Shell, Python, Ansible).
- Exposure to Argo / Argo Workflows is mandatory.
- Basic knowledge of cloud computing environments and networking fundamentals.
- Ability to work independently and manage production incidents effectively.
- Excellent troubleshooting, analytical, and problem-solving skills.
Preferred Qualifications:
- Kubernetes certification (CKA or CKAD).
- Experience with CI/CD pipelines and DevOps automation.
- Exposure to cloud providers such as AWS, Azure, or OpenStack.
- Strong understanding of networking fundamentals in cloud-native environments.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1599385