Site Reliability Engineer - Docker/Kubernetes

ARMPL
Multiple Locations
5 - 13 Years

Posted on: 28/11/2025

Job Description

Site Reliability Engineer (SRE)

Type : Full time


Job Description :


- Design, implement, and maintain Infrastructure as Code (IaC) to ensure consistency, scalability, and repeatability.

- Streamline release and deployment workflows, ensuring smooth, structured, and predictable releases.

- Enhance observability, tracing, and monitoring across systems for improved performance and reliability.

- Drive automation initiatives across infrastructure, access management, and database operations.

- Optimize cloud resource utilization and costs, ensuring efficient use of compute, storage, and database resources.

- Improve and maintain CI/CD pipelines, ensuring faster, safer, and more reliable deployments.

- Keep production systems stable and troubleshoot issues that arise in production deployments.

- Provide 24x7 coverage as part of a scheduled shift and on-call rotation.

- Work with tools such as Prometheus, Grafana, and Jira to monitor, manage, triage, and document infrastructure issues in real time.

- Automate infrastructure deployment using CI/CD.

- Build the tools needed to evolve how we maintain and monitor our solution.

- Develop and execute system and integration test plans.

- Collaborate closely with engineering teams to ensure infrastructure supports evolving application and data needs.

- Collaborate with product engineering teams to design and build the infrastructure their services run on.

- Keep our Kubernetes clusters on AWS EKS running smoothly, secure, and ready to scale.

- Design and deliver resilience strategies that cover multi-region architecture, backups, disaster recovery, and failover.

- Automate infrastructure with Terraform and Infrastructure-as-Code, reducing manual effort and human error.

- Help teams ship faster by improving CI/CD pipelines and deployment practices.

- Monitor performance and reliability using modern observability tools.

- Support on-call rotations and lead incident response with a focus on long-term fixes.

Requirements :

- At least 5 years of experience managing production systems.

- A self-starter with a solution-oriented mindset who sees potential challenges as opportunities to learn and grow.

- Experience with cloud providers (AWS, Azure, or GCP).

- Experience with computer networking and network technologies.

- Experience with CI/CD tools such as Concourse CI or Jenkins.

- Experience with Kubernetes.

- Excellent problem-solving skills and the ability to quickly grasp new concepts.

