Posted on: 28/11/2025
Description :
Site Reliability Engineer (SRE)
Type : Full time
Job Description :
- Design, implement, and maintain Infrastructure as Code (IaC) to ensure consistency, scalability, and repeatability.
- Streamline release and deployment workflows, ensuring smooth, structured, and predictable releases.
- Enhance observability, tracing, and monitoring across systems for improved performance and reliability.
- Drive automation initiatives across infrastructure, access management, and database operations.
- Optimize cloud resource utilization and costs, ensuring efficient use of compute, storage, and database resources.
- Improve and maintain CI/CD pipelines, ensuring faster, safer, and more reliable deployments.
- Keep production systems stable and troubleshoot issues that arise in production deployments.
- Provide 24x7 coverage as part of a scheduled shift and on-call rotation.
- Work with tools such as Prometheus, Grafana, and Jira to monitor, manage, triage, and document infrastructure issues in real time.
- Automate infrastructure deployment using CI/CD.
- Build the tools needed to evolve how we maintain and monitor our solution.
- Develop and execute system and integration test plans.
- Collaborate closely with engineering teams to ensure infrastructure supports evolving application and data needs.
- Collaborate with product engineering teams to design and build the infrastructure their services run on.
- Keep our Kubernetes clusters on AWS EKS healthy, secure, and ready to scale.
- Design and deliver resilience strategies that cover multi-region architecture, backups, disaster recovery, and failover.
- Automate infrastructure with Terraform and Infrastructure-as-Code, reducing manual effort and human error.
- Help teams ship faster by improving CI/CD pipelines and deployment practices.
- Monitor performance and reliability using modern observability tools.
- Support on-call rotations and lead incident response with a focus on long-term fixes.
Requirements :
- At least 5 years of experience managing production systems.
- Self-starter with a solution-oriented mindset who sees potential challenges as opportunities to learn and grow.
- Experience with cloud providers (AWS, Azure, or GCP).
- Experience with computer networking and network technologies.
- Experience with CI/CD tools such as Concourse CI or Jenkins.
- Experience with Kubernetes.
- Excellent problem-solving skills and the ability to quickly grasp new concepts.
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1581914