Posted on: 01/04/2026
Description :
Site Reliability Engineer : (We are hiring across Xebia locations: Gurugram, Hyderabad, Bhopal, Chennai, Bengaluru, Jaipur, and Pune.)
Ready to take ownership of world-class cloud reliability and engineering excellence? If you love building and optimizing cloud platforms, solving complex distributed-system challenges, and empowering engineering teams through automation and reliability practices, this role gives you the scope and autonomy to make a real technical impact.
What You'll Be Doing :
As a Site Reliability Engineer, you'll help shape and strengthen cloud infrastructure, reliability engineering practices, and operational excellence. You'll work hands-on across AWS, container orchestration, observability platforms, and CI/CD ecosystems to ensure our systems are resilient, secure, and optimized for scale.
Responsibilities :
- Drive architectural and technical decision-making, ensuring infrastructure and platform designs support long-term scalability, reliability, and security.
- Partner with Delivery to plan and prioritize platform and infrastructure work for maximum technical and operational impact.
- Mentor engineers and uplift technical capability, championing strong engineering practices and continuous improvement.
- Shape technical strategy by contributing to architectural roadmaps, standards, and patterns that balance innovation with long-term risk and resilience.
- Embed quality, security, performance, and compliance into all engineering designs, processes, and operational workflows, ensuring reliability at scale.
Interested? Here's what you'll need to be successful :
- 5+ years of experience in a DevOps or SRE role, ideally within AWS-based environments.
- Strong proficiency with AWS CDK and Infrastructure as Code to deploy and optimize cloud infrastructure.
- Hands-on experience with Docker and container orchestration such as Kubernetes (EKS) or Amazon ECS.
- Proven experience building and maintaining CI/CD pipelines using GitLab, Jenkins, or similar tooling.
- Deep knowledge of monitoring, observability, and logging tools such as Prometheus, Grafana, AppDynamics, and OpenSearch.
- Proficiency in Python, TypeScript, or Java for building automation, tooling, and reliability improvements.
- Solid understanding of cloud security, including WAF, patching, vulnerability management, and AWS Shield.
- Working knowledge of message queues and streaming technologies such as RabbitMQ, Kinesis, or Kafka.
- Strong analytical and operational problem-solving skills, with the ability to identify performance constraints, eliminate single points of failure, and scale distributed systems.
- Experience participating in incident response, including root-cause analysis and driving long-term reliability improvements.
- Excellent communication and collaboration skills to work effectively across architecture, delivery, and engineering teams.
Focus on :
- AWS CDK, preferably with TypeScript.
- Monitoring stack with OpenTelemetry.
- API gateway technology.
- OS: Linux skills.
- CI/CD using Jenkins and GitLab.
Posted by
Eshita Porwal
NA at Xebia IT Architects India Pvt Ltd
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1625285