Posted on: 24/07/2025
About the Job:
We're looking for a highly skilled and self-driven Site Reliability Engineer (SRE-2) to join our team in Hyderabad. This is a full-time, work-from-office role (5 days a week) perfect for someone with 8-12 years of experience who thrives on challenges and is passionate about building robust, scalable, and highly available systems.
You'll play a crucial role in ensuring the reliability, performance, and efficiency of our critical infrastructure and applications, with a particular focus on Kubernetes, DevOps, and observability. If you have hands-on experience with ML applications, GPU optimization, and Big Data systems, you'll be an ideal fit.
Key Responsibilities:
As a Site Reliability Engineer (SRE-2), you will:
- Design, deploy, and manage highly available and scalable Kubernetes clusters and robust DevOps pipelines.
- Troubleshoot and resolve complex infrastructure and application issues across various environments.
- Implement, maintain, and enhance comprehensive observability solutions, with a strong emphasis on Thanos and related monitoring and alerting tools.
- Provide expert support for machine learning (ML) workflows, leveraging tools like MLflow and Kubeflow.
- Optimize applications to maximize performance in GPU-accelerated environments.
- Contribute individually to projects and proactively learn and adopt new technologies to stay ahead of industry trends.
- Automate repetitive tasks and streamline operational processes using a diverse set of scripting and automation tools including Python, Ansible, Groovy, and Shell scripting.
Qualifications:
To be successful in this role, you should have:
- Strong, hands-on experience with Kubernetes and a deep understanding of core DevOps principles and tools.
- Proven expertise in observability and monitoring solutions, with a strong preference for experience with Thanos.
- Demonstrable experience working with ML platforms and optimizing applications for GPU-based environments.
- CKS (Certified Kubernetes Security Specialist) certification (preferred).
- Experience with Big Data systems (a significant plus).
- Proficiency in multiple scripting and automation tools: Python, Ansible, Groovy, and Shell scripting.
- Hands-on experience with CI/CD and deployment automation tools such as Jenkins, Ansible, and ArgoCD.
Posted by: Rohit Nanduri, Senior Talent Acquisition Consultant at COFFEEBEANS CONSULTING LLP
Last Active: 25 Jul 2025
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1518705