We're looking for a highly skilled and self-driven Site Reliability Engineer (SRE-2) to join our team in Hyderabad. This is a full-time, work-from-office role (5 days a week) perfect for someone with 8-12 years of experience who thrives on challenges and is passionate about building robust, scalable, and highly available systems.

You'll play a crucial role in ensuring the reliability, performance, and efficiency of our critical infrastructure and applications, with a particular focus on Kubernetes, DevOps, and observability. If you have hands-on experience with ML applications, GPU optimization, and Big Data systems, you'll be an ideal fit.

Key Responsibilities:

As a Site Reliability Engineer (SRE-2), you will:

- Design, deploy, and manage highly available and scalable Kubernetes clusters and robust DevOps pipelines.

- Troubleshoot and resolve complex infrastructure and application issues across various environments.

- Implement, maintain, and enhance comprehensive observability solutions, with a strong emphasis on Thanos and related monitoring and alerting tools.

- Provide expert support for machine learning (ML) workflows, leveraging tools like MLflow and Kubeflow.

- Optimize applications to maximize performance in GPU-accelerated environments.

- Work as an individual contributor on projects, and proactively learn and adopt new technologies to stay ahead of industry trends.

- Automate repetitive tasks and streamline operational processes using a diverse set of scripting and automation tools including Python, Ansible, Groovy, and Shell scripting.

Qualifications:

To be successful in this role, you should have:

- Strong, hands-on experience with Kubernetes and a deep understanding of core DevOps principles and tools.

- Proven expertise in observability and monitoring solutions, with a strong preference for experience with Thanos.

- Demonstrable experience working with ML platforms and optimizing applications for GPU-based environments.

- CKS (Certified Kubernetes Security Specialist) certification (preferred).

- Experience with Big Data systems is a significant plus.

- Proficiency in multiple scripting languages and automation tools: Python, Ansible, Groovy, and Shell scripting.

- Hands-on experience with CI/CD tools such as Jenkins, Ansible, and ArgoCD.