Job Description

About the Role :

We are seeking an experienced DevOps Engineer to join our infrastructure team, with a strong focus on managing and optimizing GPU-based compute environments for machine learning and deep learning workloads.

In this role, you will be responsible for the end-to-end infrastructure lifecycle, from provisioning with Terraform and Ansible to deploying ML models using modern frameworks such as Hugging Face and Ollama.

Key Responsibilities :

- Manage infrastructure using Terraform and Ansible

- Deploy and monitor Kubernetes clusters with GPU support (including NVIDIA drivers and H100 SXM integration)

- Implement and manage inferencing frameworks such as Ollama and Hugging Face (see the inference sketch after this list)

- Support containerization (Docker), logging (EFK), and monitoring (Prometheus/Grafana)

- Handle GPU resource scheduling, isolation, and scaling for ML/DL workloads

- Collaborate closely with developers, data scientists, and ML engineers to streamline deployments and improve performance
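
For context on the inferencing responsibility above, here is a minimal Hugging Face inference sketch. It assumes a working PyTorch/CUDA environment; the model name and prompt are placeholders for illustration only, not part of this role's stack.

    # Minimal sketch: run a Hugging Face text-generation pipeline on a GPU if available.
    # "distilgpt2" and the prompt are placeholder choices for illustration.
    import torch
    from transformers import pipeline

    device = 0 if torch.cuda.is_available() else -1  # GPU index, or -1 for CPU fallback

    generator = pipeline("text-generation", model="distilgpt2", device=device)
    print(generator("GPU inference smoke test:", max_new_tokens=20)[0]["generated_text"])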

Required Skill Set :

- 5-8 years of hands-on experience in DevOps and infrastructure automation

- Proven experience in managing GPU-based compute environments

- Strong understanding of Docker, Kubernetes, and Linux internals

- Familiarity with GPU server hardware and instance types

- Proficient in scripting with Python and Bash

- Good understanding of ML model deployment, inferencing workflows, and resource utilization/metering (see the metering sketch after this list)
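
As a reference for the utilization/metering point above, a minimal Python sketch that polls nvidia-smi (assuming the NVIDIA driver and its CLI are installed on the host):

    # Minimal sketch: report per-GPU utilization and memory via nvidia-smi in CSV form.
    import subprocess

    query = "utilization.gpu,memory.used,memory.total"
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    for idx, line in enumerate(result.stdout.strip().splitlines()):
        util, mem_used, mem_total = [field.strip() for field in line.split(",")]
        print(f"GPU {idx}: {util}% utilization, {mem_used}/{mem_total} MiB used")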

Nice to Have :

- Experience with AI/ML pipelines

- Knowledge of cloud-native technologies (AWS/GCP/Azure) supporting GPU workloads

- Exposure to model performance benchmarking and A/B testing

