Posted on: 16/07/2025
We are seeking a highly skilled Kubernetes Orchestration Engineer with a strong background in managing GPU-accelerated AI/ML workloads in large-scale, containerized environments.
The ideal candidate will have experience orchestrating high-performance computing (HPC) clusters with Kubernetes and integrating them with GPU workloads, deep learning frameworks, and cloud-native infrastructure.
Key Responsibilities:
- Design, deploy, and manage Kubernetes clusters optimized for GPU-based AI/ML workloads
- Implement and maintain container orchestration for deep learning pipelines using tools such as Kubeflow, NVIDIA GPU Operator, and KubeVirt
- Work closely with data scientists and AI engineers to ensure optimal resource allocation, scheduling, and job orchestration
- Automate deployment and scaling of AI/ML workloads across hybrid and multi-cloud environments
- Integrate Kubernetes with storage and networking solutions tailored for HPC and low-latency workloads
- Monitor and fine-tune cluster performance for GPU utilization, node availability, and job throughput
- Design CI/CD pipelines to manage containerized AI models and data science workflows
- Ensure compliance, security, and fault tolerance of the orchestration platform
Required Skills & Experience:
- 4+ years of experience working in DevOps/SRE or Platform Engineering with Kubernetes
- Hands-on experience managing GPU-enabled Kubernetes clusters
- Experience deploying and operating Kubeflow, Argo Workflows, or MLflow on Kubernetes
- Strong knowledge of containerization using Docker, container registries, and Helm charts
- Experience with cloud-native infrastructure and GPU instance types
- Familiarity with AI/ML tools such as TensorFlow, PyTorch, and Hugging Face, and with deep learning pipelines in production
- Proficiency with monitoring and logging tools
- Working knowledge of Linux systems, scripting, and infrastructure-as-code
Preferred / Bonus Skills:
- Knowledge of Kubernetes-native storage solutions and GPU-aware scheduling policies
- Background in HPC clusters, Slurm, or MPI workloads
- Familiarity with service meshes
- Certifications: CKA, CKAD, NVIDIA Certified Developer
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1514024