TECEZE - Kubernetes Orchestration Engineer - Monitoring Tools

TECEZE CONSULTANCY SERVICES PRIVATE LIMITED
Dubai
4 - 7 Years

Posted on: 16/07/2025

Job Description

We are seeking a highly skilled Kubernetes Orchestration Engineer with a strong background in managing GPU-accelerated AI/ML workloads in large-scale, containerized environments.

The ideal candidate will have experience orchestrating high-performance computing (HPC) clusters with Kubernetes and integrating them with GPU workloads, deep learning frameworks, and cloud-native infrastructure.


Key Responsibilities:

- Design, deploy, and manage Kubernetes clusters optimized for GPU-based AI/ML workloads

- Implement and maintain container orchestration for deep learning pipelines using tools such as Kubeflow, NVIDIA GPU Operator, and KubeVirt

- Work closely with data scientists and AI engineers to ensure optimal resource allocation, scheduling, and job orchestration

- Automate deployment and scaling of AI/ML workloads across hybrid and multi-cloud environments

- Integrate Kubernetes with storage and networking solutions tailored for HPC and low-latency workloads

- Monitor and fine-tune cluster performance for GPU utilization, node availability, and job throughput

- Design CI/CD pipelines to manage containerized AI models and data science workflows

- Ensure compliance, security, and fault tolerance of the orchestration platform


Required Skills & Experience:


- 4+ years of experience in DevOps, SRE, or platform engineering roles working with Kubernetes

- Hands-on experience managing GPU-enabled Kubernetes clusters


- Experience deploying and operating Kubeflow, Argo Workflows, or MLflow on Kubernetes

- Strong knowledge of containerization using Docker, container registries, and Helm charts

- Experience with cloud-native infrastructure and GPU instance types

- Familiarity with AI/ML tools such as TensorFlow, PyTorch, and Hugging Face, and with deep learning pipelines in production

- Proficiency with monitoring and logging tools

- Working knowledge of Linux systems, scripting, and infrastructure-as-code


Preferred / Bonus Skills:


- Experience with KServe (formerly KFServing) or NVIDIA Triton Inference Server

- Knowledge of Kubernetes-native storage solutions and GPU-aware scheduling policies

- Background in HPC clusters, Slurm, or MPI workloads

- Familiarity with service meshes

- Certifications: CKA, CKAD, NVIDIA Certified Developer

