HamburgerMenu
hirist

Job Description

Description :

Job Title : GPU Administrator

Job Summary :

We are looking for an experienced GPU Administrator to manage and optimize our GPU-based compute infrastructure used for AI/ML, data processing, and high-performance workloads.

The ideal candidate will have strong expertise in Linux, GPU hardware/software, container technologies, and performance tuning.

Key Responsibilities :

GPU & Compute Infrastructure :

- Install, configure, and maintain GPU servers, clusters, and workstations (NVIDIA/AMD).

- Manage GPU drivers, CUDA/cuDNN versions, firmware, and toolkit upgrades.

- Monitor GPU utilization, thermals, memory usage, and hardware health.

Linux System Administration :


- Manage Linux environments (Ubuntu, RHEL, CentOS).

- Perform OS patching, security hardening, and system performance optimization.

- Handle storage, networking, and user management for GPU workloads.

Containers & Orchestration :


- Configure GPU-enabled Docker/Kubernetes environments.

- Manage NVIDIA GPU Operator, device plugins, and container runtime settings.

- Optimize workload scheduling and resource allocation in multi-tenant environments.

AI/ML & HPC Support :

- Support data scientists/ML engineers with environment setup and troubleshooting.

- Manage libraries/frameworks : PyTorch, TensorFlow, RAPIDS, JAX, etc.

- Work with distributed training tools (NCCL, Horovod, DeepSpeed) and HPC schedulers (SLURM/Ray).

Monitoring & Troubleshooting :


- Implement monitoring tools : DCGM, Prometheus, Grafana.

- Diagnose GPU performance issues, driver conflicts, and hardware failures.

- Conduct capacity planning and preventive maintenance.

Automation & DevOps :


- Automate deployments using Bash, Python, Ansible, or Terraform.

- Integrate GPU systems into CI/CD pipelines where required.

Required Skills :


- Strong knowledge of Linux Administration.

- Experience with NVIDIA/AMD GPU hardware & drivers.

- Proficiency in CUDA/cuDNN.

- Docker and Kubernetes (GPU-enabled).

- Scripting : Python / Bash.

- System monitoring & performance tuning.

- Troubleshooting GPU, OS, and container-level issues.

Preferred Skills :


- HPC cluster management (SLURM, PBS, Ray).

- Cloud (AWS/GCP/Azure GPU instances).

- Distributed ML training frameworks.

- Infrastructure-as-Code (Ansible, Terraform).

- Familiarity with networking and storage concepts.


info-icon

Did you find something suspicious?