Posted on: 23/12/2025
Description:
Job Title: GPU Administrator
Job Summary:
We are looking for an experienced GPU Administrator to manage and optimize our GPU-based compute infrastructure used for AI/ML, data processing, and high-performance workloads.
The ideal candidate will have strong expertise in Linux, GPU hardware/software, container technologies, and performance tuning.
Key Responsibilities:
GPU & Compute Infrastructure:
- Install, configure, and maintain GPU servers, clusters, and workstations (NVIDIA/AMD).
- Manage GPU drivers, CUDA/cuDNN versions, firmware, and toolkit upgrades.
- Monitor GPU utilization, thermals, memory usage, and hardware health.
Linux System Administration:
- Perform OS patching, security hardening, and system performance optimization.
- Handle storage, networking, and user management for GPU workloads.
Containers & Orchestration:
- Manage the NVIDIA GPU Operator, Kubernetes device plugins, and GPU container runtime settings.
- Optimize workload scheduling and resource allocation in multi-tenant environments.
AI/ML & HPC Support:
- Support data scientists/ML engineers with environment setup and troubleshooting.
- Manage libraries and frameworks such as PyTorch, TensorFlow, RAPIDS, and JAX.
- Work with distributed training tools (NCCL, Horovod, DeepSpeed), HPC schedulers (Slurm), and distributed compute frameworks (Ray).
Monitoring & Troubleshooting:
- Diagnose GPU performance issues, driver conflicts, and hardware failures.
- Conduct capacity planning and preventive maintenance.
Automation & DevOps:
- Integrate GPU systems into CI/CD pipelines where required.
Required Skills:
- Experience with NVIDIA/AMD GPU hardware & drivers.
- Proficiency in CUDA/cuDNN.
- Experience with GPU-enabled Docker and Kubernetes.
- Scripting: Python/Bash.
- System monitoring & performance tuning.
- Troubleshooting GPU, OS, and container-level issues.
Preferred Skills:
- Experience with cloud GPU instances (AWS/GCP/Azure).
- Distributed ML training frameworks.
- Infrastructure-as-Code (Ansible, Terraform).
- Familiarity with networking and storage concepts.
Posted in: Semiconductor/VLSI/EDA
Functional Area: Systems Administration
Job Code: 1594197