Description :

Job Title : GPU Administrator

Job Summary :

We are looking for an experienced GPU Administrator to manage and optimize our GPU-based compute infrastructure used for AI/ML, data processing, and high-performance workloads.

The ideal candidate will have strong expertise in Linux, GPU hardware/software, container technologies, and performance tuning.

Key Responsibilities :

GPU & Compute Infrastructure :

- Install, configure, and maintain GPU servers, clusters, and workstations (NVIDIA/AMD).

- Manage GPU drivers, CUDA/cuDNN versions, firmware, and toolkit upgrades.

- Monitor GPU utilization, thermals, memory usage, and hardware health.

Linux System Administration :

- Manage Linux environments (Ubuntu, RHEL, CentOS).

- Perform OS patching, security hardening, and system performance optimization.

- Handle storage, networking, and user management for GPU workloads.

Containers & Orchestration :

- Configure GPU-enabled Docker/Kubernetes environments.

- Manage NVIDIA GPU Operator, device plugins, and container runtime settings.

- Optimize workload scheduling and resource allocation in multi-tenant environments.

AI/ML & HPC Support :

- Support data scientists/ML engineers with environment setup and troubleshooting.

- Manage libraries/frameworks : PyTorch, TensorFlow, RAPIDS, JAX, etc.

- Work with distributed training tools (NCCL, Horovod, DeepSpeed) and cluster schedulers or distributed compute frameworks (SLURM, Ray).

Monitoring & Troubleshooting :

- Implement monitoring tools : DCGM, Prometheus, Grafana.

- Diagnose GPU performance issues, driver conflicts, and hardware failures.

- Conduct capacity planning and preventive maintenance.

Automation & DevOps :

- Automate deployments using Bash, Python, Ansible, or Terraform.

- Integrate GPU systems into CI/CD pipelines where required.

Required Skills :

- Strong knowledge of Linux Administration.

- Experience with NVIDIA/AMD GPU hardware & drivers.

- Proficiency in CUDA/cuDNN.

- Experience with GPU-enabled Docker and Kubernetes environments.

- Scripting : Python / Bash.

- System monitoring & performance tuning.

- Troubleshooting GPU, OS, and container-level issues.

Preferred Skills :

- HPC cluster management (SLURM, PBS, Ray).

- Cloud GPU instances (AWS, GCP, Azure).

- Distributed ML training frameworks.

- Infrastructure-as-Code (Ansible, Terraform).

- Familiarity with networking and storage concepts.

Posted by

Anuhya Arani

Talent acquisition specialist at INFOBELL IT SOLUTIONS PVT LTD

Posted in

Semiconductor/VLSI/EDA

Functional Area

Systems Administration

Job Code

1594197
