
GPU Infrastructure Specialist - ELK Stack

Watsonite
Anywhere in India/Multiple Locations
3 - 8 Years

Posted on: 05/01/2026

Job Description

We are hiring a GPU Infrastructure Specialist for a reputed global client to support and scale high-performance computing and AI/ML workloads. This is a 100% remote opportunity, offering exposure to cutting-edge GPU technologies and large-scale infrastructure.

Job Title : GPU Infrastructure Specialist

Location : Remote

Experience Level : 3+ Years

Department : Data & Analytics

Role Overview :


We are looking for a GPU Infrastructure Specialist to manage and optimize GPU-based environments for model hosting and high-performance computing workloads. The ideal candidate will have hands-on experience with the NVIDIA, AMD, and SambaNova GPU ecosystems, and a strong background in resource management, performance tuning, and observability within large-scale AI/ML environments.

Key Responsibilities :


- Manage, configure, and maintain GPU infrastructure across on-premise and cloud environments.

- Handle GPU resource allocation, scheduling, and orchestration for AI/ML workloads.

- Oversee driver updates, operator management, and compatibility across multiple GPU vendors (NVIDIA, AMD, SambaNova).

- Implement GPU tuning and performance optimization strategies to ensure efficient model inference and training performance.

- Monitor GPU utilization, latency, and system health using observability and alerting tools (e.g., Prometheus, Grafana, NVIDIA DCGM).

- Collaborate with AI engineers, DevOps, and MLOps teams to ensure seamless model deployment and hosting across GPU clusters.

- Develop automation scripts and workflows for GPU provisioning, scaling, and lifecycle management.

- Troubleshoot GPU performance issues, memory bottlenecks, and hardware-level anomalies.

Required Skills & Experience :

- Strong experience managing GPU infrastructure (NVIDIA, AMD, SambaNova).

- Proficiency in resource scheduling and orchestration (Kubernetes, Slurm, Ray, or similar).

- Knowledge of driver and operator management in multi-vendor environments.

- Experience with GPU tuning, profiling, and performance benchmarking.

- Familiarity with observability and alerting tools (Prometheus, Grafana, ELK Stack, etc.).

- Hands-on experience with model hosting platforms (Triton Inference Server, TensorRT, ONNX Runtime, etc.) is a plus.

- Working knowledge of Linux systems, Docker/Kubernetes, and CI/CD pipelines.

- Strong scripting skills in Python, Bash, or Go.

Preferred Qualifications :

- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

- Certifications in GPU computing (e.g., NVIDIA Certified Administrator, CUDA).

- Experience with AI/ML model lifecycle management in production environments.

