R&D Manager - HPC & GPU

Era Recruitment Services
Multiple Locations
5 - 12 Years

Posted on: 14/12/2025

Job Description

Description :

KLA's AI Advanced Computing Labs is looking for an extraordinary HPC System R&D Engineer to join its team and develop system-level HPC technologies that will form the foundation of next-generation clusters used in KLA tools, which leverage AI to push the boundaries of process control for semiconductor manufacturing. The technologies will be developed and demonstrated on on-prem clusters that serve as testbeds for next-generation KLA tools.

Your Day-to-day Roles :

- Expose limitations of existing solutions, based on clusters of CPUs & GPUs, for deploying AI-based solutions at scale on on-prem & cloud infrastructures.

- Develop distributed frameworks and system-level solutions that enable scaling out image processing & AI workloads from a single GPU to multi-node clusters with multiple GPUs.

- Install and benchmark pre-release hardware for early-stage evaluation and prototyping, identifying (or developing) relevant workloads.

Minimum Qualifications :

- Master's / PhD in Computer Science or a related field; Bachelor's degree holders with relevant experience and an extraordinary track record will also be considered.

- Deep understanding of operating systems, computer networks, and high-performance applications.

- Good mental model of the architecture of a modern distributed system comprising CPUs, GPUs, and accelerators.

- Experience deploying deep-learning frameworks such as TensorFlow and PyTorch on large-scale on-prem or cloud infrastructures.

- Strong background in modern and advanced C++ concepts.

- Strong scripting skills in Bash, Python, or similar.

- Good communication skills.

Things to Make us go Wow! :

- Experience with heterogeneous programming languages such as CUDA and Triton.

- Experience with model development on DL frameworks such as TensorFlow and PyTorch.

- Experience with building open-source operating systems and software stacks on pre-release hardware.

- Solid understanding of container infrastructure such as Docker, Singularity, and Kubernetes.

- Active participation in C++ standards bodies or similar.
