Posted on: 19/12/2025
Description :
Role :
We're hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level: someone who knows exactly how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance.
You'll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures.
Your work directly impacts the latency, throughput, and reliability of Smallest's real-time speech models.
What You'll Do :
- Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware.
- Profile models end-to-end to identify GPU bottlenecks: memory bandwidth, kernel launch overhead, fusion opportunities, and quantization constraints.
- Design and implement custom kernels (CUDA/Triton/Tinygrad) for performance-critical model sections.
- Perform operator fusion, graph optimization, and kernel-level scheduling improvements (a minimal fusion sketch follows this list).
- Tune models to fit GPU memory limits while maintaining quality.
- Benchmark and calibrate inference across NVIDIA, AMD, and, potentially, emerging accelerators.
- Port models across GPU chipsets (NVIDIA ↔ AMD / edge GPUs / new compute backends).
- Work with TensorRT, ONNX Runtime, and custom runtimes for deployment.
- Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads.
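To give a flavor of the fusion work above, here is a minimal, hypothetical Triton sketch that fuses a bias add and a ReLU into a single kernel, saving one round trip through global memory. The kernel name, shapes, and block size are illustrative assumptions, not part of our actual stack.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)  # assumes bias is pre-broadcast to x's shape
    # The fusion: bias add and ReLU in one pass, instead of two kernels
    # that each read and write the full tensor in global memory.
    y = tl.maximum(x + b, 0.0)
    tl.store(out_ptr + offsets, y, mask=mask)

def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_bias_relu_kernel[grid](x, bias, out, n, BLOCK_SIZE=1024)
    return out
```

In practice the wins come from fusing longer chains (e.g. matmul epilogues, normalization + activation), but the memory-traffic argument is the same.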
Requirements :
- Strong understanding of GPU architecture: SMs, warps, memory hierarchy, and occupancy tuning.
- Hands-on experience with CUDA, kernel writing, and kernel-level debugging.
- Experience with kernel fusion and model graph optimizations.
- Familiarity with TensorRT, ONNX, Triton, tinygrad, or similar inference engines.
- Strong proficiency in PyTorch and Python.
- Deep understanding of model architectures (transformers, convs, RNNs, attention, diffusion blocks).
- Experience profiling GPU workloads using Nsight, nvprof, or similar tools (see the profiling sketch after this list).
- Strong problem-solving abilities with a performance-first mindset.
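As a concrete example of the profiling work listed above, the snippet below uses torch.profiler as a lightweight stand-in for Nsight-style tooling to rank CUDA kernels by total GPU time; the model and tensor sizes are arbitrary placeholders.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload: one FP16 linear layer on the GPU.
model = torch.nn.Linear(4096, 4096).cuda().half()
x = torch.randn(64, 4096, device="cuda", dtype=torch.half)

# Warm up so one-time CUDA initialization doesn't skew the trace.
for _ in range(3):
    model(x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

# Sort by total CUDA time to see which kernels dominate the run.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

A table like this tells you whether you are bound by a handful of large kernels (optimize those) or by many tiny launches (fuse, or batch them with CUDA Graphs).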
Great to Have :
- Experience with quantization (INT8, FP8, hybrid formats); a short INT8 sketch follows this list.
- Experience with audio/speech models (ASR, TTS, SSL, vocoders).
- Contributions to open-source GPU stacks or inference runtimes.
- Published work related to systems-level model optimization.
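On the quantization point above, here is a bare-bones sketch of symmetric per-tensor INT8 weight quantization; the function names are illustrative, and production pipelines would typically use per-channel scales plus calibration (e.g. via TensorRT).

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: map the largest absolute
    # weight to 127, then round and clamp into the INT8 range.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, s = quantize_int8(w)
# Max reconstruction error is bounded by half a quantization step.
print((w - dequantize_int8(q, s)).abs().max())
```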
Who Will Succeed in This Role :
Someone who :
- thinks in kernels, not just layers.
- knows which optimizations are theoretical vs practically impactful.
- understands GPU boundaries (memory, bandwidth, latency) and how to work around them.
- is excited by the challenge of ultra-low latency and large-scale real-time inference.
- loves debugging at the CUDA + model level.