Posted on: 19/12/2025
Description :
Role :
We're hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level: someone who knows exactly how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance.
You'll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures.
Your work directly impacts the latency, throughput, and reliability of Smallest's real-time speech models.
What You'll Do :
- Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware.
- Profile models end-to-end to identify GPU bottlenecks: memory bandwidth, kernel launch overhead, fusion opportunities, and quantization constraints.
- Design and implement custom kernels (CUDA/Triton/Tinygrad) for performance-critical model sections.
- Perform operator fusion, graph optimization, and kernel-level scheduling improvements (a minimal fusion sketch follows this list).
- Tune models to fit GPU memory limits while maintaining quality.
- Benchmark and calibrate inference across NVIDIA, AMD, and, potentially, emerging accelerators.
- Port models across GPU chipsets (NVIDIA ↔ AMD / edge GPUs / new compute backends).
- Work with TensorRT, ONNX Runtime, and custom runtimes for deployment.
- Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads.
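To give a flavor of the fusion work above, here is a minimal, hypothetical Triton sketch that fuses a bias add and a ReLU into a single kernel, saving one round trip through global memory. The kernel name, shapes, and block size are illustrative assumptions, not part of our actual stack.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)  # assumes bias is pre-broadcast to x's shape
    # The fusion: bias add and ReLU in one pass, instead of two kernels
    # that each read and write the full tensor in global memory.
    y = tl.maximum(x + b, 0.0)
    tl.store(out_ptr + offsets, y, mask=mask)

def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_bias_relu_kernel[grid](x, bias, out, n, BLOCK_SIZE=1024)
    return out
```

In practice the wins come from fusing longer chains (e.g. matmul epilogues, normalization + activation), but the memory-traffic argument is the same.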
Requirements :
- Strong understanding of GPU architecture: SMs, warps, memory hierarchy, and occupancy tuning.
- Hands-on experience with CUDA, kernel writing, and kernel-level debugging.
- Experience with kernel fusion and model graph optimizations.
- Familiarity with TensorRT, ONNX, Triton, tinygrad, or similar inference engines.
- Strong proficiency in PyTorch and Python.
- Deep understanding of model architectures (transformers, convs, RNNs, attention, diffusion blocks).
- Experience profiling GPU workloads using Nsight, nvprof, or similar tools (see the profiling sketch after this list).
- Strong problem-solving abilities with a performance-first mindset.
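As a concrete example of the profiling work listed above, the snippet below uses torch.profiler as a lightweight stand-in for Nsight-style tooling to rank CUDA kernels by total GPU time; the model and tensor sizes are arbitrary placeholders.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload: one FP16 linear layer on the GPU.
model = torch.nn.Linear(4096, 4096).cuda().half()
x = torch.randn(64, 4096, device="cuda", dtype=torch.half)

# Warm up so one-time CUDA initialization doesn't skew the trace.
for _ in range(3):
    model(x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

# Sort by total CUDA time to see which kernels dominate the run.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

A table like this tells you whether you are bound by a handful of large kernels (optimize those) or by many tiny launches (fuse, or batch them with CUDA Graphs).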
Great to Have :
- Experience with quantization (INT8, FP8, hybrid formats); a short INT8 sketch follows this list.
- Experience with audio/speech models (ASR, TTS, SSL, vocoders).
- Contributions to open-source GPU stacks or inference runtimes.
- Published work related to systems-level model optimization.
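On the quantization point above, here is a bare-bones sketch of symmetric per-tensor INT8 weight quantization; the function names are illustrative, and production pipelines would typically use per-channel scales plus calibration (e.g. via TensorRT).

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: map the largest absolute
    # weight to 127, then round and clamp into the INT8 range.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, s = quantize_int8(w)
# Max reconstruction error is bounded by half a quantization step.
print((w - dequantize_int8(q, s)).abs().max())
```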
Who Will Succeed in This Role :
Someone who :
- thinks in kernels, not just layers.
- knows which optimizations are theoretical vs practically impactful.
- understands GPU boundaries (memory, bandwidth, latency) and how to work around them.
- is excited by the challenge of ultra-low latency and large-scale real-time inference.
- loves debugging at the CUDA + model level.