Posted on: 22/04/2026
Description:
- Tune inference performance: Reduce end-to-end latency and increase throughput across real production traffic patterns.
- Optimize runtimes and servers: Scale inference across heterogeneous GPU fleets; optimize stacks such as vLLM, Triton, and related components (e.g., schedulers, KV cache, batching, memory).
- Benchmark and measure: Build benchmarking suites, metrics, and tooling to quantify latency, throughput, GPU utilization, memory, and cost.
- Reliability and observability: Improve monitoring, tracing, and alerting; participate in incident response and postmortems to harden systems.
- Apply and ship new optimizations: Evaluate research and implement pragmatic inference optimizations (e.g., quantization, paging, kernel/runtime improvements).
- Partner cross-functionally: Work with data science and product teams to translate business requirements into performance and availability SLOs.
What We're Looking For:
- Experience deploying and operating LLM inference services in production.
- Strong production coding skills in Python plus Go or Rust (systems-level implementation and debugging).
- Experience with ML frameworks and runtimes: PyTorch, vLLM, SGLang (and/or TensorRT).
- Knowledge of GPU architecture and performance (profiling, memory bandwidth/latency tradeoffs); CUDA/kernel programming is a strong plus.
- Solid understanding of LLM inference and optimization techniques: continuous batching, KV cache management, quantization, speculative decoding (nice-to-have), etc.
- 3+ years hands-on experience in performance optimization and systems programming for AI/ML workloads.
- Demonstrated ability to deliver measurable production improvements (e.g., 2X throughput, lower p95/p99 latency, reduced GPU cost).
- Proven skill in root-cause analysis: finding bottlenecks across model, runtime, networking, and infrastructure.