Posted on: 22/04/2026
Description:
- Tune inference performance: Reduce end-to-end latency and increase throughput across real production traffic patterns.
- Optimize runtimes and servers: Scale inference across heterogeneous GPU fleets; optimize stacks such as vLLM, Triton, and related components (e.g., schedulers, KV cache, batching, memory).
- Benchmark and measure: Build benchmarking suites, metrics, and tooling to quantify latency, throughput, GPU utilization, memory, and cost.
- Reliability and observability: Improve monitoring, tracing, and alerting; participate in incident response and postmortems to harden systems.
- Apply and ship new optimizations: Evaluate research and implement pragmatic inference optimizations (e.g., quantization, paging, kernel/runtime improvements).
- Partner cross-functionally: Work with data science and product teams to translate business requirements into performance and availability SLOs.
What We're Looking For:
- Experience deploying and operating LLM inference services in production.
- Strong production coding skills in Python plus Go or Rust (systems-level implementation and debugging).
- Experience with ML frameworks and runtimes: PyTorch, vLLM, SGLang (and/or TensorRT).
- Knowledge of GPU architecture and performance (profiling, memory bandwidth/latency tradeoffs); CUDA/kernel programming is a strong plus.
- Solid understanding of LLM inference and optimization techniques: continuous batching, KV cache management, quantization, speculative decoding (nice-to-have), etc.
- 3+ years hands-on experience in performance optimization and systems programming for AI/ML workloads.
- Demonstrated ability to deliver measurable production improvements (e.g., 2X throughput, lower p95/p99 latency, reduced GPU cost).
- Proven skill in root-cause analysis: finding bottlenecks across model, runtime, networking, and infrastructure.