Posted on: 14/01/2026
Description :
You'll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.
The core responsibilities for the job include the following :
Lead Runtime Design and Development :
- Design resilient and elastic runtime features (e.g., dynamic node scaling, job recovery) within our custom PyTorch stack.
- Optimize distributed training reliability, orchestration, and job-level fault tolerance.
Drive Performance at Scale :
- Improve packaging, deployment, and integration of customer models in production environments.
- Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU setups.
Build Internal Tooling and Frameworks :
- Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads.
- Champion best practices in CI/CD, testing, and software quality across the AI runtime stack.
Collaborate and Mentor :
- Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team's capabilities.
Requirements :
- Experience delivering PaaS offerings.
- Proven experience optimizing and scaling deep learning runtimes (e.g., PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
- Strong programming skills in Python and C++ (Go or Rust is a plus).
- Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
- Experience working with multi-GPU, multi-node, or cloud-native AI workloads.
- Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.
Bonus Points :
- Familiarity with LLM training pipelines, checkpointing, or elastic training orchestration.
- Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestrators.
- Background in systems research, compilers, or runtime architecture for HPC or ML.
- Previous start-up experience.