Posted on: 07/01/2026
Job Role :
As Lead/Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond). This is a hands-on leadership role - perfect for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You'll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.
Experience : 5 Years to 7 Years
Location : Bangalore
Mode : Work from Office
Notice Period : Immediate Joiner or Max up to 30 days serving notice
Mandatory :
- Strong AI runtime engineering background at the Lead/Staff level
- Must have 4+ years of software engineering experience
- Must have proven 1+ years of experience designing, building, and owning AI runtime infrastructure supporting distributed training and/or inference at scale
- Must have hands-on experience optimizing deep learning runtimes such as PyTorch and TensorFlow
- Must have strong low-level performance engineering experience, including profiling, debugging, and optimizing system throughput, latency, and reliability
- Must have experience leading or mentoring a team, including technical guidance, code reviews, and delivery ownership
- Must have strong programming skills in Python, Java, or C++
Preferred :
- Preferred (AI Infrastructure) : Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestration frameworks
- Preferred (LLM Systems) : Exposure to LLM training pipelines, checkpointing, and elastic or distributed training orchestration
Ideal Candidate :
- 5+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.
- Experience in delivering PaaS services.
- Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
- Strong programming skills in Python and C++ (Go or Rust is a plus).
- Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
- Experience working with multi-GPU, multi-node, or cloud-native AI workloads.
- Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.