AI Runtime Lead - Deep Learning

MNM HIRETECH PVT LTD
Bangalore
5 - 7 Years

Posted on: 07/01/2026

Job Description

Job Role :


As Lead/Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond). This is a hands-on leadership role, perfect for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You'll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.


Experience : 5 Years to 7 Years


Location : Bangalore


Mode : Work from Office


Notice Period : Immediate joiner, or serving a notice period of at most 30 days


Mandatory :


- Strong AI Runtime Engineering (Lead / Staff) Profiles


- Must have 4+ years of software engineering experience


- Must have 1+ years of proven experience designing, building, and owning AI runtime infrastructure supporting distributed training and/or inference at scale


- Must have hands-on experience optimizing deep learning runtimes such as PyTorch, TensorFlow, etc.


- Must have strong low-level performance engineering experience, including profiling, debugging, and optimizing system throughput, latency, and reliability


- Must have experience leading or mentoring a team, including technical guidance, code reviews, and delivery ownership


- Must have strong programming skills in Python, Java, C++, etc.


Preferred :


- AI Infrastructure: Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestration frameworks


- LLM Systems: Exposure to LLM training pipelines, checkpointing, and elastic or distributed training orchestration


Ideal Candidate :


- 5+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.


- Experience in delivering PaaS services.


- Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.


- Strong programming skills in Python and C++ (Go or Rust is a plus).


- Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.


- Experience working with multi-GPU, multi-node, or cloud-native AI workloads.


- Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.

