Posted on: 07/01/2026
Job Role :
As Lead/Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond). This is a hands-on leadership role - perfect for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You'll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.
Experience : 5 Years to 7 Years
Location : Bangalore
Mode : Work from Office
Notice Period : Immediate Joiner or Max up to 30 days serving notice
Mandatory :
- Strong AI runtime engineering background at the Lead/Staff level
- Must have 4+ years of software engineering experience
- Must have proven 1+ years of experience designing, building, and owning AI runtime infrastructure supporting distributed training and/or inference at scale
- Must have hands-on experience optimizing deep learning runtimes such as PyTorch and TensorFlow
- Must have strong low-level performance engineering experience, including profiling, debugging, and optimizing system throughput, latency, and reliability
- Must have experience leading or mentoring a team, including technical guidance, code reviews, and delivery ownership
- Must have strong programming skills in Python, Java, or C++
Preferred :
- Preferred (AI Infrastructure) : Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestration frameworks
- Preferred (LLM Systems) : Exposure to LLM training pipelines, checkpointing, and elastic or distributed training orchestration
Ideal Candidate :
- 5+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.
- Experience in delivering PaaS services.
- Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
- Strong programming skills in Python and C++ (Go or Rust is a plus).
- Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
- Experience working with multi-GPU, multi-node, or cloud-native AI workloads.
- Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.