
AI Runtime Engineer

Scaling Theory Technologies Pvt Ltd
7 - 11 Years

Posted on: 14/01/2026

Job Description

Description:

As Lead/Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond). This is a hands-on leadership role - perfect for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure.

You'll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.

The core responsibilities for the job include the following:

Lead Runtime Design and Development:

- Own the core runtime architecture supporting AI training and inference at scale.
- Design resilient and elastic runtime features (e.g., dynamic node scaling, job recovery) within our custom PyTorch stack.
- Optimize distributed training reliability, orchestration, and job-level fault tolerance.

Drive Performance at Scale:

- Profile and enhance low-level system performance across training and inference pipelines.
- Improve packaging, deployment, and integration of customer models in production environments.
- Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU setups.

Build Internal Tooling and Frameworks:

- Design and maintain libraries and services that support the model lifecycle: training, checkpointing, fault recovery, packaging, and deployment.
- Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads.
- Champion best practices in CI/CD, testing, and software quality across the AI runtime stack.

Collaborate and Mentor:

- Work cross-functionally with Research, Infrastructure, and Product teams to align runtime development with customer and platform needs.
- Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team's capabilities.

Requirements:

- 8+ years of experience in systems/software engineering, with deep exposure to AI runtimes, distributed systems, or compiler/runtime interaction.
- Experience delivering PaaS services.
- Proven experience optimizing and scaling deep learning runtimes (e.g., PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
- Strong programming skills in Python and C++ (Go or Rust is a plus).
- Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
- Experience working with multi-GPU, multi-node, or cloud-native AI workloads.
- Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.

Bonus Points:

- Contributions to PyTorch internals or open-source DL infrastructure projects.
- Familiarity with LLM training pipelines, checkpointing, or elastic training orchestration.
- Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestrators.
- Background in systems research, compilers, or runtime architecture for HPC or ML.
- Previous start-up experience.

