Staff AI Runtime Engineer

Scaling Theory Technologies Pvt Ltd
Bangalore
8 - 12 Years

Posted on: 03/01/2026

Job Description

Responsibilities :


- Own the core runtime architecture supporting AI training and inference at scale.


- Design resilient and elastic runtime features (e.g., dynamic node scaling, job recovery) within our custom PyTorch stack.


- Optimise distributed training reliability, orchestration, and job-level fault tolerance.


- Profile and enhance low-level system performance across training and inference pipelines.


- Improve packaging, deployment, and integration of customer models in production environments.


- Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU setups.


- Design and maintain libraries and services that support the model lifecycle: training, checkpointing, fault recovery, packaging, and deployment.


- Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads.


- Champion best practices in CI/CD, testing, and software quality across the AI Runtime stack.


- Work cross-functionally with Research, Infrastructure, and Product teams to align runtime development with customer and platform needs.


- Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team's capabilities.


Requirements :


- 8+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.


- Experience delivering PaaS offerings.


- Proven experience optimising and scaling deep learning runtimes (e.g., PyTorch, TensorFlow, JAX) for large-scale training and/or inference.


- Strong programming skills in Python and C++ (Go or Rust is a plus).


- Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.


- Experience working with multi-GPU, multi-node, or cloud-native AI workloads.


- Solid understanding of containerised workloads, job scheduling, and failure recovery in production environments.


Bonus Points :


- Contributions to PyTorch internals or open-source DL infrastructure projects.


- Familiarity with LLM training pipelines, checkpointing, or elastic training orchestration.


- Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestrators.


- Background in systems research, compilers, or runtime architecture for HPC or ML.


- Previous startup experience.

