hirist

Leadsoc Technologies - Machine Learning Engineer

Leadsoc Technologies Pvt Ltd
Others
2 - 4 Years
Rating : 4.2 (48+ Reviews)

Posted on: 23/10/2025

Job Description

Role : Machine Learning Performance Engineer (C++ Focus)

This job description outlines a highly specialized, technically demanding role for a Machine Learning Engineer with a primary focus on C++ and deep system-level performance optimization for AI/ML inferencing and training. It targets candidates who can bridge the gap between theoretical AI models and high-efficiency, production-ready code.

Here is a detailed, premium breakdown of the requirements and the role's scope :

Role Title : Machine Learning Performance Engineer

Experience Level : Mid-Senior (2 - 4 years of highly relevant experience)

Location : Hyderabad, India

I. Core Engineering and System Performance (The C++ Mandate) :

This role requires expertise in building robust, low-latency infrastructure, not just training models.

- Deep Proficiency in C/C++ and Python : The foundational requirement is the ability to architect, implement, and maintain high-performance, resource-efficient code primarily in C++. Python is essential for tooling, rapid prototyping, and interfacing with ML frameworks.

- Operating System and Tooling Mastery : Candidates must possess expert-level hands-on experience with Linux commands for development, performance analysis, and deployment in server environments. This includes mandatory experience with scripting languages like Bash or PowerShell for automation and continuous integration pipelines.

- Advanced Debugging and Stability : Proven expertise in system and memory debugging using tools such as gdb and Valgrind is required to ensure code correctness, memory safety, and stability in production environments.

II. Deep Learning Optimization and Inferencing :

The successful candidate will be a performance specialist focused on execution speed and resource efficiency.

- ML Framework Interfacing : Strong understanding and experience utilizing popular Python ML frameworks like PyTorch and high-level libraries (e.g., Hugging Face Transformers).

- AI Inferencing Engine Expertise : Direct, demonstrable experience working with, optimizing, or contributing to high-performance model inferencing engines, specifically those targeting low-latency, efficient execution such as vLLM, ollama, llama.cpp, or sglang. Understanding the critical trade-offs between these engines is key.

- Algorithm Implementation : Deep understanding and practical experience in the implementation and optimization of core Machine Learning and AI algorithms, focusing on techniques like quantization, model compression, and efficient data handling.
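The quantization technique named above can be sketched in a few lines. This is an illustrative sketch in plain Python, not code from the posting; the weight values and the symmetric int8 scheme are hypothetical choices for demonstration:

```python
def quantize_int8(values):
    """Symmetric int8 quantization: scale floats into [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0  # one scale for the whole tensor
    return [round(v / scale) for v in values], scale

def dequantize_int8(codes, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in codes]

# Hypothetical weight vector: int8 storage needs 4x less memory (and memory
# bandwidth) than float32, at the cost of a small reconstruction error.
weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
recovered = dequantize_int8(codes, scale)
```

Production engines use per-channel or per-group scales and calibration data rather than a single tensor-wide scale, but the memory-versus-accuracy trade-off is the same.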

III. GPU/Hardware Acceleration and Low-Level Programming :

This is the most specialized and differentiating part of the role, requiring intimate knowledge of hardware and compilers.

- Custom Kernel Development : Experience in writing and optimizing bespoke Deep Learning GPU Kernels to maximize arithmetic intensity and memory bandwidth. Familiarity with domain-specific languages and tools for this purpose, such as Triton or JAX, is a major asset.

- GPU and PC Architecture Knowledge : Mandatory working knowledge of GPU architecture (including memory hierarchy, thread scheduling) and overall PC architecture is required to inform optimal coding and performance profiling decisions.

- CUDA/ROCm Programming : Proven experience in writing and optimizing custom ROCm or CUDA Kernels/Shaders for specialized operations beyond standard library calls.

- Performance Profiling : Hands-on expertise using advanced profiling tools like NVIDIA Nsight Systems (nsys) or AMD ROCprofiler (rocprof) to precisely identify and eliminate performance bottlenecks in GPU and host code.

- Low-Level System Insight (A Strong Plus) : Knowledge of x86 assembly language and x86/x64 CPU instructions is valuable for extreme optimization, deeply understanding compiler behavior, and minimizing host-side overhead.
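The interplay of arithmetic intensity and memory bandwidth that the kernel-development bullet refers to is commonly reasoned about with the roofline model. A minimal sketch in plain Python follows; the accelerator figures are hypothetical, chosen only to make the arithmetic concrete:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved between device memory and compute units."""
    return flops / bytes_moved

def roofline_bound(intensity, peak_flops, peak_bandwidth):
    """A kernel is compute-bound once its intensity exceeds the machine balance."""
    machine_balance = peak_flops / peak_bandwidth  # FLOP/byte
    return "compute-bound" if intensity >= machine_balance else "memory-bound"

# Hypothetical accelerator: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth,
# giving a machine balance of 50 FLOP/byte.
# Elementwise float32 add: 1 FLOP per 12 bytes (two 4-byte loads + one store).
verdict = roofline_bound(arithmetic_intensity(1, 12),
                         peak_flops=100e12, peak_bandwidth=2e12)
```

A memory-bound verdict is what motivates techniques such as kernel fusion and quantization: they reduce bytes moved rather than FLOPs, which is where such kernels actually spend their time.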

IV. Qualifications and Team Fit :

- Required Experience : 2 - 4 years in a role directly involving ML infrastructure, performance engineering, or high-performance computing (HPC).

- Availability : An immediate joiner is strongly preferred, reflecting the need for rapid integration into the team.

- Collaboration : Excellent technical communication skills are required, with the proven ability to articulate complex technical issues and work effectively with both technical teams and external stakeholders.

