Posted on: 03/02/2026
Description:
Key Responsibilities:
Architecture & Infrastructure:
- Design, implement, and optimize end-to-end ML training workflows, including infrastructure setup, orchestration, fine-tuning, deployment, and monitoring (a minimal launch sketch follows this list).
- Evaluate and integrate single- and multi-cloud training options across AWS and other major platforms.
- Lead cluster configuration, orchestration design, environment customization, and scaling strategies.
- Compare and recommend hardware options (GPUs, TPUs, accelerators) based on performance, cost, and availability.
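To make the first responsibility above concrete, here is a minimal sketch of launching a managed distributed training job with the SageMaker Python SDK. The entry point, IAM role, S3 paths, and instance choices are placeholders rather than recommendations, and the same workflow could equally be built on EKS or Slurm.

```python
from sagemaker.pytorch import PyTorch

# Hypothetical values: replace the role, bucket, and entry point with real ones.
estimator = PyTorch(
    entry_point="train.py",              # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="2.1",
    py_version="py310",
    instance_count=2,                    # two-node data-parallel job
    instance_type="ml.p4d.24xlarge",     # 8x A100 per node
    hyperparameters={"epochs": 3, "lr": 3e-4},
)

# Kick off training against data staged in S3; SageMaker provisions the
# cluster, runs the job, tears the cluster down, and streams logs back.
estimator.fit({"training": "s3://my-bucket/datasets/corpus"})
```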
Technical Expertise Requirements:
- At least 5 years in AI/ML infrastructure and large-scale training environments.
- Expert-level command of AWS cloud services (EC2, S3, EKS, SageMaker, Batch, FSx, etc.) and familiarity with Azure, GCP, and hybrid/multi-cloud setups.
- Strong knowledge of AI/ML training frameworks (PyTorch, TensorFlow, Hugging Face, DeepSpeed, Megatron, Ray, etc.); see the distributed-training sketch after this list.
- Proven experience with cluster orchestration tools (Kubernetes, Slurm, Ray, SageMaker, Kubeflow).
- Deep understanding of hardware architectures for AI workloads (NVIDIA, AMD, Intel Habana, TPU).
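As one illustration of the framework requirement, here is a minimal PyTorch DistributedDataParallel skeleton. The linear layer and training loop are stand-ins, and the script assumes it is launched with torchrun so that RANK, LOCAL_RANK, and WORLD_SIZE are set in the environment.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                       # stand-in training loop
        x = torch.randn(8, 4096, device=local_rank)
        loss = model(x).square().mean()
        loss.backward()                       # gradients all-reduced by DDP
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch: torchrun --nproc_per_node=8 this_script.py
```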
LLM Inference Optimization:
- Expert knowledge of inference optimization techniques including speculative decoding, KV cache optimization (MQA/GQA/PagedAttention), and dynamic batching.
- Deep understanding of prefill vs. decode phases and of memory-bound vs. compute-bound operations (illustrated in the sketch after this list).
- Experience with quantization methods (INT4/INT8, GPTQ, AWQ) and model parallelism strategies.
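A minimal sketch of why the prefill/decode distinction matters, using an explicit KV cache with Hugging Face transformers (gpt2 is only a stand-in checkpoint): the prompt is processed once in a single compute-bound forward pass, after which each decode step feeds in one token and reuses the cached keys/values, making it memory-bandwidth-bound.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in checkpoint; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

input_ids = tok("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over the whole prompt (compute-bound).
    out = model(input_ids, use_cache=True)
    cache = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    generated = [next_id]

    # Decode: one token per step, reusing cached K/V (memory-bound).
    for _ in range(32):
        out = model(next_id, past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```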
Inference Frameworks:
- Hands-on experience with production inference engines: vLLM, TensorRT-LLM, DeepSpeed-Inference, or TGI (a minimal vLLM example follows this list).
- Proficiency with serving frameworks: Triton Inference Server, KServe, or Ray Serve.
- Familiarity with kernel optimization libraries (FlashAttention, xFormers).
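For reference, the offline vLLM API looks like the sketch below; the model id is a placeholder, and PagedAttention plus continuous batching happen inside the engine rather than in user code.

```python
from vllm import LLM, SamplingParams

# Placeholder model id; swap in whatever checkpoint you actually serve.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# The engine schedules and batches these requests dynamically.
outputs = llm.generate(
    ["Explain KV caching in one paragraph.",
     "What is speculative decoding?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```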
Performance Engineering:
- Proven ability to optimize inference metrics: TTFT (time to first token), ITL (inter-token latency), and throughput; see the measurement helper after this list.
- Experience profiling and resolving GPU memory bottlenecks and OOM issues.
- Knowledge of hardware-specific optimizations for modern GPU architectures (A100/H100).
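One way to make these metrics concrete: a small helper that derives TTFT, mean ITL, and throughput from any streaming token iterator. The stream argument is hypothetical; in practice it would be, for example, a streaming client for an OpenAI-compatible inference server.

```python
import time
from typing import Iterable, Tuple

def measure_stream(stream: Iterable[str]) -> Tuple[float, float, float]:
    """Return (TTFT, mean ITL, tokens/sec) for a token stream.

    `stream` is a hypothetical iterator that yields one token at a time,
    e.g. a streaming response from an inference server.
    """
    start = time.perf_counter()
    stamps = []
    for _ in stream:
        stamps.append(time.perf_counter())  # arrival time of each token
    if not stamps:
        raise ValueError("stream produced no tokens")

    ttft = stamps[0] - start                          # time to first token
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0  # inter-token latency
    throughput = len(stamps) / (stamps[-1] - start)    # tokens per second
    return ttft, mean_itl, throughput
```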
Fine-Tuning:
- Drive end-to-end fine-tuning of LLMs, including model selection, dataset preparation/cleaning, tokenization, and evaluation with baseline metrics.
- Configure and execute fine-tuning experiments (LoRA, QLoRA, etc.) on large-scale compute setups, ensuring optimal hyperparameter tuning, logging, and checkpointing (a minimal LoRA setup is sketched after this list).
- Document fine-tuning outcomes by capturing performance metrics (loss curves, BERTScore/ROUGE, training time, resource utilization) and benchmark against baseline models.
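As a minimal sketch of the LoRA path, assuming the Hugging Face peft library and a placeholder base checkpoint: only the low-rank adapter matrices are trained, which keeps the trainable-parameter count, memory footprint, and checkpoint size small relative to full fine-tuning.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"  # placeholder base checkpoint
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# LoRA: freeze the base weights, train low-rank adapters only.
config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of base params
```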
Posted by: Dushyant Waghmare, Head - Talent Acquisition at FISSION COMPUTER LABS PRIVATE LIMITED
Last Active: 3 Feb 2026
Posted in: AI/ML
Functional Area: Data Science
Job Code: 1609334