Posted on: 22/09/2025
Job Description :
Customer Interview
No location criteria
Key Responsibilities :
- Analyze tracing logs from LLM inference and training runs to identify performance issues and inefficiencies.
- Develop tools and scripts to parse, visualize, and monitor LLM tracing data.
- Collaborate with ML and infra teams to recommend and implement performance optimizations.
- Create documentation and dashboards to track optimization progress over time.
- Investigate and resolve model latency and throughput issues related to runtime behavior.
- Contribute to best practices for performance tracing, benchmarking, and logging across model deployments.
Required Qualifications :
- Bachelor's or Master's degree in Computer Science, Machine Learning, or a related field.
- Experience working with large-scale ML models, preferably LLMs (e.g., GPT, BERT).
- Proficiency in Python and common ML frameworks (e.g., PyTorch, TensorFlow).
- Familiarity with model tracing tools such as PyTorch Profiler, TensorBoard, DeepSpeed, or similar.
- Strong problem-solving skills and attention to detail in analyzing complex logs and metrics.
Preferred Qualifications :
- Experience with distributed training/inference and GPU performance optimization.
- Knowledge of systems profiling tools (e.g., NVIDIA Nsight, perf, flame graphs).
- Background in MLOps, observability, or AI infrastructure.