Posted on: 18/08/2025
We're looking for someone who's as comfortable building ML pipelines as they are optimizing infrastructure for scale. If you thrive on solving real-world data challenges, love experimenting, and don't shy away from getting your hands dirty with deployment, this is your jam.
Responsibilities :
- Apply a strong understanding of machine learning principles and algorithms, with a focus on LLMs such as GPT-4, BERT, and similar architectures.
- Leverage deep learning frameworks like TensorFlow, PyTorch, or Keras to train and fine-tune LLMs.
- Utilize deep knowledge of computer architecture, especially GPUs, to maximize utilization and efficiency.
- Work with cloud platforms (AWS, Azure, GCP) to manage and optimize resources for training large-scale deep learning models.
- Use containerization and orchestration tools (Docker, Kubernetes) for scalable and reproducible ML deployments.
- Apply principles of parallel and distributed computing, including distributed training for deep learning models.
- Work with big data and distributed computing technologies (Hadoop, Spark) to handle large-volume datasets.
- Implement MLOps practices and use related tools to manage the complete ML lifecycle.
- Contribute to the infrastructure side of multiple ML projects, particularly those involving Transformer-based models such as BERT.
- Manage resources and optimize performance for large-scale ML workloads, both on-premise and in the cloud.
- Handle challenges in training large models, including memory management, optimizing data loading, and troubleshooting hardware issues.
- Collaborate closely with data scientists and ML engineers to understand infrastructure needs and deliver efficient solutions.
Requirements :
- Strong knowledge of machine learning and deep learning algorithms, especially LLMs.
- Proficiency in Python and deep learning frameworks (TensorFlow, PyTorch, Keras).
- Expertise in GPU architecture and optimization.
- Experience with parallel and distributed computing concepts.
- Hands-on with containerization (Docker) and orchestration (Kubernetes).
Tech Stack and Tools :
- Cloud : AWS, Azure, GCP.
- Big Data : Hadoop, Spark.
- MLOps Tools : MLflow, Kubeflow, or similar.
- Infrastructure Optimization : Resource allocation, distributed training, GPU performance tuning.
Nice-to-Have :
- Prior experience training large-scale deep learning models (e.g., BERT or other Transformer architectures).
- Exposure to high-scale environments and large datasets.
- Ability to troubleshoot hardware bottlenecks and optimize data pipelines.