Posted on: 10/12/2025
Job Title : MLOps Engineer
Experience : 5+ Years
Location : Remote
Interview Mode : Fully Virtual
Job Type : Contract / Full-time
Job Description :
We are seeking an experienced MLOps Engineer to design, build, and manage scalable machine learning infrastructure and deployment pipelines for AI solutions. The ideal candidate will have hands-on experience in GPU cluster development, model lifecycle management, and deployment of large-scale neural networks. This role requires expertise in orchestrating end-to-end AI and large language model (LLM) applications in production environments.
Key Responsibilities :
ML Infrastructure & Cluster Management :
- Design, implement, and manage GPU clusters for training and inference of machine learning models.
- Optimize resource utilization and performance of large-scale ML workloads.
- Ensure high availability, reliability, and scalability of ML infrastructure.
Model Development & Deployment :
- Manage the entire model lifecycle, including development, training, validation, deployment, monitoring, and retraining.
- Deploy large-scale neural network models to production environments with efficiency and reliability.
- Implement best practices for continuous integration and continuous deployment (CI/CD) in ML workflows.
Orchestration & AI Solutions :
- Architect and deploy end-to-end AI solutions, integrating ML pipelines with data engineering and application layers.
- Orchestrate LLM applications and ensure optimal performance for large-scale inference.
- Collaborate with data scientists, ML engineers, and software engineers to deploy models into production.
Monitoring & Optimization :
- Monitor model performance, resource utilization, and infrastructure health in production.
- Identify bottlenecks and implement optimization strategies for training and inference pipelines.
- Ensure reproducibility, scalability, and robustness of ML models in production.
Required Skills & Qualifications :
- 5+ years of experience in MLOps, AI infrastructure, or related roles.
- Hands-on experience with GPU cluster development and management.
- Strong understanding of model lifecycle management and end-to-end AI solutions.
- Experience deploying large-scale neural network models to production.
- Familiarity with orchestrating LLM applications.
- Proficiency with ML frameworks such as PyTorch, TensorFlow, or JAX.
- Strong programming and scripting skills (Python, Bash, or similar).
- Experience with CI/CD pipelines and ML workflow orchestration and lifecycle tools (e.g., Kubeflow, MLflow, Airflow).
Preferred Attributes :
- Experience with cloud-based ML infrastructure (AWS, GCP, Azure) and distributed training.
- Knowledge of containerization and orchestration (Docker, Kubernetes).
- Strong problem-solving, analytical, and debugging skills.
- Excellent communication skills and ability to work in a fully remote setup.