Posted on: 19/12/2025
We are looking for an experienced MLOps Engineer to build, automate, and maintain end-to-end machine learning pipelines and production environments.
The ideal candidate has strong experience with ML model deployment, workflow orchestration, CI/CD automation, cloud platforms, and scalable architecture for real-time or batch ML systems.
You will work closely with data scientists, ML engineers, and DevOps teams to ensure models are efficiently deployed, monitored, optimized, and continuously improved.
Key Responsibilities:
ML Pipeline Development & Automation:
- Build and manage scalable ML pipelines for data preparation, training, validation, and deployment.
- Create automated workflows using tools like Kubeflow, MLflow, Airflow, Vertex AI Pipelines, or SageMaker Pipelines.
- Implement versioning of datasets, models, and experiments.
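Dataset and experiment versioning, as described above, often reduces to content-addressing: hashing a dataset file yields a stable version tag that gets logged next to the run's parameters. A minimal sketch of that idea (the `dataset_version` and `log_experiment` helpers and the file names are illustrative assumptions, not the API of any specific tool):

```python
import hashlib
import json
from pathlib import Path

def dataset_version(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a short content hash that uniquely identifies a dataset file."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]

def log_experiment(run_dir: Path, params: dict, data_path: Path) -> Path:
    """Record hyperparameters plus the dataset version so the run is reproducible."""
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"params": params, "dataset_version": dataset_version(data_path)}
    out = run_dir / "run.json"
    out.write_text(json.dumps(record, indent=2))
    return out

# Example with a throwaway dataset file.
data = Path("train.csv")
data.write_text("x,y\n1,2\n")
log_experiment(Path("runs/run-001"), {"lr": 0.01}, data)
```

Tools like MLflow and DVC provide the same guarantee at scale; the point is that any change to the data produces a new version tag automatically.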
Model Deployment & Serving:
- Deploy ML models on cloud environments (AWS/GCP/Azure) or on-prem.
- Implement real-time model serving using Docker, Kubernetes, KServe, TorchServe, TensorFlow Serving, or FastAPI.
- Develop APIs for inference and integrate models into production systems.
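In practice, the inference API above is a thin, well-validated function around `model.predict`, which a framework such as FastAPI or KServe then exposes over HTTP. A framework-free sketch of that inference layer (the `DummyModel`, the payload field names, and the version string are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class DummyModel:
    """Stand-in for a real trained model; predicts a weighted sum."""
    weights: tuple = (0.5, 1.5)

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

def infer(model, payload: dict) -> dict:
    """Validate a JSON-style payload and return a prediction envelope."""
    features = payload.get("features")
    if not isinstance(features, list) or not all(
        isinstance(x, (int, float)) for x in features
    ):
        return {"error": "payload must contain a numeric 'features' list"}
    return {"prediction": model.predict(features), "model_version": "v1"}

model = DummyModel()
result = infer(model, {"features": [2.0, 1.0]})  # prediction: 0.5*2 + 1.5*1 = 2.5
```

Keeping validation and the prediction envelope separate from the web framework makes the same `infer` function reusable across serving stacks.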
CI/CD for ML (Continuous Integration & Delivery):
- Build automated CI/CD pipelines for model training, packaging, and deployment.
- Ensure safe rollouts with canary deployments, A/B tests, and rollback strategies.
- Maintain Git-based workflows for code, model, and pipeline updates.
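A canary rollout like the one mentioned above is often implemented as a deterministic traffic split: hashing a stable request identifier buckets each request so the same caller always hits the same model version, and the canary receives roughly the configured share. A small sketch of that routing logic (the function name and the 10% default are illustrative assumptions):

```python
import hashlib

def route_version(request_id: str, canary_weight: float = 0.1) -> str:
    """Deterministically route a request to 'canary' or 'stable'.

    Hashing the request id yields a stable bucket in [0, 1), so the same
    request always sees the same version and the overall split stays
    close to canary_weight.
    """
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket / 10_000 < canary_weight else "stable"

# Roughly 10% of traffic should land on the canary.
routes = [route_version(f"req-{i}") for i in range(10_000)]
canary_share = routes.count("canary") / len(routes)
```

If the canary's error rate or latency degrades, rollback is just flipping `canary_weight` to zero, which is what makes this pattern safe for model releases.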
Monitoring, Observability & Maintenance:
- Implement end-to-end monitoring of model performance, prediction drift, and data quality.
- Set up logging and alerting with tools such as Prometheus, Grafana, ELK/EFK, or CloudWatch.
- Automate model retraining triggers based on performance thresholds or data drift.
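One common drift signal behind such retraining triggers is the population stability index (PSI) between a training baseline and live data, with PSI > 0.2 a widely used rule of thumb for "significant shift". A self-contained sketch, assuming both samples are 1-D numeric lists (the helper names and threshold default are illustrative):

```python
import math

def psi(baseline, live, bins=10):
    """Population stability index between two 1-D samples."""
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0], edges[-1] = float("-inf"), float("inf")  # catch out-of-range live values

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        # Floor each fraction to avoid log(0) on empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    base_frac, live_frac = frac(baseline), frac(live)
    return sum((lv - b) * math.log(lv / b) for b, lv in zip(base_frac, live_frac))

def should_retrain(baseline, live, threshold=0.2) -> bool:
    """Fire a retraining trigger when drift crosses the threshold."""
    return psi(baseline, live) > threshold

baseline = [i / 100 for i in range(100)]       # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # distribution shifted upward
```

In production this check typically runs on a schedule over recent inference logs, and the trigger kicks off a retraining pipeline rather than retraining inline.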
Infrastructure Management:
- Build and maintain cloud-based ML infrastructure (compute, storage, networking).
- Work with IaC tools like Terraform, CloudFormation, or Pulumi.
- Optimize resource usage, GPU allocation, and cost efficiency.
Collaboration & Documentation:
- Work closely with data scientists to productionize notebooks and prototype models.
- Convert experimental code into scalable, maintainable components.
- Document workflows, architecture, pipeline steps, and best practices.
ML Governance, Versioning & Security:
- Implement model registries (MLflow, SageMaker Model Registry, Vertex AI Model Registry).
- Ensure compliance with security, PII handling, privacy, and governance policies.
- Manage secrets, credentials, and secure access for ML systems.
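Registries such as MLflow's track model versions and a lifecycle stage per version (e.g. Staging → Production), with at most one version serving production traffic. An in-memory sketch of that governance idea (the class, method, and stage names are simplified assumptions, not the MLflow API):

```python
from dataclasses import dataclass, field

STAGES = ("None", "Staging", "Production", "Archived")

@dataclass
class ModelRegistry:
    """Minimal registry: versioned models, each version with a lifecycle stage."""
    _models: dict = field(default_factory=dict)  # name -> {version: stage}

    def register(self, name: str) -> int:
        versions = self._models.setdefault(name, {})
        version = len(versions) + 1
        versions[version] = "None"
        return version

    def transition(self, name: str, version: int, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        # Demote any current Production version so only one serves traffic.
        if stage == "Production":
            for v, s in self._models[name].items():
                if s == "Production":
                    self._models[name][v] = "Archived"
        self._models[name][version] = stage

    def production_version(self, name: str):
        for v, s in self._models[name].items():
            if s == "Production":
                return v
        return None

registry = ModelRegistry()
v1 = registry.register("churn-model")
v2 = registry.register("churn-model")
registry.transition("churn-model", v1, "Production")
registry.transition("churn-model", v2, "Production")  # v1 is auto-archived
```

The single-Production invariant is what makes rollbacks auditable: promoting a version is an explicit, recorded transition rather than an ad-hoc file swap.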
Required Skills & Qualifications:
Technical Skills:
- Strong understanding of ML lifecycle, model deployment, and production ML.
- Proficiency in Python, ML frameworks (PyTorch, TensorFlow, Scikit-learn).
- Hands-on experience with Docker, Kubernetes, and Helm charts.
- Experience with MLflow, Kubeflow, Airflow, Jenkins, GitHub Actions, or Azure DevOps.
- Cloud experience with AWS (SageMaker, ECS/EKS), GCP (Vertex AI), or Azure ML.
- Knowledge of monitoring tools, APIs, REST/GraphQL, and microservices.
- Familiarity with feature stores (Feast, Tecton) is a plus.
Soft Skills:
- Strong problem-solving and analytical mindset.
- Excellent collaboration with data science, data engineering, and DevOps teams.
- Clear communication and documentation abilities.
- Ability to work independently and handle fast-paced environments.
Preferred Qualifications:
- Experience with GPU-based training and model optimization.
- Exposure to data engineering tools (Spark, Kafka, Databricks).
- Familiarity with distributed training frameworks (Horovod, DeepSpeed).
- Prior experience deploying LLMs or other deep learning models.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1592552