Posted on: 29/01/2026
Description :
Role Overview :
We are looking for a Machine Learning Engineer with strong MLOps, platform monitoring, and production ML deployment experience. This role focuses on ensuring high availability, performance, and reliability of ML platforms and models deployed using Domino Data Lab, while also building automated ML pipelines and operational tooling.
The ideal candidate will work closely with Data Scientists, ML Engineers, Platform Teams, and Vendors to ensure seamless model lifecycle management, from development through production to continuous monitoring.
Key Responsibilities :
1. Platform Monitoring & Reliability :
- Continuously monitor Domino Data Lab platform uptime, health, and availability
- Track performance metrics for real-time and batch ML endpoints, including :
i. Availability
ii. Latency
iii. Throughput
- Maintain and enhance Grafana dashboards for :
i. Platform metrics
ii. Model deployment metrics
iii. Resource utilization (CPU, GPU, memory, network)
- Monitor computational performance, model drift, and model aging
- Track changes in dependencies such as :
i. Data versions
ii. Feature sets
iii. Software and library upgrades
- Ensure proper incident logging, auditability, and observability
2. ML Platform Operations & Incident Management :
- Act as first responder for ML platform-related incidents
- Log and manage incidents using ServiceNow
- Perform :
i. Incident triage
ii. Root Cause Analysis (RCA)
iii. Resolution and preventive actions
- Document post-incident reports and drive long-term stability improvements
- Coordinate with Domino Data Lab support teams for platform-level issues
- Deploy, manage, and maintain ML models in production environments
- Handle Domino user onboarding and access management as per SOPs
3. MLOps & Engineering Development :
- Design, build, and maintain end-to-end automated ML pipelines
- Implement CI/CD workflows for ML models :
i. Dev → Staging → Production
ii. Automated testing and validation
- Enable Continuous Training (CT) and experimentation frameworks
- Build shared MLOps tools, libraries, and utilities to accelerate model development
- Implement automation for :
i. Model lineage
ii. Audit trails
iii. Approval and governance workflows
- Integrate model monitoring and alerting into deployment pipelines
- Collaborate closely with the following teams to ensure a smooth handoff from experimentation to production :
i. Data Scientists
ii. ML Engineers
iii. Platform & Infra teams
Required Skills & Experience :
- Strong experience as a Machine Learning Engineer / MLOps Engineer
- Hands-on experience with Domino Data Lab (mandatory or strongly preferred)
- Experience deploying and managing ML models in production
- Strong understanding of ML lifecycle management
- Experience with :
i. CI/CD pipelines for ML
ii. Automated training and retraining workflows
- Proficiency in monitoring tools such as Grafana
- Experience with incident management tools (ServiceNow preferred)
- Strong understanding of compute resource optimization (CPU, GPU, memory)
Technical Skills :
- MLOps & ML Platforms : Domino Data Lab
- Monitoring & Observability : Grafana
- CI/CD : Jenkins / GitLab CI / similar
- Cloud & Infra : Containers, Kubernetes (preferred)
- Programming : Python (mandatory)
- Version Control : Git
- Incident Management : ServiceNow
Nice to Have :
- Experience with model drift detection and performance monitoring
- Exposure to governance, audit, and compliance frameworks
- Experience in regulated industries (Banking, Pharma, Healthcare)
- Knowledge of data versioning tools and feature stores
Soft Skills :
- Strong troubleshooting and problem-solving skills
- Ability to work in high-availability production environments
- Excellent documentation and communication skills
- Strong collaboration mindset across engineering and data teams