Posted on: 07/10/2025
1. Monitoring:
- Continuously track Domino platform uptime, resource utilization, and health.
- Monitor availability, latency, and throughput of real-time and batch endpoints.
- Maintain dashboards (Grafana) for platform and deployment metrics. Ensure proper incident logging for audit and troubleshooting.
- Monitor resource utilization (CPU, GPU, memory, network traffic), computational performance, and model aging.
- Keep track of changes to dependencies, such as data versions or software upgrades.
2. Operations:
- Use ServiceNow for incident logging. Act as first responder for platform-related incidents. Perform triage, root cause analysis (RCA), and resolution for outages or performance issues. Document the RCA and drive preventive measures.
- Coordinate with Domino Data Lab for platform support.
- Deploy and maintain ML models in production environments.
- Onboard Domino users as per SOPs.
3. MLOps-Related Development:
- Design, build, and maintain automated ML pipelines that incorporate CI/CD workflows and rapid deployment of models (Dev to Staging to Prod), as well as continuous training (CT) and experimentation.
- Build and maintain shared tools/utilities to accelerate model development.
- Build automation for audit trails, model lineage, and approval workflows. Integrate model monitoring into deployment workflows.
- Collaborate with data scientists and engineers to ensure a smooth handoff from model development to production. Basic knowledge of ServiceNow for ticket management is expected.
Posted in: DevOps / SRE
Functional Area: ML / DL Engineering
Job Code: 1556046