Posted on: 27/05/2025
Job Summary :
We are seeking a skilled ML Ops Engineer to support and enhance our machine learning operations infrastructure. In this role, you will be responsible for monitoring production services, troubleshooting issues, and collaborating with teams to improve automation and system reliability. You will play a critical role in ensuring seamless model deployment, performance, and integration within our ML platform.
Key Responsibilities :
- Monitor support channels and incident queues to proactively identify and address operational issues.
- Investigate and resolve issues reported by automation systems, alerts, or customer feedback.
- Maintain and support online production services for serving ML models, ensuring high availability and performance.
- Collaborate with engineering teams to automate processes and improve operational efficiency.
- Gain a deep understanding of ML platform capabilities and integrations, providing technical insights to enhance system reliability.
- Identify recurring issues and provide feedback to ML platform engineers for continuous improvements.
- Contribute to documentation efforts, ensuring clarity and accuracy for internal teams and stakeholders.
Required Skills & Qualifications :
- Bachelor's or master's degree in computer science or a related field.
- 3 years of relevant experience in Python programming.
- Programming : Proficiency in Python (Mandatory).
- ML Infrastructure : Hands-on experience with Databricks, Tecton, and ML concepts (model deployment, feature engineering, monitoring).
- DevOps & Automation : Strong knowledge of Kubernetes, Jenkins, and GitHub for CI/CD pipelines and infrastructure automation.
- Cloud Computing : Expertise in AWS services related to ML Ops.
- Version Control & Monitoring : Experience with GitHub Actions, observability tools, and system monitoring frameworks.
- Problem-Solving & Communication : Strong analytical skills, the ability to debug production issues, and effective communication with cross-functional teams.
Preferred Qualifications :
- Experience working with large-scale distributed ML systems.
- Knowledge of Terraform for infrastructure as code.
- Experience with logging and observability best practices for ML models in production.
If you are excited about building scalable ML infrastructure and driving automation, we'd love to hear from you! Apply now to be part of our team.
Posted in: AI/ML
Functional Area: ML / DL Engineering
Job Code: 1486242