Posted on: 13/12/2025
Description:
About the Role:
We are seeking a highly capable MLOps Engineer to manage, monitor, and operationalize machine learning models in production. The ideal candidate will be responsible for ensuring model uptime, monitoring model performance, troubleshooting failures, and redeploying ML pipelines efficiently. This role requires strong experience with AWS cloud services, CI/CD, containerization, and ML deployment frameworks.
Key Responsibilities:
- Continuously monitor production ML models for drift, latency, failures, and performance degradation.
- Investigate and troubleshoot issues in ML pipelines, data inconsistencies, model failures, and system errors.
- Redeploy models on AWS using automation workflows or on-demand processes when failures occur.
- Configure alerts, dashboards, and metrics for proactive monitoring.
- Deploy and manage ML models on AWS SageMaker, H2O, Docker containers, and custom model-serving environments.
- Build and maintain automated CI/CD pipelines for model training, testing, validation, and deployment using Jenkins or similar tools.
- Manage model versioning, rollback strategies, and reproducible deployments.
- Work with AWS services including EC2, ECR, EMR, Bedrock, S3, IAM, CloudWatch, and Lambda.
- Containerize training and inference workloads using Docker.
- Optimize infrastructure for scalability, cost, and reliability.
- Automate deployment workflows using CI/CD tools like Jenkins and Bitbucket pipelines.
- Implement best practices in code management, artifact storage, and configuration management.
- Ensure secure and compliant ML operations across cloud environments.
- Work closely with data scientists to productionize ML models, automate training workflows, and optimize inference pipelines.
- Collaborate with engineering, product, and analytics teams to resolve operational issues and enhance model lifecycle management.
Required Skills & Experience:
- 5-7 years of hands-on experience in MLOps, DevOps, or Machine Learning Engineering.
- Strong expertise in AWS SageMaker, H2O, and model deployment frameworks.
- Proficiency in AWS EC2, ECR, EMR, Bedrock, S3, IAM, VPC, and related AWS services.
- Strong command of Docker for building and managing containerized ML workloads.
- Experience with Bitbucket, Git workflows, and CI/CD tools like Jenkins.
- Strong scripting/programming skills (Python, Bash preferred).
- Experience in model monitoring tools, logging, APM solutions, and automated alerting systems.
- Ability to quickly diagnose and resolve ML pipeline failures.
- Strong understanding of model lifecycle, retraining, drift detection, and performance metrics.
- Excellent communication and cross-team collaboration skills.
- Detail-oriented, proactive, and capable of operating in fast-paced environments.
Good to Have:
- Experience with AWS Bedrock for GenAI workflows.
- Exposure to Kubernetes (EKS) or serverless ML deployments.
- Familiarity with H2O Driverless AI or AutoML platforms.
- Prior experience setting up fully automated retraining pipelines.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1589691