Description :
Core Responsibilities :
The Principal MLOps Engineer will focus on operationalizing and managing ML workflows at scale :
- ML Platform Development : Design, implement, and maintain robust, scalable, and secure MLOps platforms and Continuous Integration/Continuous Delivery (CI/CD) pipelines specifically tailored for machine learning models.
- Cloud Infrastructure Management : Leverage deep expertise in AWS to provision, configure, and manage the infrastructure necessary for high-volume ML model training, serving, and deployment.
- Workflow Automation : Develop sophisticated automation scripts and tools to streamline the ML lifecycle, including environment provisioning, resource scaling, automated testing, and zero-downtime deployments.
- Monitoring & Observability : Implement monitoring that tracks ML model performance, detects data drift, and verifies infrastructure health, keeping production ML services reliable with tools such as Amazon CloudWatch and Amazon SageMaker Model Monitor.
- Security & Compliance : Ensure all MLOps practices adhere to enterprise security standards and regulatory compliance requirements.
- Collaboration & Mentorship : Work closely with data scientists and software engineers to move models from development to production, and provide technical leadership and guidance on MLOps best practices.
Technical Skills : AWS & Cloud :
Deep experience in AWS cloud services, specifically :
- Amazon SageMaker : Hands-on experience across the service, including training, deployment, SageMaker Pipelines, and Model Monitor.
- Core AWS Services : S3, Lambda, Step Functions, ECR (Elastic Container Registry), and CloudWatch.
Technical Skills : Automation & Deployment :
- CI/CD Tools : Experience implementing and managing CI/CD pipelines using tools like Jenkins, GitHub Actions, or AWS CodeBuild/CodePipeline.
- Containerization : Hands-on expertise with Docker for creating portable and reproducible environments.
- Orchestration : Experience with container orchestration using Amazon ECS (Elastic Container Service) or Amazon EKS (Elastic Kubernetes Service).
Technical Skills : ML Lifecycle :
- Lifecycle Understanding : Comprehensive understanding of the end-to-end machine learning lifecycle, including feature engineering, model training, deployment, and monitoring.
- Versioning & Tracking : Familiarity with data versioning and model tracking tools (e.g., DVC, MLflow).
Non-Technical Skills :
- Collaboration : Strong communication skills and a proven record of cross-functional collaboration with data science, engineering, and business teams.