Job Description

We are looking for an experienced MLOps Engineer to build, automate, and maintain end-to-end machine learning pipelines and production environments.

The ideal candidate has strong experience with ML model deployment, workflow orchestration, CI/CD automation, cloud platforms, and scalable architecture for real-time or batch ML systems.

You will work closely with data scientists, ML engineers, and DevOps teams to ensure models are efficiently deployed, monitored, optimized, and continuously improved.

Key Responsibilities :

ML Pipeline Development & Automation :

- Build and manage scalable ML pipelines for data preparation, training, validation, and deployment.

- Create automated workflows using tools like Kubeflow, MLflow, Airflow, Vertex AI Pipelines, or SageMaker Pipelines.

- Implement versioning of datasets, models, and experiments.
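
As a rough illustration of the pipeline and versioning responsibilities above, the sketch below logs hyperparameters, a dataset fingerprint, metrics, and the model artifact to MLflow. The paths, run name, and toy model are assumptions for illustration, not requirements of the role.

```python
# Minimal sketch, assuming a reachable MLflow tracking server and a toy sklearn model.
import hashlib
from pathlib import Path

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

DATA_PATH = Path("data/train.csv")  # hypothetical dataset location


def dataset_fingerprint(path: Path) -> str:
    """Hash the raw dataset so each run records exactly which data it saw."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:16]


with mlflow.start_run(run_name="rf-baseline"):
    X, y = load_iris(return_X_y=True)

    # Version the inputs: hyperparameters plus a dataset hash recorded as a tag.
    mlflow.log_param("n_estimators", 200)
    if DATA_PATH.exists():
        mlflow.set_tag("dataset_sha256", dataset_fingerprint(DATA_PATH))

    model = RandomForestClassifier(n_estimators=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Version the model artifact itself; it can later be promoted via a registry.
    mlflow.sklearn.log_model(model, artifact_path="model")
```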

Model Deployment & Serving :

- Deploy ML models on cloud environments (AWS/GCP/Azure) or on-prem.

- Implement real-time model serving using Docker, Kubernetes, KServe, TorchServe, TensorFlow Serving, or FastAPI.

- Develop APIs for inference and integrate models into production systems.
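
A minimal FastAPI inference endpoint of the kind described above might look like the following sketch; the artifact path, request schema, and single-prediction shape are illustrative assumptions.

```python
# Minimal sketch: a REST inference endpoint; auth, batching, and input checks omitted.
from typing import List

import joblib  # assumes the trained model was serialized with joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="model-inference")

# Hypothetical artifact path; in production this would come from a model registry.
model = joblib.load("artifacts/model.joblib")


class PredictRequest(BaseModel):
    features: List[float]


class PredictResponse(BaseModel):
    prediction: float


@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    """Run a single prediction on one feature vector."""
    y = model.predict([request.features])[0]
    return PredictResponse(prediction=float(y))
```

A service like this is typically run with uvicorn, packaged with Docker, and deployed behind Kubernetes or KServe.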

CI/CD for ML (Continuous Integration & Continuous Delivery) :

- Build automated CI/CD pipelines for model training, packaging, and deployment (an illustrative promotion-gate step is sketched after this list).

- Ensure safe rollouts with canary deployments, A/B tests, and rollback strategies.

- Maintain Git-based workflows for code, model, and pipeline updates.
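
As one hedged example of the safe-rollout work above, a CI job can gate promotion on evaluation metrics; the metric names, environment variables, and threshold below are assumptions for illustration.

```python
# Minimal sketch: a CI gate that blocks promotion unless the candidate model
# clearly beats the production baseline. Metric values are assumed to be passed
# in by the pipeline as environment variables.
import os
import sys

MIN_IMPROVEMENT = 0.002  # hypothetical minimum gain required to promote


def main() -> int:
    candidate = float(os.environ["CANDIDATE_AUC"])
    production = float(os.environ["PRODUCTION_AUC"])

    if candidate < production + MIN_IMPROVEMENT:
        print(f"Blocking rollout: candidate AUC {candidate:.4f} "
              f"does not beat production AUC {production:.4f}")
        return 1  # non-zero exit fails the CI job and leaves the current model live

    print(f"Promoting candidate: AUC {candidate:.4f} vs production {production:.4f}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```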

Monitoring, Observability & Maintenance :

- Implement end-to-end monitoring for model performance, data quality, drift detection, and serving metrics such as latency and error rates.

- Set up logging and alerting using Prometheus, Grafana, ELK/EFK, or CloudWatch.

- Automate model retraining triggers based on performance thresholds or data drift.
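
A simple form of the drift check and retraining trigger mentioned above is sketched below using a per-feature two-sample KS test; the threshold and synthetic data are assumptions, and the retraining hook is left as a comment.

```python
# Minimal sketch: per-feature drift check with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01  # hypothetical cut-off for flagging a feature as drifted


def drifted_features(reference: np.ndarray, live: np.ndarray) -> list[int]:
    """Compare each feature column of live traffic against the training reference."""
    flagged = []
    for i in range(reference.shape[1]):
        _, p_value = ks_2samp(reference[:, i], live[:, i])
        if p_value < P_VALUE_THRESHOLD:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(5000, 3))        # stand-in for training-time data
    live = reference + np.array([0.0, 0.0, 0.8])  # third feature has shifted

    drifted = drifted_features(reference, live)
    if drifted:
        print(f"Drift detected in features {drifted}; triggering retraining pipeline")
        # e.g. trigger the orchestrator here (Airflow DAG run, Kubeflow pipeline, ...)
```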

Infrastructure Management :

- Build and maintain cloud-based ML infrastructure (compute, storage, networking).

- Work with IaC tools like Terraform, CloudFormation, or Pulumi (see the Pulumi sketch after this list).

- Optimize resource usage, GPU allocation, and cost efficiency.
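
As a small, hedged example of the IaC work above, a Pulumi (Python) program can declare ML infrastructure in code; the resource names and tags below are illustrative assumptions.

```python
# Minimal sketch: Pulumi (Python) resources for ML artifacts and serving images.
# Assumes AWS credentials and a configured Pulumi stack.
import pulumi
import pulumi_aws as aws

# Object storage for model artifacts and pipeline outputs.
artifact_bucket = aws.s3.Bucket(
    "ml-artifacts",
    tags={"team": "mlops", "purpose": "model-artifacts"},
)

# Container registry for model-serving images.
serving_repo = aws.ecr.Repository("model-serving")

pulumi.export("artifact_bucket_name", artifact_bucket.id)
pulumi.export("serving_repo_url", serving_repo.repository_url)
```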

Collaboration & Documentation :

- Work closely with data scientists to productionize notebooks and prototype models.

- Convert experimental code into scalable, maintainable components (see the refactoring sketch after this list).

- Document workflows, architecture, pipeline steps, and best practices.
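
The sketch below illustrates, under assumed paths and column names, the kind of refactor involved: inline notebook code becomes a typed, parameterized, importable component.

```python
# Minimal sketch: notebook-style inline loading code refactored into a typed,
# testable component. The dataset path and column name are hypothetical.
from dataclasses import dataclass

import pandas as pd


@dataclass
class TrainingConfig:
    data_path: str
    target_column: str


def load_training_frame(config: TrainingConfig) -> tuple[pd.DataFrame, pd.Series]:
    """Load the training table and split it into features and target."""
    frame = pd.read_csv(config.data_path)
    features = frame.drop(columns=[config.target_column])
    target = frame[config.target_column]
    return features, target


if __name__ == "__main__":
    config = TrainingConfig(data_path="data/train.csv", target_column="label")
    X, y = load_training_frame(config)
    print(f"Loaded {len(X)} rows with {X.shape[1]} features")
```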

ML Governance, Versioning & Security :

- Implement model registries (MLflow, SageMaker Model Registry, Vertex AI Model Registry); an illustrative MLflow example follows this list.

- Ensure compliance with security, PII handling, privacy, and governance policies.

- Manage secrets, credentials, and secure access for ML systems.
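
A hedged MLflow example of the registry and secrets-handling items above: the tracking URI and run ID come from the environment rather than source control, and the model name and tag are assumptions. Other registries (SageMaker, Vertex AI) offer analogous APIs.

```python
# Minimal sketch: register a trained model version and attach governance metadata.
# Connection details and identifiers are read from the environment, not hard-coded.
import os

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

run_id = os.environ["RUN_ID"]  # produced by the training pipeline
model_uri = f"runs:/{run_id}/model"

# Create a new registered version tied to the originating run.
version = mlflow.register_model(model_uri=model_uri, name="churn-classifier")

# Record governance metadata on the version (e.g. that PII handling was reviewed).
client = MlflowClient()
client.set_model_version_tag(
    name="churn-classifier",
    version=version.version,
    key="pii_reviewed",
    value="true",
)
```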

Required Skills & Qualifications :

Technical Skills :

- Strong understanding of ML lifecycle, model deployment, and production ML.

- Proficiency in Python, ML frameworks (PyTorch, TensorFlow, Scikit-learn).

- Hands-on experience with Docker, Kubernetes, and Helm charts.

- Experience with MLflow, Kubeflow, Airflow, Jenkins, GitHub Actions, or Azure DevOps.

- Cloud experience with AWS (SageMaker, ECS/EKS), GCP (Vertex AI), or Azure ML.

- Knowledge of monitoring tools, REST/GraphQL APIs, and microservices.

- Familiarity with feature stores (Feast, Tecton) is a plus.

Soft Skills :

- Strong problem-solving and analytical mindset.

- Excellent collaboration with data science, data engineering, and DevOps teams.

- Clear communication and documentation abilities.

- Ability to work independently and handle fast-paced environments.

Preferred Qualifications :

- Experience with GPU-based training and model optimization.

- Exposure to data engineering tools (Spark, Kafka, Databricks).

- Familiarity with distributed training frameworks (Horovod, DeepSpeed).

- Prior experience deploying LLMs or other deep learning models.
