Posted on: 29/07/2025
Primary Responsibilities:
- Design and implement scalable infrastructure for ML and LLM model pipelines using tools like Kubernetes, Docker, and cloud services such as AWS (e.g., AWS Batch, Fargate, Bedrock)
- Manage auto-scaling mechanisms to handle varying workloads and ensure high availability of REST APIs
- Automate CI/CD pipelines and Lambda functions for model testing, deployment, and updates, reducing manual errors and improving efficiency
- Build end-to-end ML workflow automation with Amazon SageMaker Pipelines, optimizing orchestration with AWS Step Functions (see the sketch after this list)
- Set up reproducible workflows for data preparation, model training, and deployment.
- Provision and optimize cloud resources (e.g., GPUs, memory) to meet the computational demands of large models, such as those used in RAG systems
- Use Infrastructure-as-Code (IaC) tools like Terraform to standardize provisioning and deployments
- Automate retraining workflows to keep models updated as data evolves
- Work closely with data scientists, ML engineers, and DevOps teams to integrate models into production environments
- Implement monitoring tools to track model performance and detect issues like drift or degradation in real time, with dashboards and real-time alerts for pipeline failures or performance issues
- Implement model observability frameworks
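For illustration, a minimal sketch of the kind of SageMaker Pipelines definition this role involves (the training script, S3 path, and IAM role ARN are hypothetical placeholders, not details from this posting):

```python
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import TrainingStep

session = PipelineSession()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

# Estimator wrapping a hypothetical training script
estimator = SKLearn(
    entry_point="train.py",          # hypothetical script
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    role=role,
    sagemaker_session=session,
)

# A single training step; a production pipeline would add data-preparation,
# evaluation, and model-registration steps for reproducibility
step_train = TrainingStep(
    name="TrainModel",
    step_args=estimator.fit({"train": "s3://example-bucket/train/"}),  # hypothetical path
)

pipeline = Pipeline(name="ExamplePipeline", steps=[step_train], sagemaker_session=session)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # kick off one execution
```

A pipeline like this can then be triggered on a schedule (e.g., from EventBridge or Step Functions) to automate the retraining workflows mentioned above.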
Key Skills:
- Experience with AWS services such as Lambda, Bedrock, Batch with Fargate, RDS (PostgreSQL), DynamoDB, SQS, CloudWatch, API Gateway, SageMaker
- Expertise in containerization (Docker and Kubernetes) for consistent deployments, and in orchestration tools like Airflow, ArgoCD, and Kubeflow
- Experience with CI/CD tools (e.g., Jenkins, GitLab CI/CD) and IaC tools like Terraform
- Knowledge of ML frameworks (e.g., PyTorch, TensorFlow) to understand model requirements during deployment
- Experience with REST API frameworks like FastAPI and Flask (a minimal serving sketch follows this list)
- Familiarity with model observability tools like Evidently, NannyML, and Phoenix, and with monitoring tools like Grafana (drift-detection sketch below)
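As a minimal sketch of the REST serving layer described above, using FastAPI (the scoring logic is a placeholder; a real service would load and call an actual model):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

@app.get("/healthz")
def health() -> dict:
    # Liveness/readiness endpoint for Kubernetes probes and load balancers
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # Placeholder scoring; a real service would call model.predict(...)
    score = sum(req.features) / max(len(req.features), 1)
    return {"score": score}
```

Run with, e.g., uvicorn main:app. Behind a Kubernetes HorizontalPodAutoscaler or AWS auto-scaling group, the /healthz endpoint lets the orchestrator replace unhealthy replicas, which is what keeps the auto-scaled APIs highly available.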
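And a sketch of the drift detection the observability bullets describe, using Evidently's DataDriftPreset (API as of the 0.4.x releases; the reference and current DataFrames are stand-ins for training and live data):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Stand-in data: reference = training distribution, current = live traffic
reference = pd.DataFrame({"feature_a": [0.1, 0.2, 0.3], "feature_b": [1.0, 1.1, 0.9]})
current = pd.DataFrame({"feature_a": [0.8, 0.9, 1.0], "feature_b": [2.0, 2.1, 1.9]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # results can feed dashboards and alerts
```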