Posted on: 29/07/2025
Primary Responsibilities:
- Design and implement scalable infrastructure for ML and LLM model pipelines using tools like Kubernetes, Docker, and cloud services such as AWS (e.g., AWS Batch, Fargate, Bedrock)
- Manage auto-scaling mechanisms to handle varying workloads and ensure high availability of REST APIs
- Automate CI/CD pipelines and Lambda functions for model testing, deployment, and updates, reducing manual errors and improving efficiency
- Build end-to-end ML workflow automation with Amazon SageMaker Pipelines, optimizing orchestration with AWS Step Functions (see the sketch after this list)
- Set up reproducible workflows for data preparation, model training, and deployment.
- Provision and optimize cloud resources (e.g., GPUs, memory) to meet the computational demands of large models, such as those used in RAG systems
- Use Infrastructure-as-Code (IaC) tools like Terraform to standardize provisioning and deployments
- Automate retraining workflows to keep models updated as data evolves
- Work closely with data scientists, ML engineers, and DevOps teams to integrate models into production environments
- Implement monitoring tools to track model performance and detect issues like drift or degradation in real time, with dashboards and real-time alerts for pipeline failures or performance issues
- Implement model observability frameworks
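For illustration, a minimal sketch of the kind of SageMaker Pipelines definition this role involves (the training script, S3 path, and IAM role ARN are hypothetical placeholders, not details from this posting):

```python
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import TrainingStep

session = PipelineSession()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

# Estimator wrapping a hypothetical training script
estimator = SKLearn(
    entry_point="train.py",          # hypothetical script
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    role=role,
    sagemaker_session=session,
)

# A single training step; a production pipeline would add data-preparation,
# evaluation, and model-registration steps for reproducibility
step_train = TrainingStep(
    name="TrainModel",
    step_args=estimator.fit({"train": "s3://example-bucket/train/"}),  # hypothetical path
)

pipeline = Pipeline(name="ExamplePipeline", steps=[step_train], sagemaker_session=session)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # kick off one execution
```

A pipeline like this can then be triggered on a schedule (e.g., from EventBridge or Step Functions) to automate the retraining workflows mentioned above.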
Key Skills:
- Experience with AWS services such as Lambda, Bedrock, Batch with Fargate, RDS (PostgreSQL), DynamoDB, SQS, CloudWatch, API Gateway, SageMaker
- Expertise in containerization (Docker and Kubernetes) for consistent deployments, and in orchestration tools like Airflow, ArgoCD, and Kubeflow
- Experience with CI/CD tools (e.g., Jenkins, GitLab CI/CD) and IaC tools like Terraform
- Knowledge of ML frameworks (e.g., PyTorch, TensorFlow) to understand model requirements during deployment
- Experience with REST API frameworks like FastAPI and Flask (a minimal serving sketch follows this list)
- Familiarity with model observability tools like Evidently, NannyML, and Phoenix, and with monitoring tools like Grafana (drift-detection sketch below)
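As a minimal sketch of the REST serving layer described above, using FastAPI (the scoring logic is a placeholder; a real service would load and call an actual model):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

@app.get("/healthz")
def health() -> dict:
    # Liveness/readiness endpoint for Kubernetes probes and load balancers
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # Placeholder scoring; a real service would call model.predict(...)
    score = sum(req.features) / max(len(req.features), 1)
    return {"score": score}
```

Run with, e.g., uvicorn main:app. Behind a Kubernetes HorizontalPodAutoscaler or AWS auto-scaling group, the /healthz endpoint lets the orchestrator replace unhealthy replicas, which is what keeps the auto-scaled APIs highly available.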
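And a sketch of the drift detection the observability bullets describe, using Evidently's DataDriftPreset (API as of the 0.4.x releases; the reference and current DataFrames are stand-ins for training and live data):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Stand-in data: reference = training distribution, current = live traffic
reference = pd.DataFrame({"feature_a": [0.1, 0.2, 0.3], "feature_b": [1.0, 1.1, 0.9]})
current = pd.DataFrame({"feature_a": [0.8, 0.9, 1.0], "feature_b": [2.0, 2.1, 1.9]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # results can feed dashboards and alerts
```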