Posted on: 03/03/2026
Note : Women Candidates Preferred
You will engineer infrastructure as code, CI/CD for ML and LLM applications, secure model serving, observability, and runtime cost/performance optimization, partnering closely with Data Scientists, AI Product Owners, and Platform/DevOps teams.
Ideal candidates will have 5 to 8 years of production experience with ML platforms (e.g., SageMaker, Azure ML, Databricks) and expertise in Kubernetes-based model serving and GitOps automation. You will champion reliability (SLOs/SLIs), compliance, and automation-first practices across the ML lifecycle.
Key Responsibilities :
- Design, develop, and document Infrastructure as Code (Terraform) for ML/LLM platform components on AWS/Databricks, implement secure, scalable foundations for data, compute, networking, and secrets.
- Build and maintain GitHub-based pipelines (Actions/Workflows) for training, packaging, validation, and deployment of ML/LLM assets (models, evaluation suites, prompts, policies), using GitOps for environment promotion.
- Containerize models using Docker and deploy them primarily through managed endpoints (SageMaker/Azure ML); Kubernetes-based serving (KServe/Triton/Seldon) is a plus.
- Operate model registries and feature stores; enforce versioning, lineage, and artifact governance via MLflow/Databricks and cloud native services.
- Implement logs/metrics/traces, performance profiling, and drift/quality monitors; define SLIs/SLOs and on-call runbooks; drive incident response and post-mortems with accountability (business-hours support rotation).
- Embed DevSecOps: secrets management, IAM/RBAC, vulnerability scanning, image signing, policy as code, least privilege access, backup/DR/resiliency patterns; align with enterprise security standards.
- Operationalize GenAI: prompt/content safety filters, evaluation harnesses (human-in-the-loop), grounding/attribution logging, token cost and latency tracking, and red-teaming pipelines integrated into CI/CD.
- Monitor and optimize compute/storage/bandwidth and inference costs; implement right-sizing, autoscaling, and caching strategies.
- Partner with Data Scientists to productize models; co-design platform features with stakeholders; deliver documentation, templates, and knowledge transfers that accelerate safe reuse.
- Run operations (RUN): Troubleshoot escalations, improve monitoring, automate administration/IRP tasks, and continuously harden reliability, performance, and security across environments.
Required Skills & Qualifications :
Technical Experience :
- Understanding of DevOps concepts such as reference implementation enforcement, use of shared DevOps stacks, infrastructure optimization (performance, cost, HA, resiliency), release management (GitOps best practices), and QA automation frameworks.
- Strong knowledge of AWS ecosystems and Databricks integration.
- Proficiency in Terraform for developing, testing, and maintaining Infrastructure-as-Code to manage cloud services for ML engineering.
- Hands-on experience with CI/CD using GitHub, GitHub Actions, and workflow automation to support continuous integration, delivery, and deployment of ML assets.
- Strong experience with Docker; Kubernetes is a plus.
- MLflow (tracking/registry), model registries, feature stores, experiment tracking, and lineage management; Databricks and cloud native equivalents.
- Build pipelines for training, testing (unit/integration/e2e), evaluation, and deployment.
- Experience designing or contributing to infrastructure, application, and performance monitoring (logs, metrics, dashboards) and supporting observability strategies.
- Ability to produce efficient, maintainable code in Python; experience troubleshooting and extending Python based services.
Consulting Experience :
- Proven track record in an IT consulting environment, engaging with large enterprises and MNCs in strategic data solutioning projects.
- Experience working with enterprise stakeholders in platform adoption, requirement clarification, effort sizing, and change management for ML platform rollouts.
Leadership & Soft Skills :
- Strong collaboration and communication across Delivery and RUN.
- Excellent communication, documentation, and presentation skills.
- Strong problem-solving, analytical thinking, and strategic vision.
Educational Qualifications :
- Bachelor's or Master's degree in Computer Science, Engineering, or a related quantitative field.
Preferred Certifications :
- AWS DevOps Engineer Professional
- AWS Certified Machine Learning Specialty (or Azure DevOps Engineer Expert)
- CKA (Certified Kubernetes Administrator), HashiCorp Terraform Associate
What We're Looking For :
- Self-starters who are highly motivated, ambitious, and eager to challenge the status quo.
- Builders who combine scientific rigor with pragmatic engineering and can balance accuracy, latency, and cost.
- Effective leaders who collaborate openly, freely share knowledge, and elevate team performance.
- Straightforward, results-oriented individuals who value impact and accountability.
- Adaptable experts who stay on top of fast-evolving AI technologies and practices.
Why Join Us ?
- Opportunity to shape and build an AI product portfolio that delivers meaningful business impact for SE Regions.
- Work alongside a motivated and innovative team that values learning, ownership, and excellence.
- Thrive in a culture that challenges the status quo and embraces diverse perspectives.