Posted on: 03/03/2026
Note : Women Candidates Preferred
You will engineer infrastructure as code, CI/CD for ML and LLM applications, secure model serving, observability, and runtime cost/performance optimization, partnering closely with Data Scientists, AI Product Owners, and Platform/DevOps teams.
Ideal candidates will have 5 to 8 years of production experience with ML platforms (e.g., SageMaker, Azure ML, Databricks) and expertise in Kubernetes-based model serving and GitOps automation. You will champion reliability (SLOs/SLIs), compliance, and automation-first practices across the ML lifecycle.
Key Responsibilities :
- Design, develop, and document Infrastructure as Code (Terraform) for ML/LLM platform components on AWS/Databricks, implement secure, scalable foundations for data, compute, networking, and secrets.
- Build and maintain GitHub-based pipelines (Actions/Workflows) for training, packaging, validation, and deployment of ML/LLM assets (models, evaluation suites, prompts, policies), using GitOps for environment promotion.
- Containerize models using Docker and deploy them primarily through managed endpoints (SageMaker/Azure ML); Kubernetes-based serving (KServe/Triton/Seldon) is a plus.
- Operate model registries and feature stores; enforce versioning, lineage, and artifact governance via MLflow/Databricks and cloud native services.
- Implement logs/metrics/traces, performance profiling, and drift/quality monitors; define SLIs/SLOs and on-call runbooks; drive incident response and post-mortems with accountability (business-hours support rotation).
- Embed DevSecOps: secrets management, IAM/RBAC, vulnerability scanning, image signing, policy as code, least privilege access, backup/DR/resiliency patterns; align with enterprise security standards.
- Operationalize GenAI: prompt/content safety filters, evaluation harnesses (human-in-the-loop), grounding/attribution logging, token cost and latency tracking, and red-teaming pipelines integrated into CI/CD.
- Monitor and optimize compute/storage/bandwidth and inference costs; implement right-sizing, autoscaling, and caching strategies.
- Partner with Data Scientists to productize models; co-design platform features with stakeholders; deliver documentation, templates, and knowledge transfers that accelerate safe reuse.
- Run operations (RUN): Troubleshoot escalations, improve monitoring, automate administration/IRP tasks, and continuously harden reliability, performance, and security across environments.
Required Skills & Qualifications :
Technical Experience :
- Understanding of DevOps concepts such as reference implementation enforcement, use of shared DevOps stacks, infrastructure optimization (performance, cost, HA, resiliency), release management (GitOps best practices), and QA automation frameworks.
- Strong knowledge of AWS ecosystems and Databricks integration.
- Proficiency in Terraform for developing, testing, and maintaining Infrastructure-as-Code to manage cloud services for ML engineering.
- Hands-on experience with CI/CD using GitHub, GitHub Actions, and workflow automation to support continuous integration, delivery, and deployment of ML assets.
- Strong experience with Docker; Kubernetes is a plus.
- MLflow (tracking/registry), model registries, feature stores, experiment tracking, and lineage management; Databricks and cloud native equivalents.
- Build pipelines for training, testing (unit/integration/e2e), evaluation, and deployment.
- Experience designing or contributing to infrastructure, application, and performance monitoring (logs, metrics, dashboards) and supporting observability strategies.
- Ability to produce efficient, maintainable code in Python; experience troubleshooting and extending Python based services.
Consulting Experience :
- Proven track record in an IT consulting environment, engaging with large enterprises and MNCs in strategic data solutioning projects.
- Experience working with enterprise stakeholders in platform adoption, requirement clarification, effort sizing, and change management for ML platform rollouts.
Leadership & Soft Skills :
- Strong collaboration and communication across Delivery and RUN.
- Excellent communication, documentation, and presentation skills.
- Strong problem-solving, analytical thinking, and strategic vision.
Educational Qualifications :
- Bachelor's or Master's degree in Computer Science, Engineering, or a related quantitative field.
Preferred Certifications :
- AWS DevOps Engineer Professional
- AWS Certified Machine Learning Specialty (or Azure DevOps Engineer Expert)
- CKA (Certified Kubernetes Administrator), HashiCorp Terraform Associate
What We're Looking For :
- Self-starters who are highly motivated, ambitious, and eager to challenge the status quo.
- Builders who combine scientific rigor with pragmatic engineering and can balance accuracy, latency, and cost.
- Effective leaders who collaborate openly, freely share knowledge, and elevate team performance.
- Straightforward, results-oriented individuals who value impact and accountability.
- Adaptable experts who stay on top of fast-evolving AI technologies and practices.
Why Join Us ?
- Opportunity to shape and build an AI product portfolio that delivers meaningful business impact for SE Regions.
- Work alongside a motivated and innovative team that values learning, ownership, and excellence.
- Thrive in a culture that challenges the status quo and embraces diverse perspectives.