Posted on: 05/08/2025
We're Hiring : MLOps Engineer
Location : Pune
Experience : 7+ Years
Notice Period : Immediate to Short-Notice Joiners Preferred
Role Type : Full-Time | On-site / Hybrid (based on company policy)
About the Role :
We are seeking a highly skilled MLOps Engineer to support our rapidly growing AI/ML initiatives, including Generative AI platforms, agentic AI systems, and large-scale model deployments. This role blends advanced DevOps practices with modern AI infrastructure, requiring deep experience in managing cloud-native environments, ML model pipelines, and distributed compute systems.
You'll work alongside AI researchers, data scientists, and software engineers to streamline model development and deployment across cloud platforms, with a strong emphasis on reliability, scalability, and automation.
Key Responsibilities :
- Design, build, and maintain CI/CD pipelines for machine learning model training and deployment workflows (including RAG and LLM systems).
- Manage and optimize Kubernetes clusters with GPU-based compute for distributed ML workloads (see the brief sketch after this list).
- Automate infrastructure provisioning and management using Terraform on GCP (preferred), AWS, or Azure.
- Deploy and manage components such as vector databases, feature stores, and observability tools (e.g., Prometheus, Grafana, ELK).
- Ensure security, resilience, and high availability of all AI workloads and underlying infrastructure.
- Collaborate with AI/ML engineers, data scientists, and platform teams to integrate and deploy end-to-end AI solutions.
- Enable agentic AI workflows with tools like LangChain, LangGraph, CrewAI, and others.
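As a small, illustrative sketch of the kind of day-to-day task described above, the snippet below reports allocatable GPUs per cluster node. It assumes the official `kubernetes` Python client, a reachable kubeconfig, and the NVIDIA device plugin (which exposes the `nvidia.com/gpu` resource); the function name is made up for the example.
```python
# Illustrative sketch: report allocatable GPUs per node in a cluster.
# Assumes the official `kubernetes` Python client, a valid kubeconfig,
# and the NVIDIA device plugin exposing the "nvidia.com/gpu" resource.
from kubernetes import client, config


def gpu_allocatable_by_node() -> dict[str, int]:
    """Return a mapping of node name -> allocatable GPU count."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
    v1 = client.CoreV1Api()
    report = {}
    for node in v1.list_node().items:
        gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
        report[node.metadata.name] = int(gpus)
    return report


if __name__ == "__main__":
    for name, count in gpu_allocatable_by_node().items():
        print(f"{name}: {count} allocatable GPU(s)")
```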
Required Skills & Experience :
- 7+ years in DevOps, MLOps, or Infrastructure Engineering roles.
- 4+ years of experience in cloud-native development, including at least 2 years in AI/ML-related environments.
- Proficient in Python and scripting languages like Bash.
- Hands-on experience with CI/CD tools such as Jenkins, Harness, GitHub Actions, or ArgoCD.
- Deep understanding of Kubernetes, especially GPU orchestration and autoscaling.
- Experience managing infrastructure on GCP (preferred), AWS, or Azure.
- Strong skills in Terraform for Infrastructure as Code (IaC).
- Sound knowledge of monitoring, logging, and security best practices in ML production environments.
Nice-to-Have (Bonus) Skills :
- Familiarity with MLOps platforms like MLflow, Kubeflow, SageMaker, or Vertex AI.
- Experience with RAG (Retrieval-Augmented Generation) pipelines and prompt engineering.
- Exposure to model fine-tuning, drift detection, and rollback strategies.
- Working knowledge of containerization tools like Docker and container security scanning.
Posted in : DevOps / SRE
Functional Area : ML / DL Engineering
Job Code : 1524958