We are looking for a highly skilled MLOps & LLM Ops Engineer with strong expertise in deploying, automating, and monitoring AI/ML models, including Large Language Models (LLMs), in production environments. The ideal candidate will have hands-on experience with CI/CD automation, container orchestration, data pipelines, LangChain, and cloud deployment on Azure/AWS. You will collaborate with data scientists, ML engineers, and customer architects to ensure seamless end-to-end delivery of scalable, high-performing AI systems.
Key Responsibilities:
1. Model Deployment & Automation:
- Automate the full lifecycle of AI/ML model deployment, including packaging, orchestration, scaling, and rollout strategies.
- Implement automated workflows for data and model versioning and experiment tracking using MLflow or similar systems.
- Deploy Large Language Models (LLMs) to production using frameworks such as LangChain, Flask, FastAPI, or custom microservices (a minimal serving sketch follows this list).
- Containerize and orchestrate model services using Docker & Kubernetes, enabling highly available and fault-tolerant inference pipelines.
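For illustration, here is a minimal Python sketch of the serving pattern above: a FastAPI microservice loading a model from the MLflow registry. The model name "demo-model", version "1", and request schema are placeholder assumptions, not a prescribed design.

    # Minimal sketch: serve an MLflow-registered model behind FastAPI.
    # "demo-model" and version "1" are illustrative placeholders.
    import mlflow.pyfunc
    import pandas as pd
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    model = mlflow.pyfunc.load_model("models:/demo-model/1")

    class PredictRequest(BaseModel):
        records: list[dict]  # one feature-name -> value dict per row

    @app.post("/predict")
    def predict(req: PredictRequest) -> dict:
        frame = pd.DataFrame(req.records)
        preds = model.predict(frame)  # assuming an array-like result
        return {"predictions": preds.tolist()}

Run locally with uvicorn (e.g. uvicorn serve:app); in the containerized setup above, this service would typically sit behind a Kubernetes Service for scaling and failover.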
2. CI/CD & Infrastructure Automation:
- Build and maintain robust CI/CD pipelines using Git, Jenkins, GitHub Actions, or GitLab CI for continuous integration, testing, and deployment of ML solutions.
- Implement infrastructure-as-code (IaC) for automated provisioning of cloud resources (Terraform or equivalent; see the sketch after this list).
- Automate deployment workflows for API endpoints, microservices, feature stores, and data processing pipelines.
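Since the list above names "Terraform or equivalent", here is a minimal IaC sketch using Pulumi's Python SDK as one such equivalent; the resource names are illustrative assumptions, not a fixed layout.

    # Minimal IaC sketch (Pulumi Python SDK, a Terraform equivalent).
    # Resource names are illustrative assumptions.
    import pulumi
    import pulumi_aws as aws

    # Bucket for model artifacts produced by CI pipelines.
    artifacts = aws.s3.Bucket("model-artifacts")

    # Container registry for model-serving images built in CI.
    registry = aws.ecr.Repository("model-serving")

    pulumi.export("artifact_bucket", artifacts.id)
    pulumi.export("image_repo_url", registry.repository_url)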
3. Data Pipelines & Real-Time Processing:
- Design, deploy, and manage data ingestion and processing pipelines using Airflow, Kafka, and RabbitMQ (see the Airflow sketch after this list).
- Ensure reliable, scalable, and secure data pipelines that support both training and inference workflows.
- Optimize data freshness, batch scheduling, and streaming performance for high-throughput model operations.
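As an illustration of the orchestration work above, a minimal Airflow DAG sketch: a daily ingest feeding a feature-refresh task. The task bodies and storage path are placeholder assumptions.

    # Minimal Airflow DAG sketch; task bodies are placeholders.
    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def ingest_and_refresh():
        @task
        def ingest() -> str:
            # e.g. land the day's events from Kafka into object storage
            return "s3://raw/events/latest"  # illustrative path

        @task
        def refresh_features(raw_path: str) -> None:
            # recompute features used by training and online inference
            print(f"refreshing features from {raw_path}")

        refresh_features(ingest())

    ingest_and_refresh()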
4. LLM & Foundation Model Operations:
- Integrate and operationalize foundation model APIs such as OpenAI, Anthropic, Gemini, and Cohere.
- Deploy custom or fine-tuned LLMs (GPT, Llama, Mistral, etc.) using LangChain or custom inference frameworks.
- Implement prompt management, evaluation, caching, vector store integrations, and retrieval-augmented generation (RAG) pipelines (a minimal RAG sketch follows this list).
- Ensure high performance, low latency, and reliability of LLM-based production systems.
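For illustration, a minimal RAG sketch using the OpenAI Python client with a naive in-process prompt cache; retrieve() is a hypothetical stand-in for a real vector-store query, and the model name is an assumption.

    # Minimal RAG sketch; retrieve() is a hypothetical stand-in for a
    # vector-store query, and the model choice is illustrative.
    from functools import lru_cache

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def retrieve(query: str, k: int = 3) -> list[str]:
        # Assumption: a production version would query a vector store
        # (FAISS, Pinecone, etc.) for the k most similar chunks.
        return ["<retrieved context chunk>"] * k

    @lru_cache(maxsize=1024)  # naive prompt cache for repeated queries
    def answer(query: str) -> str:
        context = "\n\n".join(retrieve(query))
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {"role": "system",
                 "content": f"Answer using only this context:\n{context}"},
                {"role": "user", "content": query},
            ],
        )
        return resp.choices[0].message.content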
5. Cloud Deployment & Infrastructure Management:
- Deploy ML workloads in Azure or AWS using services like Kubernetes (AKS/EKS), Lambda, EC2, S3/ADLS, API Gateway, and Azure Functions (see the boto3 sketch after this list).
- Monitor and optimize infrastructure cost, performance, and scalability for ML and LLM systems.
- Collaborate with customer architects to define, plan, and execute end-to-end deployments and solution architectures.
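As a small concrete example of the cloud-deployment work above, a boto3 sketch that publishes a packaged model artifact to S3 for the serving layer to pick up; the bucket name and key layout are assumptions.

    # Minimal boto3 sketch: publish a model artifact to S3. The bucket
    # name and key layout are illustrative assumptions.
    import boto3

    s3 = boto3.client("s3")

    def publish_model(local_path: str, version: str) -> str:
        key = f"models/demo/{version}/model.tar.gz"  # hypothetical layout
        s3.upload_file(local_path, "ml-artifacts-bucket", key)
        return f"s3://ml-artifacts-bucket/{key}"

    # Usage: uri = publish_model("dist/model.tar.gz", "v3")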
6. Monitoring, Observability & Performance Optimization:
- Implement and maintain observability stacks for model performance monitoring, using tools like Prometheus, Grafana, ELK, Datadog, or cloud-native monitoring solutions (a minimal instrumentation sketch follows this list).
- Troubleshoot production issues and perform root cause analysis across models, pipelines, and infrastructure.
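For illustration, a minimal instrumentation sketch with the prometheus_client library, exposing a request counter and a latency histogram for an inference function; metric names and the placeholder model call are assumptions.

    # Minimal observability sketch with prometheus_client; metric names
    # and the placeholder model call are illustrative assumptions.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("inference_requests_total",
                       "Total inference requests", ["status"])
    LATENCY = Histogram("inference_latency_seconds",
                        "Inference latency in seconds")

    @LATENCY.time()
    def infer(payload: dict) -> float:
        REQUESTS.labels(status="ok").inc()
        return random.random()  # placeholder for a real model call

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes /metrics on :8000
        while True:
            infer({"x": 1})
            time.sleep(1)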
Required Skills & Qualifications:
- Strong hands-on experience in MLOps, production ML workflows, and automation.
- Expertise in CI/CD tools (Git, Jenkins, GitHub Actions, GitLab CI).
- Strong experience with Docker and Kubernetes for model containerization and deployment.
- Practical knowledge of MLflow, LangChain, and experiment tracking/versioning systems.
- Experience with Airflow, Kafka, and RabbitMQ for large-scale data workflow orchestration.
- Experience working with foundation model APIs (OpenAI, Anthropic, etc.).
- Hands-on deployment experience on Azure and/or AWS cloud platforms.
- Familiarity with performance monitoring tools (Prometheus, Grafana, Datadog, CloudWatch, etc.).
- Solid understanding of distributed systems, microservices, and cloud-native architectures.
- Strong communication, analytical, and debugging skills.
- Ability to work in fast-paced environments and manage complex deployments.
Preferred (Nice-to-Have):
- Knowledge of vector databases (Pinecone, Weaviate, FAISS, Chroma); see the FAISS sketch after this list.
- Experience with RAG pipelines, semantic search, embeddings, or LLM orchestration frameworks.
- Exposure to model optimization techniques such as quantization, distillation, or low-latency inference optimization.
- Hands-on experience with Terraform, Helm, or ArgoCD.
- Experience with GPU-based deployments and optimization on cloud platforms.
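For the vector-database item above, a minimal FAISS sketch over random embeddings; the dimensionality and data are illustrative assumptions.

    # Minimal FAISS vector-search sketch; dimensions and data are
    # illustrative assumptions.
    import faiss
    import numpy as np

    dim = 384  # e.g. a small sentence-embedding size
    index = faiss.IndexFlatL2(dim)  # exact L2 nearest-neighbour search

    vectors = np.random.rand(1000, dim).astype("float32")
    index.add(vectors)  # index 1,000 document embeddings

    query = np.random.rand(1, dim).astype("float32")
    distances, ids = index.search(query, 5)  # 5 nearest neighbours
    print(ids[0])  # row ids of the closest stored vectors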