
MLOps & LLMOps Engineer

Catalyst IQ
Multiple Locations
6 - 10 Years
4.4 (2+ Reviews)

Posted on: 13/11/2025

Job Description

We are looking for a highly skilled MLOps & LLMOps Engineer with strong expertise in deploying, automating, and monitoring AI/ML models, including Large Language Models (LLMs), in production environments. The ideal candidate will have hands-on experience with CI/CD automation, container orchestration, data pipelines, LangChain, and cloud deployment on Azure/AWS. You will collaborate with data scientists, ML engineers, and customer architects to ensure seamless end-to-end delivery of scalable, high-performing AI systems.


Key Responsibilities:


1. Model Deployment & Automation:


- Automate the full lifecycle of AI/ML model deployment, including packaging, orchestration, scaling, and rollout strategies.


- Implement automated workflows for data and model versioning and for experiment tracking, using tools such as MLflow.


- Deploy Large Language Models (LLMs) to production using frameworks such as LangChain, Flask, FastAPI, or custom microservices (a serving sketch follows this list).


- Containerize and orchestrate model services using Docker & Kubernetes, enabling highly available and fault-tolerant inference pipelines.
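
A minimal serving sketch in Python, assuming a Hugging Face-style text-generation pipeline behind FastAPI; the model name and endpoint path are illustrative placeholders, not a prescribed stack:

    # Illustrative LLM inference microservice; model and route names are
    # placeholders. Run with `uvicorn app:app`, then containerize for Kubernetes.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    generator = pipeline("text-generation", model="distilgpt2")  # placeholder model

    class Prompt(BaseModel):
        text: str
        max_new_tokens: int = 64

    @app.post("/generate")
    def generate(prompt: Prompt):
        out = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
        return {"completion": out[0]["generated_text"]}

Wrapping this service in a Dockerfile and a Kubernetes Deployment with multiple replicas is what provides the highly available, fault-tolerant behavior described above.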


2. CI/CD & Infrastructure Automation:


- Build and maintain robust CI/CD pipelines using Git, Jenkins, GitHub Actions, or GitLab CI for continuous integration, testing, and deployment of ML solutions (a smoke-test sketch follows this list).


- Implement infrastructure-as-code (IaC) for automated provisioning of cloud resources (Terraform or equivalent).


- Automate deployment workflows for API endpoints, microservices, feature stores, and data processing pipelines.
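
One hedged example of the "testing" stage such a pipeline might run, written as a pytest smoke test against the service sketched above; the `app` module name is an assumption about project layout:

    # Smoke test a CI job (GitHub Actions, Jenkins, etc.) could run before
    # promoting a build; endpoint and payload mirror the serving sketch above.
    from fastapi.testclient import TestClient
    from app import app  # hypothetical module name for the service

    client = TestClient(app)

    def test_generate_returns_completion():
        resp = client.post("/generate", json={"text": "Hello", "max_new_tokens": 8})
        assert resp.status_code == 200
        assert "completion" in resp.json()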


3. Data Pipelines & Real-Time Processing:


- Design, deploy, and manage data ingestion and processing pipelines using Airflow, Kafka, and RabbitMQ (an example DAG follows this list).


- Ensure reliable, scalable, and secure data pipelines that support both training and inference workflows.


- Optimize data freshness, batch scheduling, and streaming performance for high-throughput model operations.
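
As a concrete illustration, a minimal Airflow 2.x-style DAG for a daily feature-refresh job might look like the following; the DAG id and task bodies are hypothetical placeholders:

    # Hypothetical daily feature-refresh DAG; task logic is stubbed out.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # pull raw events from the source system (placeholder)

    def transform_and_load():
        ...  # build and publish features for training/inference (placeholder)

    with DAG(
        dag_id="feature_refresh",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="transform_and_load",
                                   python_callable=transform_and_load)
        extract_task >> load_task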


4. LLM & Foundation Model Operations:


- Integrate and operationalize foundation model APIs such as OpenAI, Anthropic, Gemini, and Cohere.


- Deploy custom or fine-tuned LLMs (GPT, Llama, Mistral, etc.) using LangChain or custom inference frameworks.


- Implement prompt management, evaluation, caching, vector store integrations, and retrieval-augmented generation (RAG) pipelines (a minimal RAG sketch follows this list).


- Ensure high performance, low latency, and reliability of LLM-based production systems.
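
To make the RAG pattern concrete, here is a framework-agnostic sketch; embed() uses a toy hash-based vector so the example runs end to end, and complete() stands in for a foundation-model API call (both are assumptions, not a specific SDK):

    # Toy retrieval-augmented generation loop; embed() and complete() are
    # stand-ins for a real embedding model and LLM API.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(text)) % (2**32))  # toy embedding
        v = rng.standard_normal(64)
        return v / np.linalg.norm(v)

    def complete(prompt: str) -> str:
        return f"[LLM answer grounded in {len(prompt)} chars of prompt]"  # stub

    def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
        q = embed(query)
        return sorted(docs, key=lambda d: -float(np.dot(embed(d), q)))[:k]

    def answer(query: str, docs: list[str]) -> str:
        context = "\n\n".join(retrieve(query, docs))
        return complete(f"Answer using only this context:\n{context}\n\nQ: {query}")

A production version would swap embed() for a real embedding endpoint, back retrieve() with a vector store (Pinecone, FAISS, etc.), and add prompt caching and evaluation around complete().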


5. Cloud Deployment & Infrastructure Management:


- Deploy ML workloads on Azure or AWS using services such as Kubernetes (AKS/EKS), Lambda, EC2, S3/ADLS, API Gateway, and Azure Functions (an artifact-upload sketch follows this list).


- Monitor and optimize infrastructure cost, performance, and scalability for ML and LLM systems.


- Collaborate with customer architects to define, plan, and execute end-to-end deployments and solution architectures.
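
One small example of this kind of workflow: publishing a versioned model artifact to S3 with boto3 so an EKS deployment can pull it at startup. The bucket and key names are hypothetical:

    # Upload a packaged model artifact to S3; all names are placeholders.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="model.tar.gz",           # packaged model artifact
        Bucket="my-ml-artifacts",          # hypothetical bucket
        Key="models/llm/v1/model.tar.gz",  # versioned key enables rollback
    )

A versioned key layout like this is one simple way to support the rollout and rollback strategies described earlier.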


6. Monitoring, Observability & Performance Optimization:


- Implement and maintain observability stacks for model performance monitoring, including:

  • Latency, throughput, and drift detection
  • Model accuracy and quality metrics
  • Resource utilization and autoscaling behavior

- Use tools like Prometheus, Grafana, ELK, Datadog, or cloud-native monitoring solutions (an instrumentation sketch follows this section).


- Troubleshoot production issues and perform root cause analysis across models, pipelines, and infrastructure.
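
As one concrete pattern, an inference service can expose Prometheus metrics directly via prometheus_client; the metric names and the sleep standing in for model inference are illustrative:

    # Expose request count and latency for Prometheus to scrape.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("inference_requests_total", "Total inference requests")
    LATENCY = Histogram("inference_latency_seconds", "Inference latency (s)")

    def predict(x):
        REQUESTS.inc()
        with LATENCY.time():   # records duration into the histogram
            time.sleep(0.01)   # placeholder for real model inference
            return 0

    if __name__ == "__main__":
        start_http_server(8000)  # serves /metrics on port 8000
        while True:
            predict(None)

Grafana dashboards and drift detectors can then read from the same scrape endpoint, keeping latency, throughput, and quality signals in one place.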


Required Skills & Qualifications:


- Strong hands-on experience in MLOps, production ML workflows, and automation.


- Expertise in CI/CD tools (Git, Jenkins, GitHub Actions, GitLab CI).


- Strong experience with Docker and Kubernetes for model containerization and deployment.


- Practical knowledge of MLflow, LangChain, and experiment tracking/versioning systems.


- Experience with Airflow, Kafka, and RabbitMQ for large-scale data workflow orchestration.


- Experience working with foundation model APIs (OpenAI, Anthropic, etc.).


- Hands-on deployment experience on Azure and/or AWS cloud platforms.


- Familiarity with performance monitoring tools (Prometheus, Grafana, Datadog, CloudWatch, etc.).


- Solid understanding of distributed systems, microservices, and cloud-native architectures.


- Strong communication, analytical, and debugging skills.


- Ability to work in fast-paced environments and manage complex deployments.


Preferred (Nice-to-Have):


- Knowledge of vector databases (Pinecone, Weaviate, FAISS, Chroma).


- Experience with RAG pipelines, semantic search, embeddings, or LLM orchestration frameworks.


- Exposure to model optimization techniques such as quantization, distillation, or low-latency inference optimization.


- Hands-on experience with Terraform, Helm, or ArgoCD.


- Experience with GPU-based deployments and optimization in cloud platforms.
