Posted on: 19/03/2026
Description:
- Design, fine-tune, and validate LLMs for production use cases: Instruction tuning, supervised fine-tuning, and parameter-efficient tuning (LoRA/adapters).
- Implement retrieval-augmented generation (RAG) pipelines: Embeddings, vector search, chunking, and context assembly for high-recall responses (see the retrieval sketch after this list).
- Optimize inference for latency and cost: Quantization, model pruning, batching, and deployment with optimized runtimes (CUDA, Triton, bitsandbytes where applicable).
- Build backend services and APIs to serve LLM inference and orchestration using containerized deployments (Docker/Kubernetes) and CI/CD pipelines.
- Collaborate with product, data engineering, and ML teams to integrate LLMs into production flows, monitor model performance, and set up automated retraining/rollbacks.
- Create reproducible training pipelines, implement evaluation suites, and produce documentation and runbooks for model governance and observability.
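To make the RAG responsibility above concrete, here is a minimal retrieval sketch under stated assumptions: the embedding model, chunk size, corpus, and query below are placeholders chosen for illustration and are not specified by this posting. It covers the listed components only: chunking, embedding, FAISS vector search, and context assembly.

```python
# Minimal RAG retrieval sketch (illustrative only). The embedding model,
# chunking parameters, corpus, and query are placeholder assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text, size=512, overlap=64):
    """Split text into overlapping character windows (naive chunking)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

docs = ["...enterprise document text..."]          # placeholder corpus
chunks = [c for d in docs for c in chunk(d)]

# Embed chunks and build a cosine-similarity index
# (inner product over L2-normalized vectors).
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = encoder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

def retrieve(query, k=5):
    """Return the top-k chunks most similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0] if i != -1]

# Context assembly: concatenate retrieved chunks into the prompt sent to the LLM.
question = "How do I rotate an API key?"            # placeholder query
context = "\n\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In production, the flat index and naive character chunking would typically be swapped for a managed vector database and token-aware chunking, but the pipeline stages stay the same.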
Skills & Qualifications:
Must-Have:
- 4+ years of hands-on experience working with LLMs or advanced NLP models in production contexts.
- Proficiency in Python for ML engineering and model development.
- Experience with PyTorch and Hugging Face Transformers for training and fine-tuning.
- Practical experience implementing RAG and vector search using tools like FAISS or similar vector databases.
- Familiarity with LangChain (or equivalent orchestration) and integration with LLM APIs (OpenAI, Anthropic, etc.).
- Experience containerizing and deploying ML services using Docker; familiarity with Kubernetes is a plus.
Preferred:
- Experience with inference optimizations: quantization (bitsandbytes), Triton, or GPU-accelerated serving (a quantized-loading sketch follows this list).
- Exposure to distributed training frameworks (DeepSpeed) and cloud MLOps platforms (SageMaker, Azure ML, GCP AI Platform).
- Knowledge of monitoring, logging, and model-evaluation frameworks for production LLMs (MLflow, Prometheus, Grafana).
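As a concrete illustration of the quantization item above, the sketch below loads a model in 4-bit using Hugging Face Transformers with bitsandbytes. The checkpoint name, prompt, and configuration values are assumptions for illustration, not requirements of this role.

```python
# Illustrative 4-bit quantized model loading with Transformers + bitsandbytes.
# The checkpoint name, prompt, and config values are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/accuracy
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # place layers across available GPUs
)

inputs = tokenizer("Summarize our refund policy.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```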
Benefits & Culture Highlights:
- Collaborative, engineering-driven culture with a strong focus on ownership and rapid iteration.
- Opportunity to build end-to-end LLM products for enterprise clients and influence architecture decisions.
- On-site role with hands-on access to GPU infrastructure and cross-functional product teams.
Skills: pytorch, cuda, docker, python, agentic, llm