hirist

CodeVyasa - Artificial Intelligence Engineer - LLM

CODEVYASA (A unit of Shairiti New Private Limited)
3 - 6 Years
Bangalore

Posted on: 28/03/2026

Job Description

We're building Voxy - a deeply personalized AI companion that speaks, listens, and feels real. This isn't a ChatGPT wrapper. We train and fine-tune our own models, run our own inference stacks, and obsess over every millisecond of latency and every token of quality.

We're looking for an AI Engineer who has gone deep - someone who has read the Attention Is All You Need paper and implemented it, who knows why LoRA works at a mathematical level, and who gets uncomfortable when someone says "just call the OpenAI API."

If you've fine-tuned a model, shipped it to production, and then debugged why it degraded three weeks later - we want to talk.

What You'll Own :


- LLM Fine-tuning for Companion AI : SFT, RLHF, DPO pipelines on open-source base models (Gemma, LLaMA, Mistral family, etc.) to build Voxy's conversational personality engine

- Efficient Training : PEFT methods including LoRA, QLoRA, adapter layers; balancing quality vs. compute budget

- Text-to-Speech & Voice : Fine-tune TTS models (XTTS, Tortoise, StyleTTS2 or similar); work on prosody, emotion, and low-latency streaming voice output

- Text-to-Image : Fine-tune diffusion models (Stable Diffusion, FLUX) using DreamBooth, LoRA, ControlNet for companion avatar generation

- Quantization & Inference : GPTQ, AWQ, GGUF, bitsandbytes; optimize models for GPU and edge targets without killing quality

- Evaluation Systems : Build robust eval harnesses : automated metrics, human eval pipelines, regression detection, and model behavior monitoring

- Data Engineering : Own training data - curation, dedup, quality filtering, formatting.
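To make the PEFT bullet above concrete, here is a minimal NumPy sketch of the core LoRA idea (an illustration, not Voxy's actual training stack): a frozen weight matrix W gets a trainable low-rank update scaled by alpha/r, with the up-projection B zero-initialized so the adapted layer is exactly the base layer at step zero. All dimensions and the rank below are made-up example values.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 512, 512, 8, 16  # hypothetical layer size and LoRA rank

W = rng.normal(size=(d_out, d_in))     # frozen base weight (not updated)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x): base path plus low-rank update
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))

# Because B == 0 at initialization, the adapter is a no-op and the
# adapted layer matches the frozen base layer exactly.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r*(d_in + d_out) instead of d_in*d_out for full FT.
trainable = A.size + B.size
print(f"trainable fraction: {trainable / W.size:.3%}")  # 3.125% at rank 8
```

This is also why rank selection matters: the trainable-parameter budget grows linearly in r, while expressiveness of the update is capped at rank r.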

What We're Looking For :

Must-Haves :


- 3 - 6 years of experience, with at least 2 years directly in ML/AI model work (not just ML infra or MLOps)

- Hands-on fine-tuning experience on transformer-based LLMs - not just running training scripts, but understanding why hyperparameters matter

- Deep understanding of attention mechanisms - multi-head attention, RoPE, GQA, Flash Attention - can reason about tradeoffs, not just use them

- Practical experience with PEFT : LoRA rank selection, target modules, merging strategies, catastrophic forgetting mitigation

- Strong grasp of quantization : INT4/INT8, calibration, quality-compute tradeoffs across GPTQ/AWQ/GGUF

- Experience with model evaluation : BLEU/ROUGE are a floor, not a ceiling; you've built task-specific evals

- Data-first mindset : Can write complex SQL, has built data pipelines for training, understands data quality deeply

- Strong DSA fundamentals : Can pass a coding round at a top product company; writes efficient, production-quality Python
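The quantization must-have can be illustrated with a toy symmetric per-tensor INT8 scheme in NumPy. This is only a sketch of the underlying idea; production schemes like GPTQ and AWQ add per-group scales, calibration data, and error compensation on top of it. The matrix size is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(256, 256)).astype(np.float32)  # toy weight matrix

def quantize_int8(w):
    # Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Worst-case rounding error of this scheme is half a quantization step.
max_err = np.abs(w - w_hat).max()
assert max_err <= scale / 2 + 1e-6
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

The quality-compute tradeoff in the bullet falls out of `scale`: a single outlier weight inflates the step size for the whole tensor, which is exactly what per-channel and per-group variants are designed to avoid.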

Good to Have :


- Prior work on voice models - TTS fine-tuning, voice cloning, vocoder pipelines (HiFi-GAN, BigVGAN), streaming inference

- Experience with diffusion models - training dynamics, classifier-free guidance, LoRA for personalization

- Familiarity with multi-modal architectures - how vision encoders interface with LLMs (LLaVA, Idefics, etc.)

- Contributed to open-source ML projects or published papers/technical blogs with real traction

- Experience with distributed training - DDP, FSDP, DeepSpeed ZeRO stages
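Of the good-to-haves, classifier-free guidance is simple enough to show as arithmetic: the sampler extrapolates from the unconditional noise prediction toward the conditional one by a guidance scale g. The tensors below are made-up toy values; only the formula is the real technique.

```python
import numpy as np

g = 7.5  # hypothetical guidance scale (a common default in diffusion samplers)

eps_uncond = np.array([0.1, -0.2, 0.3])  # toy unconditional noise prediction
eps_cond   = np.array([0.2,  0.0, 0.1])  # toy conditional noise prediction

# Classifier-free guidance: eps = eps_uncond + g * (eps_cond - eps_uncond)
eps = eps_uncond + g * (eps_cond - eps_uncond)

# Sanity checks: g = 1 recovers the conditional prediction,
# g = 0 the unconditional one; g > 1 extrapolates past the conditional.
assert np.allclose(eps_uncond + 1.0 * (eps_cond - eps_uncond), eps_cond)
assert np.allclose(eps_uncond + 0.0 * (eps_cond - eps_uncond), eps_uncond)
print(eps)  # [ 0.85  1.3  -1.2 ]
```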

Proven Track Record - This Is Non-Negotiable :


You must be able to show at least one of the following :

- A fine-tuned model you shipped to production with measurable impact (latency, quality, cost)

- An open-source project, HuggingFace model card, or technical writeup that demonstrates depth - not just a notebook

- A prior role where you owned an AI model end-to-end, from data to deployment

- Research/academic work in NLP, speech, or generative models with real results