Posted on: 06/10/2025
Description:
Role Overview:
You will own the data pipeline powering our LLM training and fine-tuning. This includes ingestion, cleaning, deduplication, and building high-quality datasets from structured/unstructured sources.
Responsibilities:
- Design ETL pipelines for text, PDFs, and structured data.
- Implement data deduplication, filtering (toxicity, PII), and normalization.
- Train and manage tokenizers (SentencePiece/BPE).
- Build datasets for supervised fine-tuning and evaluation.
- Work closely with domain experts to generate instruction/response pairs.
Requirements:
- Experience cleaning and preprocessing large text datasets.
- Familiarity with NLP-specific preprocessing (chunking, embeddings).
- Knowledge of cloud data storage (S3, GCS, Azure Blob Storage).
- Bonus: Prior experience with AI/ML pipelines.
Posted in: Data Engineering
Functional Area: Data Engineering
Job Code: 1556075