You will own the data pipeline that powers our LLM training and fine-tuning. This includes ingestion, cleaning, deduplication, and building high-quality datasets from structured and unstructured sources.

Responsibilities:

- Design ETL pipelines for text, PDFs, and structured data.

- Implement data deduplication, filtering (toxicity, PII), and normalization.

- Train and manage tokenizers (SentencePiece/BPE).

- Build datasets for supervised fine-tuning and evaluation.

- Work closely with domain experts to generate instruction/response pairs.

Requirements:

- Strong skills in Python, SQL, and data wrangling frameworks (Pandas, Spark).

- Experience cleaning and preprocessing large text datasets.

- Familiarity with NLP-specific preprocessing (chunking, embeddings).

- Knowledge of cloud data storage (S3/GCS/Blob).

- Bonus: prior experience with AI/ML pipelines.

Posted by

Rajgopal

HR at Transcend Digital


Posted in

Data Engineering

Functional Area

Data Engineering

Job Code

1556075
