Data Engineer - Python/SQL/ETL

Transcend Digital
Chennai
4 - 7 Years

Posted on: 06/10/2025

Job Description

Role Overview :

You will own the data pipeline that powers our LLM training and fine-tuning. This includes ingestion, cleaning, and deduplication, as well as building high-quality datasets from structured and unstructured sources.

Responsibilities :

- Design ETL pipelines for text, PDFs, and structured data.

- Implement data deduplication, filtering (toxicity, PII), and normalization.

- Train and manage tokenizers (SentencePiece/BPE).

- Build datasets for supervised fine-tuning and evaluation.

- Work closely with domain experts to generate instruction/response pairs.

Requirements :


- Strong skills in Python, SQL, and data wrangling frameworks (Pandas, Spark).

- Experience cleaning and preprocessing large text datasets.

- Familiarity with NLP-specific preprocessing (chunking, embeddings).

- Knowledge of cloud data storage (S3, GCS, or Azure Blob Storage).

- Bonus : Prior experience in AI/ML pipelines.

