
Job Description

Data Engineer - Multi-source ETL & GenAI Pipelines (3+ Years)

Roles and Responsibilities:


- Build and maintain scalable, fault-tolerant data pipelines to support GenAI and analytics workloads across OCR output, documents, and case data.

- Manage ingestion and transformation of semi-structured legal documents (PDF, Word, Excel) into structured formats.

- Enable retrieval-augmented generation (RAG) workflows by chunking documents, attaching metadata, and embedding the chunks into vectorized form (a minimal chunking sketch follows this list).

- Handle large-scale ingestion from multiple sources into cloud-native data lakes (S3, GCS), data warehouses (BigQuery, Snowflake), and PostgreSQL.

- Automate pipelines using orchestration tools such as Airflow or Prefect, including retry logic, alerting, and metadata tracking (see the Airflow sketch after this list).

- Collaborate with ML Engineers to ensure data availability, traceability, and performance for inference and training pipelines.

- Implement data validation and testing frameworks using Great Expectations or dbt (see the validation sketch after this list).

- Integrate OCR pipelines and post-processing outputs for embedding and document search.


- Design infrastructure for streaming vs batch data needs and optimize for cost, latency, and reliability.
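
As a rough illustration of the RAG preparation work above, the first sketch below splits a document into overlapping chunks with provenance metadata. The Chunk record, chunk sizes, and the commented embed() call are illustrative assumptions, not the team's actual stack.

    # Minimal sketch: chunk a document for RAG ingestion. Chunk sizes,
    # the Chunk record, and the commented embed() call are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Chunk:
        doc_id: str
        chunk_index: int
        text: str
        metadata: dict = field(default_factory=dict)

    def chunk_document(doc_id: str, text: str,
                       size: int = 800, overlap: int = 100) -> list[Chunk]:
        """Split raw text into overlapping chunks, keeping character offsets."""
        chunks: list[Chunk] = []
        start = 0
        while start < len(text):
            piece = text[start:start + size]
            chunks.append(Chunk(doc_id, len(chunks), piece,
                                {"char_start": start,
                                 "char_end": start + len(piece)}))
            start += size - overlap
        return chunks

    # Each chunk would then be embedded and upserted into a vector store:
    # records = [(c.doc_id, c.chunk_index, embed(c.text), c.metadata)
    #            for c in chunks]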
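
Next, a minimal Airflow 2.x sketch of the retry logic and alerting mentioned in the orchestration bullet; the DAG id, schedule, alert address, and extract callable are all hypothetical, and Prefect exposes equivalent retry settings.

    # Minimal Airflow DAG sketch: retries with backoff plus email
    # alerting on failure. DAG id, schedule, and callables are made up.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # pull documents from a source system

    default_args = {
        "retries": 3,                          # retry transient failures
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
        "email_on_failure": True,              # alert once retries are exhausted
        "email": ["data-alerts@example.com"],  # hypothetical alert address
    }

    with DAG(
        dag_id="document_ingestion",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args=default_args,
    ):
        PythonOperator(task_id="extract", python_callable=extract)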
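
Lastly, a data-validation sketch using the classic pandas-flavoured Great Expectations API (recent releases use a different entry point, and dbt tests would instead be declared in YAML); the column names are made up.

    # Validation sketch, classic pandas-style Great Expectations API.
    # Column names are illustrative; expectations act as executable
    # assertions run against each batch of data.
    import great_expectations as ge
    import pandas as pd

    df = ge.from_pandas(pd.DataFrame({
        "case_id": ["C-1", "C-2", "C-3"],
        "filed_date": ["2024-01-05", "2024-02-10", "2024-03-02"],
    }))

    df.expect_column_values_to_not_be_null("case_id")
    df.expect_column_values_to_be_unique("case_id")
    result = df.expect_column_values_to_match_regex(
        "filed_date", r"\d{4}-\d{2}-\d{2}")
    print(result.success)  # True when the batch passes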

Qualifications:


- Bachelor's or Master's degree in Computer Science, Data Engineering, or an equivalent field.

- 3+ years of experience in building distributed data pipelines and managing multi-source ingestion.

- Proficiency with Python, SQL, and data tools such as Pandas and PySpark.

- Experience with data orchestration tools (Airflow, Prefect) and file formats such as Parquet, Avro, and JSON.

- Hands-on experience with cloud storage/data warehouse systems (S3, GCS, BigQuery, Redshift).

- Understanding of GenAI and vector database ingestion pipelines is a strong plus.

- Bonus: Experience with OCR tools (Tesseract, Google Document AI), PDF parsing libraries (PyMuPDF), and API-based document processors (a short PyMuPDF sketch follows).
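
For the PDF-parsing bonus above, a short PyMuPDF sketch; the file path is hypothetical, and scanned pages with no text layer would still need an OCR pass (e.g. Tesseract) first.

    # Extract per-page text from a PDF with PyMuPDF, ready to feed a
    # chunking/embedding pipeline. The file path is hypothetical.
    import fitz  # PyMuPDF

    def extract_pages(path: str) -> list[dict]:
        pages = []
        with fitz.open(path) as doc:
            for i, page in enumerate(doc):
                pages.append({"page": i + 1, "text": page.get_text()})
        return pages

    pages = extract_pages("sample_filing.pdf")
    print(pages[0]["text"][:200])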

