
Job Description

Data Engineer - Multi-source ETL & GenAI Pipelines (3+ Years)

Roles and Responsibilities:


- Build and maintain scalable, fault-tolerant data pipelines to support GenAI and analytics workloads across OCR output, documents, and case data.

- Manage ingestion and transformation of semi-structured legal documents (PDF, Word, Excel) into structured formats.

- Enable retrieval-augmented generation (RAG) workflows by chunking documents, attaching metadata, and embedding the chunks into vectorized form (a minimal chunking sketch follows this list).

- Handle large-scale ingestion from multiple sources into cloud-native data lakes (S3, GCS), data warehouses (BigQuery, Snowflake), and PostgreSQL.

- Automate pipelines using orchestration tools such as Airflow or Prefect, including retry logic, alerting, and metadata tracking (see the Airflow sketch after this list).

- Collaborate with ML Engineers to ensure data availability, traceability, and performance for inference and training pipelines.

- Implement data validation and testing frameworks using Great Expectations or dbt (see the validation sketch after this list).

- Integrate OCR pipelines and post-processing outputs for embedding and document search.


- Design infrastructure for streaming vs batch data needs and optimize for cost, latency, and reliability.
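
As a rough illustration of the RAG preparation work above, the first sketch below splits a document into overlapping chunks with provenance metadata. The Chunk record, chunk sizes, and the commented embed() call are illustrative assumptions, not the team's actual stack.

    # Minimal sketch: chunk a document for RAG ingestion. Chunk sizes,
    # the Chunk record, and the commented embed() call are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Chunk:
        doc_id: str
        chunk_index: int
        text: str
        metadata: dict = field(default_factory=dict)

    def chunk_document(doc_id: str, text: str,
                       size: int = 800, overlap: int = 100) -> list[Chunk]:
        """Split raw text into overlapping chunks, keeping character offsets."""
        chunks: list[Chunk] = []
        start = 0
        while start < len(text):
            piece = text[start:start + size]
            chunks.append(Chunk(doc_id, len(chunks), piece,
                                {"char_start": start,
                                 "char_end": start + len(piece)}))
            start += size - overlap
        return chunks

    # Each chunk would then be embedded and upserted into a vector store:
    # records = [(c.doc_id, c.chunk_index, embed(c.text), c.metadata)
    #            for c in chunks]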
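
Next, a minimal Airflow 2.x sketch of the retry logic and alerting mentioned in the orchestration bullet; the DAG id, schedule, alert address, and extract callable are all hypothetical, and Prefect exposes equivalent retry settings.

    # Minimal Airflow DAG sketch: retries with backoff plus email
    # alerting on failure. DAG id, schedule, and callables are made up.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # pull documents from a source system

    default_args = {
        "retries": 3,                          # retry transient failures
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
        "email_on_failure": True,              # alert once retries are exhausted
        "email": ["data-alerts@example.com"],  # hypothetical alert address
    }

    with DAG(
        dag_id="document_ingestion",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args=default_args,
    ):
        PythonOperator(task_id="extract", python_callable=extract)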
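
Lastly, a data-validation sketch using the classic pandas-flavoured Great Expectations API (recent releases use a different entry point, and dbt tests would instead be declared in YAML); the column names are made up.

    # Validation sketch, classic pandas-style Great Expectations API.
    # Column names are illustrative; expectations act as executable
    # assertions run against each batch of data.
    import great_expectations as ge
    import pandas as pd

    df = ge.from_pandas(pd.DataFrame({
        "case_id": ["C-1", "C-2", "C-3"],
        "filed_date": ["2024-01-05", "2024-02-10", "2024-03-02"],
    }))

    df.expect_column_values_to_not_be_null("case_id")
    df.expect_column_values_to_be_unique("case_id")
    result = df.expect_column_values_to_match_regex(
        "filed_date", r"\d{4}-\d{2}-\d{2}")
    print(result.success)  # True when the batch passes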

Qualifications:


- Bachelor's or Master's degree in Computer Science, Data Engineering, or an equivalent field.

- 3+ years of experience in building distributed data pipelines and managing multi-source ingestion.

- Proficiency with Python, SQL, and data tools such as Pandas and PySpark.

- Experience with data orchestration tools (Airflow, Prefect) and file formats such as Parquet, Avro, and JSON.

- Hands-on experience with cloud storage/data warehouse systems (S3, GCS, BigQuery, Redshift).

- Understanding of GenAI and vector database ingestion pipelines is a strong plus.

- Bonus: Experience with OCR tools (Tesseract, Google Document AI), PDF parsing libraries (PyMuPDF), and API-based document processors (a short PyMuPDF sketch follows).
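
For the PDF-parsing bonus above, a short PyMuPDF sketch; the file path is hypothetical, and scanned pages with no text layer would still need an OCR pass (e.g. Tesseract) first.

    # Extract per-page text from a PDF with PyMuPDF, ready to feed a
    # chunking/embedding pipeline. The file path is hypothetical.
    import fitz  # PyMuPDF

    def extract_pages(path: str) -> list[dict]:
        pages = []
        with fitz.open(path) as doc:
            for i, page in enumerate(doc):
                pages.append({"page": i + 1, "text": page.get_text()})
        return pages

    pages = extract_pages("sample_filing.pdf")
    print(pages[0]["text"][:200])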

