hirist

PySpark Developer

SUNWARE TECHNOLOGIES PRIVATE LIMITED
Multiple Locations
5 - 15 Years

Posted on: 09/12/2025

Job Description


Role : PySpark Developer

Location : Chennai, Hyderabad, Kolkata

Experience : 5-15 years

Key Responsibilities :


- Design and build robust, scalable ETL/ELT pipelines using PySpark to ingest data from diverse sources (databases, logs, APIs, files).

- Transform and curate raw transactional and log data into analysis-ready datasets in the Data Hub and analytical data marts.

- Develop reusable and parameterized Spark jobs for batch and micro-batch processing.

- Optimize performance and scalability of PySpark jobs across large data volumes.

- Ensure data quality, consistency, lineage, and proper documentation across ingestion flows.

- Collaborate with Data Architects, Modelers, and Data Scientists to implement ingestion logic aligned with business needs.

- Work with cloud-based data platforms (e.g., AWS S3, Glue, EMR, Redshift) for data movement and storage.

- Support version control, CI/CD, and infrastructure-as-code where applicable.
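To illustrate the "reusable and parameterized Spark jobs" the responsibilities above describe, here is a minimal sketch of a date-parameterized batch ingest. All paths, the app name, and the curation step are hypothetical; the PySpark import is deferred into the job function so the path helper also works where Spark is not installed.

```python
import argparse
from datetime import date


def partition_path(base: str, run_date: date) -> str:
    """Build a date-partitioned output path (hypothetical layout)."""
    return f"{base}/year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}"


def run_job(source: str, target: str, run_date: date) -> None:
    """Read raw JSON, apply a simple curation step, write Parquet.

    Requires PySpark; the import lives here so the helper above stays
    usable without a Spark installation.
    """
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ingest_job").getOrCreate()
    df = spark.read.json(source)
    curated = (
        df.dropDuplicates()                                  # example curation step
          .withColumn("ingest_date", F.lit(run_date.isoformat()))
    )
    curated.write.mode("overwrite").parquet(partition_path(target, run_date))
    spark.stop()


if __name__ == "__main__":
    # Parameterization via CLI arguments, so one job serves many runs.
    parser = argparse.ArgumentParser(description="Parameterized ingest job")
    parser.add_argument("--source", required=True)
    parser.add_argument("--target", required=True)
    parser.add_argument("--run-date", default=date.today().isoformat())
    args = parser.parse_args()
    run_job(args.source, args.target, date.fromisoformat(args.run_date))
```

The same pattern extends to micro-batch processing by invoking the job per time window from an orchestrator.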

Required Skills & Qualifications :


- 5+ years of experience in data engineering, with strong focus on PySpark/Spark for big data processing.

- Expertise in building data pipelines and ingestion frameworks from relational, semi-structured (JSON, XML), and unstructured sources (logs, PDFs).

- Proficiency in Python with strong knowledge of data processing libraries.

- Strong SQL skills for querying and validating data in platforms like Amazon Redshift, PostgreSQL, or similar.

- Experience with distributed computing frameworks (e.g., Spark on EMR, Databricks).

- Familiarity with workflow orchestration tools (e.g., AWS Step Functions).

- Solid understanding of data lake / data warehouse architectures and data modeling basics.
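The "SQL skills for querying and validating data" item above amounts to running assertion-style checks against a warehouse. A small sketch, using the standard-library sqlite3 module as a stand-in for Redshift/PostgreSQL (table and column names are illustrative):

```python
import sqlite3


def row_count_check(conn, table: str, min_rows: int) -> bool:
    """Data-quality check: the table must hold at least min_rows rows."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count >= min_rows


def null_check(conn, table: str, column: str) -> bool:
    """Data-quality check: the column must contain no NULLs."""
    (nulls,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()
    return nulls == 0


if __name__ == "__main__":
    # In-memory demo; against Redshift/PostgreSQL the same SQL would run
    # through a database driver instead of sqlite3.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,)])
    print(row_count_check(conn, "orders", 1))    # expect True
    print(null_check(conn, "orders", "amount"))  # expect True
```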

Preferred Qualifications :


- Experience with AWS data services: Glue, S3, Redshift, Lambda, CloudWatch, etc.

- Familiarity with Delta Lake or similar for large-scale data storage.

- Exposure to real-time streaming frameworks (e.g., Spark Structured Streaming, Kafka).

- Knowledge of data governance, lineage, and cataloging tools (e.g., AWS Glue Catalog, Apache Atlas).

- Understanding of DevOps/CI-CD pipelines for data projects using Git, Jenkins, or similar tools.

