Posted on: 10/04/2026
Job Summary:
We are seeking an experienced Python/PySpark Developer with strong expertise in big data technologies and data processing.
The ideal candidate will have hands-on experience building scalable data pipelines with PySpark, Python, and SQL, along with exposure to Java and Pandas for data manipulation.
Key Responsibilities:
- Design, develop, and maintain scalable data pipelines using Python and PySpark
- Process and analyze large datasets using Pandas and Spark DataFrames
- Write optimized queries using SQL for data extraction and transformation
- Work with Java-based components where required in the data ecosystem
- Perform data cleansing, transformation, and validation
- Optimize data workflows for performance and scalability
- Collaborate with cross-functional teams including Data Engineers and stakeholders
- Ensure data quality, integrity, and consistency
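The cleansing, transformation, and validation duties above can be sketched in miniature with Pandas (the dataset, column names, and validation rules here are hypothetical illustrations, not part of this posting):

```python
import pandas as pd

# Hypothetical raw order data with the kinds of quality issues
# the cleansing and validation duties describe.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount": ["10.5", "20.0", "20.0", None, "15.25"],
    "region": ["us", "EU ", "apac", "apac", None],
})

# Cleansing: drop duplicate orders, coerce types, normalize text.
clean = (
    raw.drop_duplicates(subset="order_id")
       .assign(
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
           region=lambda df: df["region"].str.strip().str.upper(),
       )
)

# Validation: keep only rows that pass basic integrity checks.
valid = clean.dropna(subset=["amount", "region"])
valid = valid[valid["amount"] > 0]

print(valid)
```

In a production pipeline the same steps would typically run on Spark DataFrames for scale, with Pandas reserved for smaller, in-memory analysis.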
Required Skills:
- Strong experience in Python, PySpark, Pandas, and SQL
- 5+ years of experience in a similar role
- Good knowledge of Java (for integration or backend support)
- Hands-on experience with Apache Spark (RDD, DataFrames, Spark SQL)
- Strong understanding of ETL processes and data pipelines
- Experience with big data tools (Hadoop, Hive, etc.)
- Strong problem-solving and analytical skills
Preferred Skills:
- Experience with Airflow or other orchestration tools
- Exposure to cloud platforms (AWS / Azure / GCP)
- Knowledge of data warehousing and data lakes
- Familiarity with CI/CD and version control (Git)
Posted in: Data Analytics & BI
Functional Area: Big Data / Data Warehousing / ETL
Job Code: 1627473