
Job Description

Job Summary :

We are seeking a highly skilled Senior Data Engineer to join our data engineering team, with 4 to 8 years of experience building robust data pipelines and working extensively with PySpark.

Key Responsibilities :

Data Pipeline Development :

- Design, build, and maintain scalable data pipelines using PySpark to process large datasets and support data-driven applications and analytics.
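
For illustration only, a minimal sketch of such a pipeline in PySpark; the bucket paths and column names (event_id, event_ts) are assumptions, not an actual project layout:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events_pipeline").getOrCreate()

# Extract: read raw events (path is a placeholder).
raw = spark.read.json("s3a://example-bucket/raw/events/")

# Transform: keep valid rows and derive a date column for partitioning.
events = (
    raw.filter(F.col("event_id").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Load: write partitioned Parquet for downstream analytics.
events.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-bucket/curated/events/"
)
```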

ETL Process Automation :

- Develop and automate ETL (Extract, Transform, Load) processes using PySpark, ensuring efficient data processing, transformation, and loading from diverse sources into data lakes, warehouses, or databases.
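
One possible shape for an automated ETL job, sketched below; the JDBC URL, table name, and target path are placeholders:

```python
from pyspark.sql import SparkSession

def run_etl(spark, jdbc_url: str, source_table: str, target_path: str) -> None:
    """Extract from a relational source, transform, and load to a lake path.

    All connection details here are placeholders for illustration.
    """
    # Extract: pull the source table over JDBC.
    src = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", source_table)
        .load()
    )

    # Transform: normalize column names and drop exact duplicates.
    cleaned = src.toDF(*[c.lower() for c in src.columns]).dropDuplicates()

    # Load: append into the curated zone as Parquet.
    cleaned.write.mode("append").parquet(target_path)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("etl_job").getOrCreate()
    run_etl(
        spark,
        "jdbc:postgresql://db.example:5432/app",  # placeholder URL
        "public.orders",                          # placeholder table
        "s3a://example-bucket/curated/orders/",   # placeholder path
    )
```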

Distributed Computing with PySpark :

- Leverage Apache Spark and PySpark to process large-scale data in a distributed computing environment, optimizing for performance and scalability.
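
A small sketch of a distributed aggregation with explicit tuning knobs; the configuration values and dataset are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("distributed_agg").getOrCreate()

# Let Adaptive Query Execution tune shuffle partitions at runtime,
# with an explicit baseline for large shuffles (values are illustrative).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "400")

clicks = spark.read.parquet("s3a://example-bucket/raw/clicks/")  # placeholder path

# A wide aggregation that Spark distributes across executors.
daily = (
    clicks.groupBy("user_id", F.to_date("click_ts").alias("day"))
          .agg(F.count("*").alias("clicks"))
)
daily.write.mode("overwrite").parquet("s3a://example-bucket/agg/daily_clicks/")
```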

Cloud Data Solutions :

- Develop and deploy data pipelines and processing frameworks on cloud platforms (AWS, Azure, GCP) using native tools like AWS Glue, Azure Databricks, or Google Dataproc.
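
As a cloud-flavored example, a hedged sketch of the standard AWS Glue PySpark job skeleton; the bucket paths are placeholders, and the script only runs inside a Glue job environment:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

# Standard Glue boilerplate: resolve job arguments and build contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read, process, and write using plain Spark APIs (paths are placeholders).
df = spark.read.parquet("s3://example-bucket/raw/")
df.write.mode("overwrite").parquet("s3://example-bucket/curated/")

job.commit()
```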

Data Integration & Transformation :

- Integrate data from various internal and external sources, ensuring data consistency, quality, and reliability throughout the pipeline.
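
A hedged sketch of integrating two sources with a basic consistency check; the table locations and join key are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("integration").getOrCreate()

# Hypothetical internal and external sources.
customers = spark.read.parquet("s3a://example-bucket/internal/customers/")
enrichment = spark.read.json("s3a://example-bucket/external/enrichment/")

joined = (
    customers.join(enrichment, on="customer_id", how="left")
             .dropDuplicates(["customer_id"])
)

# Basic quality gate: fail fast if required keys are missing.
null_keys = joined.filter(F.col("customer_id").isNull()).count()
if null_keys > 0:
    raise ValueError(f"{null_keys} rows missing customer_id")

joined.write.mode("overwrite").parquet(
    "s3a://example-bucket/curated/customers_enriched/"
)
```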

Performance Optimization :

- Optimize PySpark jobs and pipelines for faster data processing, handling large volumes of data efficiently with minimal latency.
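
A short sketch of common PySpark tuning moves (broadcast join, selective caching, repartitioning before write); all paths and keys are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning").getOrCreate()

facts = spark.read.parquet("s3a://example-bucket/facts/")  # large table (assumption)
dims = spark.read.parquet("s3a://example-bucket/dims/")    # small lookup (assumption)

# Broadcast the small side to avoid shuffling the large table.
enriched = facts.join(F.broadcast(dims), "dim_id")

# Cache only if the result is reused by multiple downstream actions.
enriched.cache()
print(enriched.count())  # materializes the cache

# Repartition by the write key to reduce small files on output.
enriched.repartition("dim_id").write.mode("overwrite").parquet(
    "s3a://example-bucket/enriched/"
)
```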

Required Qualifications :

- Proven experience as a Data Engineer or in a similar role, with a strong background in database development, ETL processes, and software development.

- Proficiency in SQL and scripting languages such as Python, with experience working with relational databases.

- Proficiency in Dataproc (PySpark), Pandas, or other data processing libraries.

- Experience with data modeling, schema design, and optimization techniques for scalability (see the schema sketch after this list).

- Strong analytical and problem-solving skills, with the ability to troubleshoot complex data issues and optimize data processing pipelines at scale.

- 4-8 years of experience in data engineering, with a strong focus on PySpark and large-scale data processing.
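
Picking up the schema-design point above, a minimal sketch of declaring an explicit schema rather than relying on inference; the field names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, TimestampType
)

spark = SparkSession.builder.appName("schema_example").getOrCreate()

# Explicit schema: stable, documented, and faster than schema inference.
order_schema = StructType([
    StructField("order_id", LongType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("status", StringType(), nullable=True),
    StructField("created_at", TimestampType(), nullable=True),
])

orders = spark.read.schema(order_schema).json("s3a://example-bucket/raw/orders/")
```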

Technical Skills :

- Expertise in PySpark for distributed data processing, data transformation, and job optimization.

- Strong proficiency in Python and SQL for data manipulation and pipeline creation.

- Hands-on experience with Apache Spark and its ecosystem, including Spark SQL, Spark Streaming, and PySpark MLlib (see the Spark SQL sketch after this list).

- Solid experience working with ETL tools and frameworks, such as Apache Airflow or similar orchestration tools (a minimal DAG sketch follows this list).
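
Referencing the Spark SQL point above, a small sketch of mixing the DataFrame API with SQL over a temporary view; names and paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark_sql_example").getOrCreate()

orders = spark.read.parquet("s3a://example-bucket/curated/orders/")  # placeholder
orders.createOrReplaceTempView("orders")

# Declarative aggregation via Spark SQL; returns a regular DataFrame.
top_customers = spark.sql("""
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    ORDER BY order_count DESC
    LIMIT 10
""")
top_customers.show()
```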
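And for the orchestration point, a minimal Airflow DAG sketch (assuming Airflow 2.4+) that submits a PySpark job via spark-submit; the DAG id, schedule, and script path are assumptions:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal daily DAG; all names and paths are placeholders.
with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_pipeline = BashOperator(
        task_id="spark_submit_events",
        bash_command="spark-submit --master yarn /opt/jobs/events_pipeline.py",
    )
```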
