Key Responsibilities:

- Design, develop, and maintain robust data pipelines and ETL/ELT workflows using PySpark, Python, and SQL.

- Build and manage data ingestion and transformation processes from various sources including Hive, Kafka, and cloud-native services.

- Orchestrate workflows using Apache Airflow and ensure timely and reliable data delivery.

- Work with large-scale distributed data systems to process structured and unstructured datasets.

- Implement data quality checks, monitoring, and alerting mechanisms.

- Collaborate with cross-functional teams including data scientists, analysts, and product managers to understand data requirements.

- Optimize data processing for performance, scalability, and cost-efficiency.

- Ensure compliance with data governance, security, and privacy standards.

Required Skills & Qualifications:

- 5+ years of experience in data engineering or related roles.

- Strong programming skills in Python and PySpark.

- Proficiency in SQL and experience with Hive.

- Hands-on experience with Apache Airflow for workflow orchestration.

- Experience with Kafka for real-time data streaming.

- Solid understanding of big data ecosystems and distributed computing.

- Experience with GCP services (BigQuery, Dataflow, Dataproc).

- Ability to work with both structured (e.g., relational databases) and unstructured (e.g., logs, images, documents) data.

- Familiarity with CI/CD tools and version control systems (e.g., Git).

- Knowledge of containerization (Docker) and orchestration (Kubernetes).

- Exposure to data cataloging and governance tools (e.g., AWS Lake Formation, Google Data Catalog).

- Understanding of data modeling and architecture principles.