Posted on: 29/10/2025
Description:
Senior Data Engineer (Spark & Lakehouse)
Location: Remote, India (Preferred: Bangalore/Pune)
Experience: 6+ Years
Domain: Data Engineering / Big Data
About the Role:
We are seeking a Senior Data Engineer to drive the development of our next-generation Data Lakehouse architecture.
You will design, build, and optimize massive-scale, low-latency data pipelines that support real-time analytics and machine learning applications.
Key Responsibilities:
- Design and build highly optimized, production-grade ETL/ELT pipelines using Apache Spark (PySpark/Scala) to process petabytes of data.
- Architect and manage the Data Lakehouse using open-source technologies like Delta Lake or Apache Hudi for ACID transactions and data quality.
- Integrate and process real-time data streams using technologies such as Apache Kafka or Kinesis.
- Implement automated data quality checks, monitoring, and lineage tracking across all data products.
- Collaborate with the infrastructure team to automate data platform deployment and scaling on the cloud (AWS EMR/Glue or Databricks) using Terraform.
- Optimize data warehousing and querying performance in platforms like Snowflake or Google BigQuery.
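To illustrate the data-quality responsibility above, here is a minimal, hypothetical sketch of a row-level validation rule in plain Python; the field names (`order_id`, `amount`) are illustrative assumptions, and in practice equivalent rules would run at scale inside Spark or a dedicated framework such as Great Expectations.

```python
def check_record(rec: dict) -> list[str]:
    """Return a list of data-quality violations for one record.

    An empty list means the record passed all checks. The rules below
    (non-null key, non-negative amount) are illustrative only.
    """
    errors = []
    if rec.get("order_id") is None:
        errors.append("order_id is null")
    amount = rec.get("amount")
    if amount is not None and amount < 0:
        errors.append("amount is negative")
    return errors


rows = [
    {"order_id": 1, "amount": 42.0},
    {"order_id": None, "amount": -5.0},
]
# Keep only failing records, keyed by order_id, for monitoring/alerting.
failures = {r["order_id"]: check_record(r) for r in rows if check_record(r)}
```

In a production pipeline, a check like this would typically gate downstream loads and feed its results into the monitoring and lineage systems mentioned above.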
Technical Skills Required:
- Expert proficiency and tuning experience with Apache Spark (PySpark or Scala).
- Mandatory experience with Data Lakehouse technologies (Delta Lake, Iceberg, or Hudi).
- Strong experience with at least one public cloud data platform (AWS, GCP, or Azure).
- Solid knowledge of data modeling (Dimensional, Data Vault) and advanced SQL.
- Experience with workflow orchestration tools such as Apache Airflow or Prefect.
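Orchestration tools like Airflow and Prefect model a pipeline as a directed acyclic graph (DAG) of tasks and execute them in dependency order. A minimal sketch of that underlying idea, using only Python's standard-library `graphlib` (the task names are hypothetical, not part of any real DAG):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it
# depends on, mirroring how Airflow/Prefect wire up a DAG.
deps = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_join"},
    "data_quality_checks": {"load_warehouse"},
}

# static_order() yields tasks so that every dependency runs first.
order = list(TopologicalSorter(deps).static_order())
```

Real orchestrators add scheduling, retries, and backfills on top of this ordering, but the dependency-graph core is the same.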
Posted in: Data Engineering
Functional Area: Data Engineering
Job Code: 1567131