JOB DESCRIPTION : Data Engineer
We are seeking a highly skilled Data Engineer with deep expertise in Apache Kafka integration with Databricks, Spark Structured Streaming, and large-scale data pipeline design using the Medallion Architecture. The ideal candidate will demonstrate strong hands-on experience building and optimizing real-time and batch pipelines, and will be expected to solve real coding problems during the interview.
Responsibilities :
- Design, develop, and maintain real-time and batch data pipelines in Databricks.
- Integrate Apache Kafka with Databricks using Structured Streaming (a minimal ingestion sketch follows this list).
- Implement robust data ingestion frameworks using Databricks Autoloader.
- Build and maintain Medallion Architecture pipelines across Bronze, Silver, and Gold layers.
- Implement checkpointing, output modes, and appropriate processing modes in structured streaming jobs.
- Design and implement Change Data Capture (CDC) workflows and Slowly Changing Dimensions (SCD) Type 1 and Type 2 logic.
- Develop reusable components for merge/upsert operations and window-function-based transformations.
- Handle large volumes of data efficiently through proper partitioning, caching, and cluster tuning techniques.
- Collaborate with cross-functional teams to ensure data availability, reliability, and consistency.
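To make the Kafka and Bronze-layer responsibilities above concrete, here is a minimal PySpark sketch of a Structured Streaming job that reads a Kafka topic and lands it in a Bronze Delta table with checkpointing. It assumes a Databricks environment; the broker address, topic name, event schema, checkpoint path, and table name are illustrative placeholders, not requirements from this posting.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-bronze-ingest").getOrCreate()

# Hypothetical event schema; in practice a schema registry (Avro/JSON) would drive this.
event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read the Kafka topic as a streaming source (placeholder broker and topic).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "orders")
       .option("startingOffsets", "earliest")
       .load())

# Parse the JSON payload and keep the fields of interest.
events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

# Land the stream in the Bronze layer as Delta, with a checkpoint for fault tolerance.
(events.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/mnt/checkpoints/orders_bronze")
       .trigger(availableNow=True)
       .toTable("bronze.orders"))
```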
Must Have :
- Apache Kafka : Integration, topic management, schema registry (Avro/JSON).
- Databricks & Spark Structured Streaming :
1. Output Modes: Append, Update, Complete
2. Output Sinks: Memory, Console, File, Kafka, Delta
3. Checkpointing and fault tolerance
- Databricks Autoloader : Schema inference, schema evolution, incremental loads (see the sketch after this list).
- Medallion Architecture implementation expertise.
- Performance Optimization :
1. Data partitioning strategies
2. Caching and persistence
3. Adaptive query execution and cluster configuration tuning
- SQL & Spark SQL : Proficiency in writing efficient queries and transformations.
- Data Governance : Schema enforcement, data quality checks, and monitoring.
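As a concrete reference for the Autoloader requirement above, the following is a minimal sketch of incremental ingestion with schema inference and evolution; the landing path, schema and checkpoint locations, and target table are hypothetical, and `spark` is assumed to be the ambient Databricks session.

```python
# Minimal Databricks Autoloader (cloudFiles) sketch; paths and table names are placeholders.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       # Autoloader tracks the inferred schema here and evolves it across runs.
       .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")
       # Add newly observed columns instead of failing the stream.
       .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
       .load("/mnt/landing/orders/"))

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders_autoloader")
    .option("mergeSchema", "true")     # let the Bronze table schema evolve as well
    .trigger(availableNow=True)        # process currently available files, then stop
    .toTable("bronze.orders_raw"))
```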
Good to Have :
- Strong coding skills in Python and PySpark.
- Experience working in CI/CD environments for data pipelines.
- Exposure to cloud platforms (AWS/Azure/GCP).
- Understanding of Delta Lake, time travel, and data versioning (see the time-travel sketch after this list).
- Familiarity with orchestration tools like Airflow or Azure Data Factory.
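For the Delta Lake and time-travel item above, a minimal versioning sketch is shown below; the table name, path, and timestamp are placeholders, and `spark` is assumed to be the ambient Databricks session.

```python
from delta.tables import DeltaTable

# Inspect the commit history (version log) of a Delta table.
DeltaTable.forName(spark, "silver.orders").history().show(truncate=False)

# Query an earlier snapshot by version number or timestamp via Spark SQL.
spark.sql("SELECT * FROM silver.orders VERSION AS OF 0").show()
spark.sql("SELECT * FROM silver.orders TIMESTAMP AS OF '2024-01-01'").show()

# The DataFrame reader accepts the same options when reading a Delta path.
old_snapshot = (spark.read.format("delta")
                .option("versionAsOf", 0)
                .load("/mnt/delta/silver/orders"))
```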
Posted in : Data Engineering
Functional Area : Data Engineering
Job Code : 1575505