Posted on: 01/12/2025
About the Role :
We are seeking a skilled Senior Data Engineer with hands-on experience in big data frameworks, cloud data platforms, orchestration tools, and distributed data processing. The role involves designing, configuring, optimizing, and maintaining end-to-end data pipelines and architectures across Apache Spark, Airflow, Trino, Azure Data Services, and Linux environments.
Roles and Responsibilities :
Apache Spark & PySpark :
- Install, configure, and manage Apache Spark (open-source) clusters on Ubuntu.
- Set up Spark master/worker nodes and Spark environment files.
- Configure and manage Spark UI and Spark History Server for monitoring jobs, analyzing DAGs, and troubleshooting performance.
- Develop, optimize, and deploy PySpark ETL/ELT pipelines using the DataFrame API, UDFs, window functions, caching, partitioning, and broadcasting (a minimal sketch follows this list).
- Deploy PySpark jobs using spark-submit (client/cluster mode) with proper logging and error handling.
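As an illustration of the pipeline work described above, here is a minimal PySpark sketch combining a broadcast join, a window function, caching, and a partitioned Parquet write. The dataset names, paths, and columns are placeholders for illustration only, not details of this role.

```python
# Hypothetical PySpark ETL sketch; paths, table names, and columns are placeholders.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

orders = spark.read.parquet("/data/raw/orders")        # large fact data
customers = spark.read.parquet("/data/raw/customers")  # small dimension table

# Broadcast the small dimension to avoid shuffling the large fact table on the join.
enriched = orders.join(F.broadcast(customers), "customer_id", "left")

# Window function: rank each customer's orders by amount.
w = Window.partitionBy("customer_id").orderBy(F.col("amount").desc())
ranked = enriched.withColumn("amount_rank", F.row_number().over(w))

# Cache if the result feeds several downstream aggregations.
ranked.cache()

# Partition the output by date to keep downstream scans selective.
(ranked.write
       .mode("overwrite")
       .partitionBy("order_date")
       .parquet("/data/curated/orders_ranked"))

spark.stop()
```

A job like this would typically be deployed with something along the lines of spark-submit --master spark://<master-host>:7077 --deploy-mode client orders_etl.py, with the run then inspected through the Spark UI or History Server.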
Apache Airflow & Orchestration :
- Install, configure, and manage Apache Airflow, including the scheduler and webserver (UI).
- Create, schedule, and monitor Airflow DAGs for PySpark jobs using SparkSubmitOperator, BashOperator, and PythonOperator (see the DAG sketch after this list).
- Configure and manage cron jobs for data processing tasks.
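A minimal DAG sketch for the kind of orchestration described above, assuming Airflow 2.x with the apache-airflow-providers-apache-spark package installed and a "spark_default" connection pointing at the Spark master; the DAG id, script path, and schedule are placeholders.

```python
# Hypothetical Airflow 2.x DAG; assumes the apache-airflow-providers-apache-spark
# package and a "spark_default" connection configured for the cluster.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,                          # retry logic on task failure
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # alerting hook (requires SMTP config)
}

with DAG(
    dag_id="orders_etl_daily",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 2 * * *",         # cron-style daily schedule
    catchup=False,
    default_args=default_args,
) as dag:
    run_orders_etl = SparkSubmitOperator(
        task_id="run_orders_etl",
        application="/opt/jobs/orders_etl.py",   # hypothetical PySpark script path
        conn_id="spark_default",
        verbose=True,                            # keep spark-submit output in task logs
    )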
Trino (PrestoSQL) :
- Install, configure, and optimize Trino coordinator and worker nodes.
- Configure catalogs for sources such as S3 (via the Hive or Iceberg connector), MySQL, and PostgreSQL (see the query sketch after this list).
- Troubleshoot and optimize distributed SQL queries.
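For illustration, a short sketch of querying a Trino cluster from Python, assuming the trino client package (trino-python-client) and a catalog named "postgresql" defined on the cluster in etc/catalog/postgresql.properties (connector.name, connection-url, connection-user, connection-password); hostnames and table names are placeholders.

```python
# Hypothetical Trino query from Python; host, user, and table are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.internal",   # coordinator host (placeholder)
    port=8080,
    user="data_engineer",
    catalog="postgresql",                # catalog configured on the cluster
    schema="public",
)

cur = conn.cursor()
# EXPLAIN (or the Trino web UI) helps spot expensive stages when tuning
# distributed queries across worker nodes.
cur.execute("SELECT status, count(*) AS n FROM orders GROUP BY status")
for row in cur.fetchall():
    print(row)
```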
Linux/Ubuntu Administration :
- Maintain Linux servers: manage system services, logs, and environment variables; monitor memory usage; and resolve port conflicts.
Azure Data Engineering :
- Design and implement scalable architectures using Azure Data Services including ADF, Synapse, ADLS, Azure SQL, and Databricks.
- Develop and manage ETL/ELT pipelines using Azure Data Factory (Pipelines, Dataflows, Mapping Data Flows); a short triggering sketch follows this list.
- Perform data migration, upgrades, and modernization using Azure tools.
- Implement CI/CD pipelines for data workloads using Azure DevOps and Git.
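As a sketch of the ADF work above, here is one way to trigger and monitor a pipeline run from Python, assuming the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, factory, pipeline, and parameter names are placeholders.

```python
# Hypothetical ADF pipeline trigger; all resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
client = DataFactoryManagementClient(credential, "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="rg-data-platform",   # placeholder resource group
    factory_name="adf-orders",                # placeholder data factory
    pipeline_name="pl_load_orders",           # placeholder pipeline
    parameters={"load_date": "2025-01-12"},
)

# Poll the run status (e.g. InProgress, Succeeded, Failed).
status = client.pipeline_runs.get("rg-data-platform", "adf-orders", run.run_id)
print(status.status)
```

In practice the same call pattern can be wired into an Azure DevOps pipeline step as part of CI/CD for data workloads.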
Data Processing & Analytics :
- Work with structured, semi-structured, and unstructured data.
- Implement data analytics, transformation, and backup/recovery solutions.
- Ensure data quality, governance, lineage, metadata management, and security compliance.
- Design and optimize data models using star and snowflake schemas.
- Build data warehouses, Delta Lake tables, and Lakehouse architectures (see the Delta sketch after this list).
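A minimal sketch of an incremental Delta Lake upsert (merge) into a Lakehouse table, assuming the delta-spark package is configured on the Spark cluster; the paths and column names are placeholders.

```python
# Hypothetical Delta Lake merge; assumes delta-spark is configured on the cluster.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders_upsert").getOrCreate()

updates = spark.read.parquet("/data/staging/orders_increment")    # new/changed rows
target = DeltaTable.forPath(spark, "/lakehouse/curated/orders")   # existing Delta table

(target.alias("t")
       .merge(updates.alias("u"), "t.order_id = u.order_id")
       .whenMatchedUpdateAll()       # update rows that changed
       .whenNotMatchedInsertAll()    # insert rows that are new
       .execute())
```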
Reporting & Visualization :
- Build and enhance dashboards using Power BI, Tableau, or similar tools.
Collaboration & Documentation :
- Collaborate with internal teams, business stakeholders, and clients to gather requirements.
- Prepare documentation, runbooks, and operational guides.
Technical Skills Required :
Mandatory Skills (4 to 7 Years Experience) :
Apache Spark & PySpark :
- Spark installation and cluster configuration (Linux/Ubuntu)
- Master/worker setup (standalone/cluster mode)
- Spark UI & History Server setup and debugging
- PySpark development (ETL pipelines, UDFs, window functions)
- Performance tuning (partitioning, caching, shuffle optimization)
- spark-submit deployment with monitoring and logging
Apache Airflow :
- Installation & configuration (UI, scheduler, webserver)
- Building & scheduling DAGs
- Retry logic, alerting, log management
- Cron job scheduling
Trino (PrestoSQL) :
- Coordinator & worker node setup
- Catalog configuration (S3, RDBMS)
- Distributed SQL troubleshooting
Azure Data Services (Nice to Have) :
- Azure Data Factory
- Azure Synapse Analytics
- Azure SQL / Cosmos DB
- ADLS Gen2
- Azure Databricks (Delta, Notebooks, Jobs)
- Event Hubs, Stream Analytics
Microsoft Fabric (Nice to Have) :
- Lakehouse
- Warehouse
- Dataflows
- Notebooks
- Pipelines
Programming & Querying :
- Python
- PySpark
- SQL
- Scala
Data Modeling & Warehousing :
- Star schema, Snowflake schema
- Fact/dimension modeling
- Data warehouse & Lakehouse design
- Delta Lake
DevOps & CI/CD :
- Git / GitHub / Azure Repos
- Azure DevOps pipelines (CI/CD)
- Automated deployment for Spark, Airflow, ADF, Databricks, Fabric
BI Tools (Nice to Have) :
- Power BI
- Tableau
- Dataset creation, DAX, reporting
Linux/Ubuntu :
- Shell scripting
- Service & log management
- Environment variables
Soft Skills (4 to 7 Years Experience) :
- Excellent problem-solving and communication skills
- Strong organizational and time management abilities
- Ability to work effectively in a team environment
- Taking ownership of tasks end-to-end
- Production support and timely delivery
- Self-driven, flexible, and innovative
Certifications preferred : Microsoft Certified Azure Data Engineer Associate (DP-203) or Azure Database Administrator Associate (DP-300)
Understanding of DevOps and CI/CD practices in Azure
Qualification : BSc/BA in Computer Science, Engineering, or a related field
Posted in : Data Engineering
Functional Area : Data Engineering
Job Code : 1583396