Posted on: 01/12/2025
About the Role :
We are seeking a skilled Senior Data Engineer with hands-on experience in big data frameworks, cloud data platforms, orchestration tools, and distributed data processing. The role involves designing, configuring, optimizing, and maintaining end-to-end data pipelines and architectures across Apache Spark, Airflow, Trino, Azure Data Services, and Linux environments.
Roles and Responsibilities :
Apache Spark & PySpark :
- Install, configure, and manage Apache Spark (open-source) clusters on Ubuntu.
- Set up Spark master/worker nodes and Spark environment files.
- Configure and manage Spark UI and Spark History Server for monitoring jobs, analyzing DAGs, and troubleshooting performance.
- Develop, optimize, and deploy PySpark ETL/ELT pipelines using the DataFrame API, UDFs, window functions, caching, partitioning, and broadcasting (a minimal sketch follows this list).
- Deploy PySpark jobs using spark-submit (client/cluster mode) with proper logging and error handling.
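As an illustration of the pipeline work described above, here is a minimal PySpark sketch combining a broadcast join, a window function, caching, and a partitioned Parquet write. The dataset names, paths, and columns are placeholders for illustration only, not details of this role.

```python
# Hypothetical PySpark ETL sketch; paths, table names, and columns are placeholders.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

orders = spark.read.parquet("/data/raw/orders")        # large fact data
customers = spark.read.parquet("/data/raw/customers")  # small dimension table

# Broadcast the small dimension to avoid shuffling the large fact table on the join.
enriched = orders.join(F.broadcast(customers), "customer_id", "left")

# Window function: rank each customer's orders by amount.
w = Window.partitionBy("customer_id").orderBy(F.col("amount").desc())
ranked = enriched.withColumn("amount_rank", F.row_number().over(w))

# Cache if the result feeds several downstream aggregations.
ranked.cache()

# Partition the output by date to keep downstream scans selective.
(ranked.write
       .mode("overwrite")
       .partitionBy("order_date")
       .parquet("/data/curated/orders_ranked"))

spark.stop()
```

A job like this would typically be deployed with something along the lines of spark-submit --master spark://<master-host>:7077 --deploy-mode client orders_etl.py, with the run then inspected through the Spark UI or History Server.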
Apache Airflow & Orchestration :
- Install, configure, and manage Apache Airflow, including the scheduler and webserver (UI).
- Create, schedule, and monitor Airflow DAGs for PySpark jobs using SparkSubmitOperator, BashOperator, and PythonOperator (see the DAG sketch after this list).
- Configure and manage cron jobs for data processing tasks.
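A minimal DAG sketch for the kind of orchestration described above, assuming Airflow 2.x with the apache-airflow-providers-apache-spark package installed and a "spark_default" connection pointing at the Spark master; the DAG id, script path, and schedule are placeholders.

```python
# Hypothetical Airflow 2.x DAG; assumes the apache-airflow-providers-apache-spark
# package and a "spark_default" connection configured for the cluster.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,                          # retry logic on task failure
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # alerting hook (requires SMTP config)
}

with DAG(
    dag_id="orders_etl_daily",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 2 * * *",         # cron-style daily schedule
    catchup=False,
    default_args=default_args,
) as dag:
    run_orders_etl = SparkSubmitOperator(
        task_id="run_orders_etl",
        application="/opt/jobs/orders_etl.py",   # hypothetical PySpark script path
        conn_id="spark_default",
        verbose=True,                            # keep spark-submit output in task logs
    )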
Trino (PrestoSQL) :
- Install, configure, and optimize Trino coordinator and worker nodes.
- Configure catalogs for sources such as S3 (via the Hive or Iceberg connector), MySQL, and PostgreSQL (see the query sketch after this list).
- Troubleshoot and optimize distributed SQL queries.
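For illustration, a short sketch of querying a Trino cluster from Python, assuming the trino client package (trino-python-client) and a catalog named "postgresql" defined on the cluster in etc/catalog/postgresql.properties (connector.name, connection-url, connection-user, connection-password); hostnames and table names are placeholders.

```python
# Hypothetical Trino query from Python; host, user, and table are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.internal",   # coordinator host (placeholder)
    port=8080,
    user="data_engineer",
    catalog="postgresql",                # catalog configured on the cluster
    schema="public",
)

cur = conn.cursor()
# EXPLAIN (or the Trino web UI) helps spot expensive stages when tuning
# distributed queries across worker nodes.
cur.execute("SELECT status, count(*) AS n FROM orders GROUP BY status")
for row in cur.fetchall():
    print(row)
```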
Linux/Ubuntu Administration :
- Maintain Linux servers: manage system services, logs, and environment variables; monitor memory usage; and resolve port conflicts.
Azure Data Engineering :
- Design and implement scalable architectures using Azure Data Services including ADF, Synapse, ADLS, Azure SQL, and Databricks.
- Develop and manage ETL/ELT pipelines using Azure Data Factory (Pipelines, Dataflows, Mapping Data Flows); a short triggering sketch follows this list.
- Perform data migration, upgrades, and modernization using Azure tools.
- Implement CI/CD pipelines for data workloads using Azure DevOps and Git.
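As a sketch of the ADF work above, here is one way to trigger and monitor a pipeline run from Python, assuming the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, factory, pipeline, and parameter names are placeholders.

```python
# Hypothetical ADF pipeline trigger; all resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
client = DataFactoryManagementClient(credential, "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="rg-data-platform",   # placeholder resource group
    factory_name="adf-orders",                # placeholder data factory
    pipeline_name="pl_load_orders",           # placeholder pipeline
    parameters={"load_date": "2025-01-12"},
)

# Poll the run status (e.g. InProgress, Succeeded, Failed).
status = client.pipeline_runs.get("rg-data-platform", "adf-orders", run.run_id)
print(status.status)
```

In practice the same call pattern can be wired into an Azure DevOps pipeline step as part of CI/CD for data workloads.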
Data Processing & Analytics :
- Work with structured, semi-structured, and unstructured data.
- Implement data analytics, transformation, and backup/recovery solutions.
- Ensure data quality, governance, lineage, metadata management, and security compliance.
- Design and optimize data models using star and snowflake schemas.
- Build data warehouses, Delta Lake tables, and Lakehouse architectures (see the Delta sketch after this list).
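A minimal sketch of an incremental Delta Lake upsert (merge) into a Lakehouse table, assuming the delta-spark package is configured on the Spark cluster; the paths and column names are placeholders.

```python
# Hypothetical Delta Lake merge; assumes delta-spark is configured on the cluster.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders_upsert").getOrCreate()

updates = spark.read.parquet("/data/staging/orders_increment")    # new/changed rows
target = DeltaTable.forPath(spark, "/lakehouse/curated/orders")   # existing Delta table

(target.alias("t")
       .merge(updates.alias("u"), "t.order_id = u.order_id")
       .whenMatchedUpdateAll()       # update rows that changed
       .whenNotMatchedInsertAll()    # insert rows that are new
       .execute())
```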
Reporting & Visualization :
- Build and enhance dashboards using Power BI, Tableau, or similar tools.
Collaboration & Documentation :
- Collaborate with internal teams, business stakeholders, and clients to gather requirements.
- Prepare documentation, runbooks, and operational guides.
Technical Skills Required :
Mandatory Skills (4 to 7 Years Experience) :
Apache Spark & PySpark :
- Spark installation and cluster configuration (Linux/Ubuntu)
- Master/worker setup (standalone/cluster mode)
- Spark UI & History Server setup and debugging
- PySpark development (ETL pipelines, UDFs, window functions)
- Performance tuning (partitioning, caching, shuffle optimization)
- spark-submit deployment with monitoring and logging
Apache Airflow :
- Installation & configuration (UI, scheduler, webserver)
- Building & scheduling DAGs
- Retry logic, alerting, log management
- Cron job scheduling
Trino (PrestoSQL) :
- Coordinator & worker node setup
- Catalog configuration (S3, RDBMS)
- Distributed SQL troubleshooting
Azure Data Services (Nice to Have) :
- Azure Data Factory
- Azure Synapse Analytics
- Azure SQL / Cosmos DB
- ADLS Gen2
- Azure Databricks (Delta, Notebooks, Jobs)
- Event Hubs, Stream Analytics
Microsoft Fabric (Nice to Have) :
- Lakehouse
- Warehouse
- Dataflows
- Notebooks
- Pipelines
Programming & Querying :
- Python
- PySpark
- SQL
- Scala
Data Modeling & Warehousing :
- Star schema, Snowflake schema
- Fact/dimension modeling
- Data warehouse & Lakehouse design
- Delta Lake
DevOps & CI/CD :
- Git / GitHub / Azure Repos
- Azure DevOps pipelines (CI/CD)
- Automated deployment for Spark, Airflow, ADF, Databricks, Fabric
BI Tools (Nice to Have) :
- Power BI
- Tableau
- Dataset creation, DAX, reporting
Linux/Ubuntu :
- Shell scripting
- Service & log management
- Environment variables
Soft Skills (4 to 7 Years Experience) :
- Excellent problem-solving and communication skills
- Strong organizational and time management abilities
- Ability to work effectively in a team environment
- Taking ownership of tasks end-to-end
- Production support and timely delivery
- Self-driven, flexible, and innovative
Certifications preferred : Microsoft Certified Azure Data Engineer Associate (DP-203) or Azure Database Administrator Associate (DP-300)
Understanding of DevOps and CI/CD practices in Azure
Qualification : BSc/BA in Computer Science, Engineering, or a related field
Posted in : Data Engineering
Functional Area : Data Engineering
Job Code : 1583396