
Senior Data Engineer - Big Data/PySpark

Virtusa
Anywhere in India/Multiple Locations
5 - 8 Years

Posted on: 27/11/2025

Job Description

We are looking for a highly skilled Senior Data Engineer with strong experience in building scalable, resilient, and high-performance data pipelines across batch and streaming architectures. The ideal candidate will have deep expertise in cloud platforms (preferably Azure), big-data frameworks, CDC pipelines, and modern Lakehouse technologies such as Delta Lake and Databricks. You will collaborate with cross-functional teams, design end-to-end ingestion frameworks, optimize performance, and ensure data quality across all layers of the platform.


Key Responsibilities:

Data Ingestion & Processing:

- Design and implement streaming data ingestion pipelines using Apache Kafka, Confluent Cloud, and Delta Live Tables.
- Build and maintain batch ingestion workflows leveraging Databricks, PySpark, and cloud-native orchestration tools.
- Use Confluent Kafka Connectors for CDC pipelines, including configuration, monitoring, troubleshooting, and optimization.
- Implement Change Data Capture (CDC) solutions using Debezium, SQL Server CDC, Oracle GoldenGate, or equivalent tools.
- Develop efficient PySpark jobs for large-scale data transformation, enrichment, validation, and upsert workloads (a sketch of a typical upsert follows this list).
- Tune performance, including Spark job optimization, partitioning strategies, shuffle minimization, and resource utilization.
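As a rough illustration of the upsert workloads described above, the sketch below applies a batch of CDC-style change records to a Delta table with a MERGE. It is a minimal example only: the paths, table layout, and column names (customer_id, op, event_ts) are assumptions for illustration, not details from this role.

```python
# Minimal upsert sketch: apply CDC-style change records to a Delta table.
# All paths, table names, and column names below are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical CDC batch already landed in the Bronze layer.
changes = spark.read.format("delta").load("/mnt/bronze/customers_cdc")

# Keep only the latest change per key so the MERGE sees at most one row per key.
latest = (
    changes
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("customer_id").orderBy(F.col("event_ts").desc())
        ),
    )
    .filter("rn = 1")
    .drop("rn")
)

target = DeltaTable.forPath(spark, "/mnt/silver/customers")

(
    target.alias("t")
    .merge(latest.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'delete'")
    .whenMatchedUpdateAll(condition="s.op <> 'delete'")
    .whenNotMatchedInsertAll(condition="s.op <> 'delete'")
    .execute()
)
```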


Data Quality, Monitoring & Operations:

- Implement robust data validation, deduplication, and reconciliation processes to ensure high-quality, reliable data loads (see the reconciliation sketch after this list).
- Analyze and document data load performance metrics, including elapsed time, throughput, and scalability benchmarks.
- Monitor pipelines for partial loads, duplicate records, and schema evolution, and handle operational exceptions.
- Build alerting, logging, and operational dashboards using tools such as Datadog, Azure Monitor, or CloudWatch.
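As one concrete (and deliberately simplified) example of the validation and reconciliation work above, the sketch below compares source and target row counts after a load and flags duplicate business keys. The Bronze/Silver paths and the order_id key column are hypothetical, chosen only to make the example runnable.

```python
# Minimal post-load quality check: row-count reconciliation plus duplicate
# detection on a business key. Paths and the key column are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

source_count = spark.read.format("delta").load("/mnt/bronze/orders").count()

target = spark.read.format("delta").load("/mnt/silver/orders")
target_count = target.count()

# Count business keys that appear more than once in the target table.
duplicate_keys = (
    target.groupBy("order_id")
    .count()
    .filter(F.col("count") > 1)
    .count()
)

if target_count != source_count or duplicate_keys > 0:
    # In a real pipeline this would raise an alert (Datadog, Azure Monitor, ...)
    # rather than only failing the job.
    raise ValueError(
        f"Reconciliation failed: source={source_count}, "
        f"target={target_count}, duplicate keys={duplicate_keys}"
    )
```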


Architecture & Modelling:

- Design scalable Lakehouse architectures using Delta Lake, Unity Catalog, and medallion patterns (Bronze/Silver/Gold).
- Develop data models, partitioning strategies, and table optimization techniques such as Z-Ordering, VACUUM, and OPTIMIZE (a maintenance sketch follows this list).
- Implement best practices for metadata management, governance, and lineage.
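For illustration, a routine Delta table maintenance job covering the OPTIMIZE, Z-Ordering, and VACUUM techniques above might look like the sketch below. The silver.customers table, the Z-Order column, and the retention window are assumptions; real retention settings depend on your time-travel and audit requirements.

```python
# Minimal Delta table maintenance sketch: compact small files, co-locate rows
# by a commonly filtered column, and clean up unreferenced data files.
# Table name, Z-Order column, and retention period are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files and apply Z-Ordering for better data skipping on reads.
spark.sql("OPTIMIZE silver.customers ZORDER BY (customer_id)")

# Remove files no longer referenced by the table, keeping 7 days of history
# so time travel and long-running readers are not broken.
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")
```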


Cloud Platform Expertise:

- Hands-on experience with Azure (ADLS Gen2, ADF, Event Hub, Azure Functions, Synapse, Key Vault, Azure DevOps).
- Exposure to AWS services such as S3, Glue, Lambda, Kinesis, and IAM is a plus.
- Build and optimize compute clusters using Databricks (cluster configs, job clusters, autoscaling, DBX CLI).


Programming & Tools:

- Strong programming experience with Python, Scala, or Java, focusing on data transformation and distributed processing.
- Advanced SQL skills for analytics, transformations, and performance optimization.
- Familiarity with Git, CI/CD pipelines, code reviews, and automated deployments.
- Experience with infrastructure-as-code tools like Terraform, ARM templates, or CloudFormation is advantageous.


Collaboration & Leadership:

- Work with product owners, data modellers, and business stakeholders to interpret requirements and translate them into scalable technical solutions.
- Mentor junior engineers and establish engineering best practices across the team.
- Communicate complex technical concepts clearly to both technical and non-technical stakeholders.


Required Qualifications:

- 5+ years of experience designing and building data pipelines using Apache Spark, Databricks, or equivalent big-data frameworks.
- Strong hands-on knowledge of Kafka, Confluent Cloud, Event Hub, and messaging/streaming ecosystems.
- Expertise in CDC pipelines, relational databases (SQL Server, Oracle, PostgreSQL), and event-driven processing.
- Experience with Azure or AWS cloud services for data ingestion, compute, and orchestration.
- Solid understanding of data warehousing, Lakehouse architectures, Delta Lake, and data modelling.
- Proficiency with DevOps practices, Git-based workflows, and CI/CD frameworks.
- Strong analytical, problem-solving, and communication skills.


Preferred Qualifications:

- Experience with Delta Live Tables (DLT), Databricks Workflows, and Unity Catalog.
- Knowledge of RabbitMQ, Azure Event Hub, or other messaging platforms.
- Exposure to ML pipelines or feature engineering frameworks (nice to have).
- Certifications in Azure, Databricks, or Confluent.

