hirist

Senior Scala Data Engineer

SysMind
4 - 10 Years
Multiple Locations

Posted on: 26/03/2026

Job Description

Location : Bengaluru / Hyderabad

Job Summary :

We are seeking a highly skilled and experienced Senior Scala Data Engineer to join our dynamic data team. In this role, you will be instrumental in designing, developing, and maintaining our next-generation data pipelines and platforms using Scala, Apache Spark, and cloud-native technologies.


You will work on challenging problems involving large-scale data ingestion, transformation, and processing, contributing directly to our analytical capabilities and product features.

Key Responsibilities :

- Design & Development : Architect, build, and optimize robust, scalable, and efficient data pipelines using Scala and Apache Spark (Spark Core, Spark SQL, Spark Streaming).

- Data Ingestion : Develop solutions for ingesting high-volume, high-velocity data from various sources (e.g., relational databases, NoSQL databases, APIs, message queues like Kafka, log files) into our data lake/warehouse.

- Data Transformation : Implement complex data transformations, aggregations, and feature engineering logic to prepare data for analytics, machine learning models, and operational systems.

- Performance Optimization : Identify and resolve performance bottlenecks in Spark jobs and data pipelines, ensuring optimal resource utilization and execution times.

- Data Quality & Governance : Implement data validation, monitoring, and alerting mechanisms to ensure data accuracy, completeness, and consistency. Contribute to data governance best practices.

- Cloud Infrastructure : Leverage and optimize cloud services (e.g., AWS EMR/Glue, Azure Databricks/Synapse, GCP DataProc/BigQuery) for data processing and storage.

- Automation & Orchestration : Design and implement automated workflows for data pipelines using tools like Apache Airflow, AWS Step Functions, or similar.
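To give a flavour of the day-to-day work, the transformation and aggregation responsibilities above can be sketched in plain Scala. This is an illustrative example only (the record shape and field names are invented for this sketch, not taken from the posting); the same filter/groupBy/aggregate shape carries over directly to Spark's Dataset API in a real pipeline.

```scala
// Illustrative sketch: a batch transformation on plain Scala collections.
// In a real pipeline the same shape would be expressed on a Spark Dataset
// (filter ~ Dataset.filter, groupBy ~ groupByKey, sum ~ agg(sum(...))).
final case class Event(userId: String, amount: Double, valid: Boolean)

def dailySpendPerUser(events: Seq[Event]): Map[String, Double] =
  events
    .filter(_.valid)        // drop records that failed validation
    .groupBy(_.userId)      // bucket events by user
    .map { case (user, evs) => user -> evs.map(_.amount).sum } // aggregate

val sample = Seq(
  Event("a", 1.0, valid = true),
  Event("a", 2.0, valid = true),
  Event("b", 5.0, valid = false) // invalid record, filtered out
)
val spend = dailySpendPerUser(sample) // Map("a" -> 3.0)
```

Writing the logic as small pure functions like this keeps it unit-testable independently of the Spark runtime, which is a common practice in Scala data engineering teams.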

Required Qualifications :

- Experience : 4+ years of professional experience in data engineering, with a strong focus on building large-scale data solutions.

- Scala Expertise : Proven advanced proficiency in the Scala programming language.

- Apache Spark : Deep hands-on experience with Apache Spark (Core, SQL, Streaming) for batch and real-time data processing.

- Cloud Platforms : Extensive experience with at least one major cloud provider (AWS, Azure, or GCP) and their relevant data services (e.g., AWS S3, EMR, Glue, Kinesis; Azure Data Lake, Databricks, Event Hubs; GCP GCS, DataProc, Pub/Sub).

- Data Warehousing : Strong understanding of data warehousing concepts, dimensional modeling (star/snowflake schemas), and ETL/ELT processes.

- SQL : Expert-level SQL skills for data querying, manipulation, and optimization.

- Distributed Systems : Experience working with distributed systems and understanding of their challenges (consistency, fault tolerance, concurrency).

- Version Control : Proficiency with Git and collaborative development workflows.

Nice-to-Haves :

- Streaming Technologies : Experience with real-time streaming platforms like Apache Kafka, Apache Flink, or Kinesis.

- Containerization & Orchestration : Experience with Docker, Kubernetes, and container orchestration for Spark applications.

- Data Orchestration Tools : Hands-on experience with Apache Airflow, Dagster, Prefect, or similar workflow management tools.

- NoSQL Databases : Experience with NoSQL databases such as Cassandra, MongoDB, DynamoDB, or HBase.

- Data Lakehouse/Modern DW : Experience with technologies like Delta Lake, Apache Iceberg, Snowflake, Redshift, or BigQuery.

- MLOps : Familiarity with MLOps principles and supporting data pipelines for machine learning models.

- CI/CD : Experience setting up and maintaining CI/CD pipelines for data engineering projects.

- Performance Tuning : Advanced knowledge of Spark performance tuning techniques, including memory management, shuffle optimization, and data partitioning strategies.

- Certifications : Relevant cloud certifications (AWS Certified Data Analytics, Azure Data Engineer Associate, GCP Professional Data Engineer) or Spark certifications.
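The data-partitioning strategies mentioned under performance tuning rest on one simple idea, sketched below in plain Scala. This is an assumption-laden illustration (the key type and partition count are arbitrary), not Spark's actual implementation, but it is the same hash-partitioning principle that underpins Spark's shuffle.

```scala
// Illustrative sketch of hash partitioning: every record's key maps
// deterministically to one of numPartitions buckets, so all records
// sharing a key land in the same partition after a shuffle.
// floorMod keeps the result non-negative even for negative hashCodes.
def partitionOf(key: String, numPartitions: Int): Int =
  math.floorMod(key.hashCode, numPartitions)

val p = partitionOf("user-42", 8) // stable value in the range 0 to 7
```

Because the mapping is deterministic, per-key aggregations after a shuffle see all of a key's records together; skewed key distributions breaking this balance is exactly what shuffle-optimization work targets.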
