
Blue Cloud Softech - Senior Data Lake Developer - AWS Platform

Blue Cloud Softech Solutions Limited
Anywhere in India/Multiple Locations
5 - 10 Years

Posted on: 23/07/2025

Job Description

About the Role:

As part of our commitment to driving innovation through data, we are building a best-in-class advanced analytics practice. We are seeking a Senior Data Lake Developer with proven expertise in architecting, developing, and optimizing scalable cloud-based data lake platforms on AWS. You will be central to shaping the data architecture and pipelines that power enterprise-wide data analytics, machine learning models, and intelligent insights.

This role demands deep technical proficiency across AWS data services, Delta Lake architectures, and big data tools such as Apache Spark and Databricks. You will drive initiatives in designing robust and performant data lakes, streamlining data ingestion and transformation workflows, and ensuring consistent, secure, and scalable access to high-quality data assets.


Key Responsibilities:


Data Lake Architecture & Development:


- Design, build, and scale AWS-based data lakes, ensuring high performance, resilience, and extensibility.

- Engineer data ingestion and transformation pipelines using AWS Glue, Apache Spark, or EMR.

- Implement Delta Lake patterns on S3 using technologies such as Databricks Delta or Apache Hudi for efficient ACID transactions and time-travel capabilities (see the sketch after this list).

- Optimize data storage formats (Parquet, ORC, Avro) for scalable data processing and querying.
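
To ground the Delta Lake bullet above, here is a minimal PySpark sketch of an ACID write and a time-travel read on S3. It assumes a cluster with the open-source delta-spark package configured; the bucket, prefix, and schema are illustrative placeholders, not details from this posting.

```python
# Minimal Delta Lake sketch: ACID writes and time travel on S3.
# Assumes delta-spark is on the cluster and S3 credentials are
# configured; all names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-lake-sketch")
    # Delta's Spark extensions, shipped with the delta-spark package.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

path = "s3://example-datalake/bronze/orders"  # placeholder bucket/prefix

# Delta stores Parquet data files plus a transaction log; the log is what
# provides ACID guarantees and versioning on top of S3 object storage.
df = spark.createDataFrame([(1, "open"), (2, "shipped")], ["order_id", "status"])
df.write.format("delta").mode("overwrite").save(path)

# Each subsequent commit becomes a new, queryable table version.
updates = spark.createDataFrame([(3, "open")], ["order_id", "status"])
updates.write.format("delta").mode("append").save(path)

# Time travel: read the table exactly as it was at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```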

Data Engineering & ETL:

- Develop robust ETL/ELT frameworks using tools such as AWS Glue, Informatica, or Apache NiFi to orchestrate structured and unstructured data ingestion (a Glue job skeleton follows this list).

- Build real-time and batch data pipelines leveraging services like Kinesis, DMS, and Lambda.

- Ensure data quality, lineage, and governance via AWS Lake Formation and Glue Data Catalog.
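
To make the ETL bullets concrete, a minimal skeleton of an AWS Glue PySpark job: read a catalogued source, filter it, and write partitioned Parquet back to S3. The database, table, and path names are illustrative placeholders, not details from this posting.

```python
# Skeleton AWS Glue PySpark job (placeholder names throughout).
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source table registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"
)

# Basic quality gate, then switch to a DataFrame for Spark SQL work.
df = dyf.toDF().filter("order_id IS NOT NULL")

# Columnar Parquet, partitioned for efficient downstream queries.
df.write.mode("append").partitionBy("order_date").parquet(
    "s3://example-datalake/curated/orders/"
)

job.commit()
```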

Data Modeling & Warehousing:

- Design and implement dimensional models (Star, Snowflake, Flattened) supporting both OLTP and OLAP workloads (sketched below).

- Architect and optimize data warehouses in Snowflake, Redshift, Teradata, or SAP HANA.

- Collaborate with analytics and data science teams to translate business logic into performant data models and materialized views.
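
As a reference point for the dimensional-modeling bullet above, a short PySpark sketch of star-schema population: a staged feed is joined to dimension tables to resolve surrogate keys, and the fact table keeps only keys and additive measures. All table and column names are hypothetical.

```python
# Star-schema fact load sketch (hypothetical tables and columns).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

staged = spark.read.parquet("s3://example-datalake/staged/sales/")
dim_customer = spark.read.parquet("s3://example-datalake/dims/customer/")
dim_date = spark.read.parquet("s3://example-datalake/dims/date/")

# Classic star-schema separation: descriptive attributes stay on the
# dimensions; the fact table carries surrogate keys plus measures.
fact_sales = (
    staged
    .join(dim_customer.select("customer_id", "customer_sk"), on="customer_id")
    .join(
        dim_date.select("date", "date_sk").withColumnRenamed("date", "sale_date"),
        on="sale_date",
    )
    .select("customer_sk", "date_sk", "quantity", "net_amount")
)

fact_sales.write.mode("append").parquet("s3://example-datalake/facts/sales/")
```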

Platform Engineering & DevOps:

- Apply DevOps best practices to manage infrastructure-as-code (e.g., using CloudFormation or Terraform); see the sketch after this list.

- Maintain and version data engineering code using Git and CI/CD workflows.

- Debug, monitor, and resolve data lake reliability, performance, and security issues.
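
For the infrastructure-as-code bullet, a minimal sketch of driving CloudFormation from Python with boto3: an inline template declares a versioned, encrypted data-lake bucket. In practice the template would live in version control; the stack name and template contents here are placeholders.

```python
# Sketch: create a CloudFormation stack for a data-lake bucket via boto3.
# Template and stack name are placeholders, not from this posting.
import json

import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "DataLakeBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                # Versioning and default encryption are common data-lake
                # baselines; tighten per your governance standards.
                "VersioningConfiguration": {"Status": "Enabled"},
                "BucketEncryption": {
                    "ServerSideEncryptionConfiguration": [
                        {
                            "ServerSideEncryptionByDefault": {
                                "SSEAlgorithm": "aws:kms"
                            }
                        }
                    ]
                },
            },
        }
    },
}

cfn = boto3.client("cloudformation")
cfn.create_stack(StackName="datalake-storage", TemplateBody=json.dumps(template))
```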

Collaboration & Delivery:

- Partner with cross-functional teams (Data Science, BI, Product) to understand business data requirements and deliver fit-for-purpose solutions.

- Lead architectural decisions related to scalability, performance optimization, and technology evaluation.

- Operate within Agile Scrum teams, contributing to sprint planning, story grooming, and iterative delivery cycles.


Required Skills & Experience:

- 5+ years of experience designing and operating AWS cloud-native data lake architectures

- Strong hands-on knowledge of AWS services: S3, Glue, Lake Formation, EMR, Redshift, RDS, DMS, Kinesis

- 3+ years working with Apache Spark for large-scale distributed data processing

- Proven experience with Delta Lake implementations using Databricks, Apache Hudi, or equivalent frameworks

- Strong command of data modeling techniques, especially for large-scale warehousing solutions (OLTP + OLAP)

- Proficiency in ETL/ELT development using tools such as AWS Glue, Informatica, or Talend

- Solid programming skills in Python, Scala, Java, or R

- Knowledge of data formats, schema evolution, and metadata management in a data lake environment

- Bachelor's degree in Computer Science, Data Engineering, or a related technical field

- Experience working in Agile delivery environments and comfort using Agile collaboration tools (e.g., Jira, Confluence)


Preferred Qualifications:


- AWS certification in Big Data, Data Analytics, or Solutions Architect

- Exposure to DataOps, data security, and governance frameworks (IAM, encryption, Lake Formation policies)

- Familiarity with containerized data applications using Docker or Kubernetes

- Experience supporting ML workflows or real-time analytics on data lakes

