Posted on: 16/12/2025
Description :
Job Title : Data Platform Engineer
Location : Remote, India
Experience : 5+ Years
Employment Type : Full-Time
Work Timings : Rotational Shifts (UK / US EST / US PST)
On-Call Support : Yes (24x7 Rotation)
Start Date : ASAP (Immediate / Early Joiners Preferred)
Job Summary :
We are seeking a highly skilled and motivated Data Platform Engineer to design, build, operate, and scale highly reliable data platforms that power mission-critical global products. This role is ideal for engineers with a strong SRE mindset who thrive on architecting distributed systems, driving automation, and ensuring high availability, performance, and resilience of complex data services.
As part of a planet-scale cloud services organization, you will work in an embedded SRE model, partnering closely with product and engineering teams to deliver reliable, observable, and scalable data infrastructure used by customers worldwide.
Role Description :
The Data Platform SRE team is responsible for operating and evolving large-scale data systems across bare-metal and cloud environments. In this role, you will take end-to-end ownership of data platform reliability, covering observability, SLO-driven operations, incident management, automation, and continuous performance optimization.
You will work with a diverse technology stack that includes open-source frameworks, vendor platforms, and internally developed tools, supporting globally distributed data centers and cloud-native environments.
Key Responsibilities :
Platform Reliability & Operations :
- Operate, support, and continuously improve production-grade data platforms at scale across on-premises and public cloud environments.
- Perform capacity planning, performance benchmarking, and stress testing for distributed data systems.
- Design and validate disaster recovery (DR) and business continuity strategies.
- Participate in a 24x7 on-call rotation to ensure service availability and rapid incident resolution.
Embedded SRE Collaboration :
- Partner closely with application and data engineering teams in a unified embedded SRE model.
- Drive and enforce standardized incident management, escalation, and postmortem practices.
- Define, measure, and maintain user-journey-based SLOs, SLIs, and error budgets.
- Build deep observability across systems using metrics, logs, and traces.
Data Platform Engineering :
- Architect, deploy, tune, and troubleshoot large-scale Big Data platforms, including but not limited to :
  - Apache Spark, Flink, Airflow
  - Hive, Hadoop/HDFS
  - Trino, Druid
- Manage and optimize storage and coordination systems such as Cassandra, ZooKeeper, Redis, and cloud-based block, file, and object storage solutions.
Automation & Tooling :
- Reduce operational toil through automation, self-healing systems, and platform engineering best practices.
- Design and develop tools and services using Python, Go, Java, or Scala.
- Build and maintain CI/CD pipelines for infrastructure and platform services.
- Implement configuration management and Infrastructure-as-Code (IaC) practices.
Infrastructure & Systems Engineering :
- Design and operate large-scale Kubernetes-based environments, with strong experience in EKS and/or GKE.
- Manage Linux systems, containers, virtualization, networking, and storage components.
- Perform deep root-cause analysis using structured, data-driven troubleshooting techniques.
Must-Have Skills :
- 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Data Infrastructure roles.
- Strong hands-on experience with Kubernetes in production environments.
- Proven experience with Amazon EKS and/or Google Kubernetes Engine (GKE).
- Proficiency in Python for automation and tooling (basic to mid-level coding).
- Strong Linux systems administration and troubleshooting experience.
- Hands-on expertise with Big Data technologies such as Spark, Flink, Airflow, Hive, Hadoop/HDFS, Trino, and Druid.
- Solid understanding of high-availability and distributed systems architecture.
- Experience with Terraform or other Infrastructure-as-Code tools.
Nice-to-Have Skills :
- Java development experience.
- Experience with Pulumi or advanced IaC frameworks.
- Prior experience operating platforms at global or planet scale.
- Exposure to multi-cloud or hybrid cloud architectures.
Why Join Us :
- Work on mission-critical, globally distributed data platforms.
- Be part of a strong SRE culture focused on reliability, automation, and engineering excellence.
- Opportunity to influence platform architecture and long-term reliability strategy.
- Fully remote role with exposure to global teams and cutting-edge data infrastructure.
Functional Area : Mobile Development - iOS
Job Code : 1590684