hirist

Data Scientist - Data Platforms & Distributed Systems

Recruiting Bond
4 - 8 Years
₹25-80 LPA
Bangalore

Posted on: 23/03/2026

Job Description


Data Scientist - Frontier AI for Data Platforms & Distributed Systems (4-8 Years)

Experience: 4-8 Years

Location: Bengaluru (On-site / Hybrid)

Company: Publicly Listed, Global Product Platform

About the Mission:

We are building a Top 1% AI-Native Engineering & Data Organization from first principles.

This is not incremental improvement.

This is a full-stack transformation of a large-scale enterprise into an AI-native data platform company.

We are re-architecting:

- Legacy systems → AI-native architectures
- Static pipelines → autonomous, self-healing systems
- Data platforms → intelligent, learning systems
- Software workflows → agentic execution layers

This is the kind of shift you would expect from companies like Google or Microsoft, except here, you will build it from day zero and scale it globally.

The Opportunity: This role sits at the intersection of three high-impact domains:

1. Frontier AI Systems : Large Language Models (LLMs), Small Language Models (SLMs), and Agentic AI

2. Data Platforms : Warehouses, Lakehouses, Streaming Systems, Query Engines

3. Distributed Systems : High-throughput, low-latency, multi-region infrastructure

We are building systems where:

- Data platforms optimize themselves using ML/LLMs
- Pipelines are autonomous, self-healing, and adaptive
- Queries are generated, optimized, and executed intelligently
- Infrastructure learns from usage and evolves continuously

This is AI as the control plane for data infrastructure.

What You'll Work On:

You will design and build AI-native systems deeply embedded inside data infrastructure.

1. AI-Native Data Platforms:

- Build LLM-powered interfaces: natural language → SQL / pipelines / transformations
- Design semantic data layers: embeddings, vector search, knowledge graphs
- Develop AI copilots for data engineers, analysts, and platform users

2. Autonomous Data Pipelines:

- Build self-healing ETL/ELT systems using AI agents
- Create pipelines that:
  - Detect anomalies in real time
  - Automatically debug failures
  - Dynamically optimize transformations

3. Intelligent Query & Compute Optimization:

- Apply ML/LLMs to:
  - Query planning and execution
  - Cost-based optimization using learned models
  - Workload prediction and scheduling
- Build systems that:
  - Learn from query patterns
  - Continuously improve performance and cost efficiency

4. Distributed Data + AI Infrastructure:

- Architect systems operating at billions of events per day and petabyte-scale data
- Work with:
  - Distributed compute engines (Spark / Flink / Ray-class systems)
  - Streaming systems (Kafka-class infra)
  - Vector databases and hybrid retrieval systems

5. Learning Systems & Feedback Loops:

- Build closed-loop AI systems: execution → feedback → model updates
- Develop:
  - Continual learning pipelines
  - Online learning systems for infra optimization
  - Experimentation frameworks (A/B tests, bandits, eval pipelines)

6. LLM & Agentic Systems (Infra-Aware):

- Build agents that understand data systems
- Enable:
  - Autonomous pipeline debugging
  - Root cause analysis for infra failures
  - Intelligent orchestration of data workflows

What We're Looking For:

Core Foundations:

- Strong grounding in:
  - Machine learning, deep learning, NLP
  - Statistics, optimization, probabilistic systems
  - Distributed systems fundamentals
- Deep understanding of:
  - Transformer architectures
  - Modern LLM ecosystems

Hands-On Expertise:

- Experience building:
  - LLM / GenAI systems (RAG, fine-tuning, embeddings)
  - Data platforms (warehouse, lake, lakehouse architectures)
  - Distributed pipelines and compute systems
- Strong programming skills:
  - Python (ML/AI stack)
  - SQL (deep understanding of query planning, an optimization mindset)

Systems Thinking (Critical):

You think in systems, not components.

Built or worked on:

- Large-scale data pipelines
- High-throughput distributed systems
- Low-latency, high-concurrency architectures

Understand:

- Query optimization and execution
- Data partitioning, indexing, caching
- Trade-offs in distributed systems

What Sets You Apart (Top 1%):

- Built AI-powered data platforms or infra systems in production
- Designed or contributed to:
  - Query engines / optimizers
  - Data observability / lineage systems
  - AI-driven infra or AIOps platforms
- Experience with:
  - Multi-modal AI (logs, metrics, traces, text)
  - Agentic AI systems
  - Autonomous infrastructure
- Worked on systems at a scale comparable to:
  - Google (BigQuery-like systems)
  - Meta (real-time analytics infra)
  - Snowflake / Databricks (lakehouse architectures)

Ideal Background (Not Mandatory):

We often see strong candidates from:

- Data infrastructure or platform engineering teams
- AI-first startups or research-driven environments
- High-scale product companies

Experience building:

- Internal platforms used by thousands of engineers
- Systems serving millions of users / high-throughput workloads
- Multi-region, distributed cloud systems

The Kind of Problems You'll Solve:

- Can LLMs replace traditional query optimizers?
- How do we build self-healing data pipelines at scale?
- Can data systems learn from every query and improve automatically?
- How do we embed reasoning and planning into infrastructure layers?
- What does a fully autonomous data platform look like?

Backgrounds We Commonly See (But Not Limited To):

Our team often includes engineers from top-tier institutions and strong research or product backgrounds, including:

- Leading engineering schools in India and globally
- Engineers with experience at top product companies, AI startups, or research-driven environments

That said, we care far more about demonstrated ability, depth, and impact than pedigree alone.

