Posted on: 23/03/2026
Description:
Data Scientist: Frontier AI for Data Platforms & Distributed Systems (4-8 Years)
Experience: 4-8 Years
Location: Bengaluru (On-site / Hybrid)
Company: Publicly Listed, Global Product Platform
About the Mission:
We are building a Top 1% AI-Native Engineering & Data Organization from first principles.
This is not incremental improvement.
This is a full-stack transformation of a large-scale enterprise into an AI-native data platform company.
We are re-architecting:
- Legacy systems → AI-native architectures
- Static pipelines → autonomous, self-healing systems
- Data platforms → intelligent, learning systems
- Software workflows → agentic execution layers
This is the kind of shift you would expect from companies like Google or Microsoft, except here you will build it from day zero and scale it globally.
The Opportunity: This role sits at the intersection of three high-impact domains:
1. Frontier AI Systems : Large Language Models (LLMs), Small Language Models (SLMs), and Agentic AI
2. Data Platforms : Warehouses, Lakehouses, Streaming Systems, Query Engines
3. Distributed Systems : High-throughput, low-latency, multi-region infrastructure
We are building systems where:
- Data platforms optimize themselves using ML/LLMs
- Pipelines are autonomous, self-healing, and adaptive
- Queries are generated, optimized, and executed intelligently
- Infrastructure learns from usage and evolves continuously
In short: AI as the control plane for data infrastructure.
What You'll Work On:
You will design and build AI-native systems deeply embedded inside data infrastructure.
1. AI-Native Data Platforms:
Build LLM-powered interfaces:
- Natural language → SQL / pipelines / transformations
Design semantic data layers:
- Embeddings, vector search, knowledge graphs
Develop AI copilots:
- For data engineers, analysts, and platform users
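To give a flavor of the semantic-layer work above, here is a minimal, self-contained sketch of embedding-based table retrieval. The catalog, the hand-made 3-d embedding vectors, and the function names are purely illustrative, not our production stack; a real system would use a trained embedding model and a vector database.

```python
import math

# Toy in-memory semantic layer: map table descriptions to vectors and
# retrieve the most relevant table for a natural-language question.
CATALOG = {
    "orders": [0.9, 0.1, 0.0],  # transactional / sales concepts
    "users":  [0.1, 0.9, 0.0],  # identity / account concepts
    "events": [0.0, 0.2, 0.9],  # clickstream / telemetry concepts
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_table(query_vec):
    """Return the catalog table whose embedding is closest to the query."""
    return max(CATALOG, key=lambda t: cosine(CATALOG[t], query_vec))

# A question like "total revenue last month" would embed near the
# sales axis and resolve to the orders table:
print(nearest_table([0.8, 0.2, 0.1]))  # orders
```

The same retrieval step is what grounds an NL→SQL copilot: resolve the question to the right tables first, then generate the query against their schemas.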
2. Autonomous Data Pipelines:
- Build self-healing ETL/ELT systems using AI agents
Create pipelines that:
- Detect anomalies in real time
- Automatically debug failures
- Dynamically optimize transformations
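The self-healing behavior above can be sketched in a few lines: run a step with retries and a fallback, and flag anomalous output volumes against recent history. The function names, the z-score threshold, and the row-count example are illustrative assumptions, not a specific framework's API.

```python
import statistics

def run_with_healing(step, retries=2, fallback=None):
    """Run step(), retrying on exceptions; use the fallback if all retries fail."""
    for _ in range(retries + 1):
        try:
            return step()
        except Exception:
            continue
    return fallback() if fallback else None

def is_anomalous(history, value, z_threshold=3.0):
    """Flag value if it deviates more than z_threshold std-devs from history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold

history = [100, 102, 98, 101, 99]   # recent row counts per pipeline run
print(is_anomalous(history, 100))   # False: within the normal range
print(is_anomalous(history, 500))   # True: likely an upstream failure
```

In the real systems, an agent replaces the static fallback: it inspects the failure, proposes a fix, and re-runs the step.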
3. Intelligent Query & Compute Optimization:
Apply ML/LLMs to:
- Query planning and execution
- Cost-based optimization using learned models
- Workload prediction and scheduling
Build systems that:
- Learn from query patterns
- Continuously improve performance and cost efficiency
4. Distributed Data + AI Infrastructure:
Architect systems operating at:
- Billions of events per day
- Petabyte-scale data
Work with:
- Distributed compute engines (Spark / Flink / Ray class systems)
- Streaming systems (Kafka-class infra)
- Vector databases and hybrid retrieval systems
5. Learning Systems & Feedback Loops:
Build closed-loop AI systems:
- Execution → feedback → model updates
Develop:
- Continual learning pipelines
- Online learning systems for infra optimization
- Experimentation frameworks (A/B, bandits, eval pipelines)
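As one concrete instance of the bandit-style experimentation mentioned above, here is a toy epsilon-greedy bandit choosing between two candidate query plans by observed reward (negative latency). The plan names, latencies, and class design are made up for illustration.

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy bandit: mostly exploit the best-mean arm, sometimes explore."""
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.arms}
        self.totals = {a: 0.0 for a in self.arms}

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore
        # exploit: arm with the best observed mean reward so far
        return max(self.arms, key=lambda a:
                   self.totals[a] / self.counts[a] if self.counts[a] else 0.0)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward

bandit = EpsilonGreedy(["hash_join", "merge_join"], epsilon=0.1, seed=42)
true_reward = {"hash_join": -50.0, "merge_join": -120.0}  # negative latency (ms)
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, true_reward[arm])
print(max(bandit.counts, key=bandit.counts.get))  # hash_join
```

The same loop generalizes from join strategies to any plan, layout, or scheduling decision the platform can measure.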
6. LLM & Agentic Systems (Infra-Aware):
- Build agents that understand data systems
Enable:
- Autonomous pipeline debugging
- Root cause analysis for infra failures
- Intelligent orchestration of data workflows
What We're Looking For:
Core Foundations:
Strong grounding in:
- Machine Learning, Deep Learning, NLP
- Statistics, optimization, probabilistic systems
- Distributed systems fundamentals
Deep understanding of:
- Transformer architectures
- Modern LLM ecosystems
Hands-On Expertise:
Experience building:
- LLM / GenAI systems (RAG, fine-tuning, embeddings)
- Data platforms (warehouse, lake, lakehouse architectures)
- Distributed pipelines and compute systems
Strong programming skills:
- Python (ML/AI stack)
- SQL (deep understanding of query planning, optimization mindset)
Systems Thinking (Critical):
You think in systems, not components.
Built or worked on:
- Large-scale data pipelines
- High-throughput distributed systems
- Low-latency, high-concurrency architectures
Understand:
- Query optimization and execution
- Data partitioning, indexing, caching
- Trade-offs in distributed systems
What Sets You Apart (Top 1%):
- Built AI-powered data platforms or infra systems in production
Designed or contributed to:
- Query engines / optimizers
- Data observability / lineage systems
- AI-driven infra or AIOps platforms
Experience with:
- Agentic AI systems
- Autonomous infrastructure
Worked on systems at scale comparable to:
- Google (BigQuery-like systems)
- Meta (real-time analytics infra)
- Snowflake / Databricks (lakehouse architectures)
Ideal Background (Not Mandatory):
We often see strong candidates from:
- Data infrastructure or platform engineering teams
- AI-first startups or research-driven environments
- High-scale product companies
Experience building:
- Internal platforms used by 1000s of engineers
- Systems serving millions of users / high throughput workloads
- Multi-region, distributed cloud systems
The Kind of Problems You'll Solve:
- Can LLMs replace traditional query optimizers?
- How do we build self-healing data pipelines at scale?
- Can data systems learn from every query and improve automatically?
- How do we embed reasoning and planning into infrastructure layers?
- What does a fully autonomous data platform look like?
Backgrounds We Commonly See (But Not Limited To):
Our team often includes engineers from top-tier institutions and strong research or product backgrounds, including:
- Leading engineering schools in India and globally
- Engineers with experience in top product companies, AI startups, or research-driven environments
That said, we care far more about demonstrated ability, depth, and impact than pedigree alone.