Posted on: 10/04/2026
The Role :
Staff Engineers own systems, not features.
- You are the person the team comes to when the problem is genuinely hard, the architecture has a structural flaw, or the incident has gone on too long.
- You do not wait to be assigned the hard problem: you identify it, scope it, design the solution, and drive the implementation.
- You own the technical delivery for a major platform subsystem, from design through production to monitoring
- Your design documents pre-empt the questions engineers will ask six months later
- You define the SLOs for your systems, build the dashboards, and own the on-call escalation path
- Senior engineers grow faster because of your code reviews: you teach, you don't just correct
The Platform You're Joining :
- You are not joining a travel app.
- You are joining the engineering team building India's most intelligent travel commerce operating system: a billion-dollar marketplace that connects 100M+ travellers to flights, hotels, buses, trains, and corporate travel every year.
Core Responsibilities :
System Design & Ownership :
- Own end-to-end design and delivery of complex distributed system components: from API contract and data model through implementation, testing, deployment, and monitoring
- Lead system design reviews for your scope: your review catches inconsistent failure modes, missing idempotency guards, and hidden latency bottlenecks before they reach production
- Define the operational model for your systems: runbooks, escalation paths, SLO thresholds, capacity planning estimates, and DR playbooks
- Drive performance engineering: baseline benchmarking, load testing (k6/Gatling), bottleneck analysis (async-profiler, JFR), and optimisation implementation
Distributed Systems Delivery :
- Build fault-tolerant microservices with proper circuit breakers, bulkhead isolation, retry with exponential backoff, and timeout budgets aligned to the upstream SLA
- Design and implement event-driven workflows on Kafka: consumer group isolation, offset management, exactly-once processing, and dead-letter queue handling
- Implement multi-tier caching: define what goes in L1 (Caffeine), L2 (Redis), and what must always be fetched live; build cache warming and invalidation pipelines
- Build idempotent APIs: define idempotency key strategies, implement deduplication stores, and handle concurrent duplicate request scenarios gracefully (a sketch follows this list)
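A minimal sketch of the deduplication idea from the last bullet, in Java. The in-process map stands in for a shared dedup store (in production, something like a Redis SET NX or a DynamoDB conditional write with a TTL); `IdempotentBookingHandler` and `executeBooking` are illustrative names, not an existing internal API.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: an idempotency-key guard in front of a side-effecting booking call.
public class IdempotentBookingHandler {

    // idempotency key -> in-flight or completed result. Concurrent duplicates
    // attach to the same future instead of triggering a second booking.
    private final Map<String, CompletableFuture<String>> dedupStore =
            new ConcurrentHashMap<>();

    public CompletableFuture<String> book(String idempotencyKey, String request) {
        CompletableFuture<String> fresh = new CompletableFuture<>();
        CompletableFuture<String> existing =
                dedupStore.putIfAbsent(idempotencyKey, fresh);
        if (existing != null) {
            // Retried or concurrent duplicate: return the original outcome.
            return existing;
        }
        try {
            fresh.complete(executeBooking(request)); // side effect runs once
        } catch (Exception e) {
            dedupStore.remove(idempotencyKey, fresh); // clear the slot so a clean retry can proceed
            fresh.completeExceptionally(e);
        }
        return fresh;
    }

    private String executeBooking(String request) {
        return "PNR-PLACEHOLDER"; // stand-in for the real booking call
    }
}
```

The essential move is the atomic `putIfAbsent`: whichever request claims the key first performs the work, and every duplicate, however it interleaves, observes that single result.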
AI/ML Integration :
- Build the serving infrastructure for ML models in your domain: real-time prediction API endpoints, async batch scoring pipelines, model warm-up, and latency monitoring
- Implement real-time feature computation pipelines feeding the Feast feature store: Kafka Streams processing, feature transformation, and online store writes
- Integrate RAG retrieval APIs into product flows: vector store queries, document retrieval, context assembly, and response validation pipelines
- Build the infrastructure for agentic workflows: Temporal activity implementations for booking/modification/cancellation tool-use APIs (see the sketch after this list)
- Implement Voice AI backend: ASR webhook receivers, intent parsing API, session context management for multi-turn spoken booking flows
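For the agentic workflow bullet above, a minimal sketch using the public Temporal Java SDK. The type and method names (`BookingActivities`, `AgentBookingWorkflow`, `createBooking`) are hypothetical stand-ins, not this platform's actual API; the point is that retries, timeouts, and compensation live in durable workflow code rather than in the agent's prompt.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

// Tool-use operations an LLM agent can invoke. Activities run at-least-once
// under retries, so their implementations must themselves be idempotent.
@ActivityInterface
interface BookingActivities {
    @ActivityMethod
    String createBooking(String itineraryJson);

    @ActivityMethod
    void cancelBooking(String bookingId); // compensation for a failed later step
}

@WorkflowInterface
interface AgentBookingWorkflow {
    @WorkflowMethod
    String run(String itineraryJson);
}

class AgentBookingWorkflowImpl implements AgentBookingWorkflow {
    // Retry and timeout policy is declared once, per activity stub.
    private final BookingActivities tools = Workflow.newActivityStub(
            BookingActivities.class,
            ActivityOptions.newBuilder()
                    .setStartToCloseTimeout(Duration.ofSeconds(30))
                    .setRetryOptions(RetryOptions.newBuilder()
                            .setMaximumAttempts(3)
                            .build())
                    .build());

    @Override
    public String run(String itineraryJson) {
        String bookingId = tools.createBooking(itineraryJson);
        // A real workflow would chain further steps here and invoke
        // tools.cancelBooking(bookingId) as compensation if one fails.
        return bookingId;
    }
}
```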
Platform Work Across All Six Verticals :
Flights :
- Engineering problem: fare cache staleness under airline API instability
- What you'll own & build: the fare caching layer, with TTL strategies, event-driven invalidation, stampede prevention, and fallback heuristics when live pricing is unavailable
Hotels :
- Engineering problem: real-time availability sync for 500K+ properties
- What you'll own & build: the availability sync pipeline, with webhook ingestion, change-data-capture from supplier APIs, cache warming, and consistency reconciliation at checkout
The Hard Engineering Problems You'll Face :
- Across all six platforms, the engineering challenges are real, non-trivial, and consequential:
Cache Invalidation at Speed :
- Fare data has a 30-second freshness window.
- A stale cache hit in the booking flow means a pricing error, a failed checkout, or a lost trust signal.
- Multi-tier cache design (L1/L2/L3), TTL strategies, event-driven invalidation via Kafka, and cache stampede prevention are all live problems; see the L1 sketch after this list.
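A minimal sketch of the L1 tier under these constraints, assuming Caffeine as the in-process cache (`FareCache` and `fetchLiveFare` are illustrative). Caffeine's loading cache coalesces concurrent loads per key, which is the in-process half of stampede prevention; refresh-ahead below the 30-second TTL keeps hot fares fresh without synchronous misses.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

// Sketch: L1 fare cache with a hard 30-second freshness bound.
class FareCache {
    private final LoadingCache<String, String> fares = Caffeine.newBuilder()
            .maximumSize(100_000)
            // Hard freshness window from the fare data contract.
            .expireAfterWrite(Duration.ofSeconds(30))
            // Refresh ahead of expiry: readers keep the old value while a
            // single background load fetches the replacement.
            .refreshAfterWrite(Duration.ofSeconds(20))
            // Concurrent misses on one key share a single load, so a hot key
            // expiring does not stampede the pricing backend from this process.
            .build(this::fetchLiveFare);

    String get(String fareKey) {
        return fares.get(fareKey);
    }

    private String fetchLiveFare(String fareKey) {
        return "priced:" + fareKey; // placeholder for the live pricing call
    }
}
```

The L2 (Redis) tier still needs its own coalescing, such as a per-key lock or single-flight layer, since in-process coalescing only protects against stampedes from a single node.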
AI-First Engineering Mandate :
- Platform Engineers at every level are responsible for building systems that AI and ML can run on and, increasingly, systems that are AI themselves.
- ML Serving Infrastructure: your APIs must serve model predictions at p99 < 20ms with graceful fallbacks; you design the latency budget allocation (see the sketch after this list)
- Feature Pipeline Engineering: real-time feature computation (Kafka Streams, Flink) feeding the feature store at sub-second freshness
- RAG Backend Systems: vector store integration, embedding generation pipelines, document chunking and indexing for knowledge retrieval
- Agentic Workflow Infrastructure: durable execution systems (Temporal) for multi-step LLM agent workflows with retry and compensation logic
- Voice AI Backend: ASR request routing, low-latency TTS pipelines, spoken intent API design for conversational booking flows
- Recommendation API Design: serving infrastructure for collaborative filtering, session-based models, and personalised ranking endpoints
- Price Intelligence Pipelines: real-time competitive price ingestion, fare change event streaming, lower-price guarantee trigger systems
- A/B Experiment Infrastructure: feature flags, traffic splitting, metric collection, and experiment configuration systems
- MCP Tool Orchestration: building the tool-use APIs that LLM agents call to execute booking, modify, and cancel operations safely
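For the ML serving item at the top of this list, a minimal sketch of latency-budgeted prediction with a graceful fallback, in plain Java: the model call gets a fixed slice of the p99 budget, and a deterministic ranker answers if that slice is exhausted or the call errors. `rankWithModel` and `rankByPopularity` are hypothetical stand-ins for real endpoints, and 15 ms is an assumed allocation, not the platform's actual budget.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch: give the model a slice of the latency budget, fall back on breach.
class RankedSearch {
    private static final long MODEL_BUDGET_MS = 15; // assumed slice of the 20 ms p99 budget

    CompletableFuture<String> rank(String query) {
        return rankWithModel(query)
                .orTimeout(MODEL_BUDGET_MS, TimeUnit.MILLISECONDS)
                // Graceful degradation: a worse answer on time beats a better
                // answer after the budget is spent.
                .exceptionally(err -> rankByPopularity(query));
    }

    private CompletableFuture<String> rankWithModel(String query) {
        // Stand-in for the async call to the model server (e.g. Triton).
        return CompletableFuture.supplyAsync(() -> "model-ranked:" + query);
    }

    private String rankByPopularity(String query) {
        return "popularity-ranked:" + query; // deterministic, model-free fallback
    }
}
```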
Infrastructure & Scale Context :
The systems you will work on or depend on :
Compute :
- AWS EKS (Kubernetes): 500+ pods, autoscaling, spot instance optimisation
Cache :
- Multi-tier: L1 in-process (Caffeine), L2 Redis Cluster, L3 CDN edge cache; 10M+ keys/sec
Streaming :
- Apache Kafka: 100+ topics, 5M+ events/sec, consumer lag SLOs < 500ms
Storage :
- Polyglot: DynamoDB (booking state), Aerospike (fare cache), Elasticsearch (search), Aurora (OLTP)
Observability :
- OpenTelemetry traces to Jaeger; Prometheus metrics to Grafana; structured logs to Loki; SLO dashboards
ML Serving :
- Real-time: TorchServe / Triton on GPU nodes, p99 < 20ms
- Batch: Spark on EMR
Feature Store :
- Feast: 300+ features, online (Redis) + offline (S3/Hive), sub-10ms online reads
APIs :
- REST + gRPC; 200+ internal services; 50M+ external API calls/day; contract testing via Pact
Security :
- OAuth2/JWT, Vault for secrets, AWS IAM, zero-trust internal service mesh
Who You Are :
- 7 to 12 years in backend engineering; you have owned at least one Tier-1 production system from design through sustained operation
- Deep knowledge of distributed systems: you can reason about consistency, partition tolerance, and failure modes from first principles
- Production experience with Kafka, Redis, gRPC, and cloud-native infrastructure: you know what breaks and why
- ML integration experience: you have built or maintained a real-time ML serving integration in a production API path
- Strong communicator: your design docs are read, your code comments explain intent, your incident post-mortems are actioned
- Tier-I institute preferred (IIT / IIIT / NIT / IISc / BITS; CSE / ISE)
Posted in: Backend Development
Functional Area: ML / DL Engineering
Job Code: 1627608