hirist

Staff Software Engineer - Distributed Systems

Recruiting Bond
7 - 12 Years
Bangalore

Posted on: 10/04/2026

Job Description

The Role :

Staff Engineers own systems, not features.

- You are the person the team comes to when the problem is genuinely hard, the architecture has a structural flaw, or the incident has gone on too long.

- You do not wait to be assigned the hard problem: you identify it, scope it, design the solution, and drive the implementation.

- You own the technical delivery for a major platform subsystem, from design through production to monitoring.

- Your design documents pre-empt the questions engineers will ask six months later.

- You define the SLOs for your systems, build the dashboards, and own the on-call escalation path.

- Senior engineers grow faster because of your code reviews: you teach, you don't just correct.

The Platform You're Joining :

- You are not joining a travel app.

- You are joining the engineering team building India's most intelligent travel commerce operating system: a billion-dollar marketplace that connects 100M+ travellers to flights, hotels, buses, trains, and corporate travel every year.

Core Responsibilities :

System Design & Ownership :

- Own end-to-end design and delivery of complex distributed system components: from API contract and data model through implementation, testing, deployment, and monitoring

- Lead system design reviews for your scope: your review catches inconsistent failure modes, missing idempotency guards, and hidden latency bottlenecks before they reach production

- Define the operational model for your systems: runbooks, escalation paths, SLO thresholds, capacity planning estimates, and DR playbooks

- Drive performance engineering: baseline benchmarking, load testing (k6/Gatling), bottleneck analysis (async-profiler, JFR), and optimisation implementation

Distributed Systems Delivery :

- Build fault-tolerant microservices with proper circuit breakers, bulkhead isolation, retry with exponential backoff, and timeout budgets aligned to the upstream SLA
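The retry-with-backoff-inside-a-timeout-budget pattern above can be sketched as follows. This is a minimal illustration with hypothetical names (`RetryWithBudget`, `callWithBudget`); production services would typically use a library such as Resilience4j rather than hand-rolled loops.

```java
import java.util.concurrent.Callable;

public class RetryWithBudget {
    // Retries `call` with exponential backoff (base 100ms, doubling each
    // attempt), but never sleeps past the overall timeout budget -- the
    // budget is what keeps this service inside its upstream SLA.
    public static <T> T callWithBudget(Callable<T> call, int maxAttempts, long budgetMillis)
            throws Exception {
        long deadline = System.currentTimeMillis() + budgetMillis;
        long backoff = 100;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                // Give up early if the next backoff would blow the budget.
                if (System.currentTimeMillis() + backoff >= deadline) break;
                Thread.sleep(backoff);
                backoff *= 2; // exponential backoff
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Fails twice with a transient error, then succeeds within budget.
        String result = callWithBudget(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient");
            return "ok";
        }, 5, 2_000);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a real service the backoff would also be jittered so that synchronized retries from many clients do not arrive in waves.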

- Design and implement event-driven workflows on Kafka: consumer group isolation, offset management, exactly-once processing, and dead-letter queue handling

- Implement multi-tier caching: define what goes in L1 (Caffeine), L2 (Redis), and what must always be fetched live; build cache warming and invalidation pipelines
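The L1/L2 lookup-and-warm flow described above can be sketched with in-memory maps standing in for Caffeine and Redis. All names here are illustrative, not from any real codebase.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class TieredCache {
    private final Map<String, String> l1 = new ConcurrentHashMap<>(); // in-process (Caffeine in production)
    private final Map<String, String> l2 = new ConcurrentHashMap<>(); // stands in for Redis

    // Check L1, then L2 (warming L1 on a hit), then fetch live and
    // write through both tiers.
    public String get(String key, Function<String, String> liveFetch) {
        String v = l1.get(key);
        if (v != null) return v;                      // L1 hit
        v = l2.get(key);
        if (v != null) { l1.put(key, v); return v; }  // L2 hit: warm L1
        v = liveFetch.apply(key);                     // must be fetched live
        l2.put(key, v);                               // write-through
        l1.put(key, v);
        return v;
    }

    // Entry point for event-driven invalidation (e.g. a Kafka consumer
    // calling this when a fare-change event arrives).
    public void invalidate(String key) {
        l1.remove(key);
        l2.remove(key);
    }
}
```

The real design questions are in the policy, not the mechanics: what is allowed into each tier, per-tier TTLs, and who is permitted to publish invalidation events.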

- Build idempotent APIs: define idempotency key strategies, implement deduplication stores, handle concurrent duplicate request scenarios gracefully
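The idempotency-key pattern in the last bullet can be reduced to a small sketch: a deduplication store keyed by the client-supplied key, where concurrent duplicates observe the first caller's result instead of re-running the side effect. Names are hypothetical; a production store would be external (e.g. DynamoDB or Redis) with a TTL, not an in-process map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class IdempotentHandler {
    // Dedup store: idempotency key -> result of the first execution.
    private final Map<String, String> completed = new ConcurrentHashMap<>();

    // Executes `operation` at most once per key. ConcurrentHashMap's
    // computeIfAbsent is atomic, so a concurrent duplicate request blocks
    // until the first execution finishes and then receives its result.
    public String execute(String idempotencyKey, Supplier<String> operation) {
        return completed.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```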

AI/ML Integration :

- Build the serving infrastructure for ML models in your domain: real-time prediction API endpoints, async batch scoring pipelines, model warm-up, and latency monitoring

- Implement real-time feature computation pipelines feeding the Feast feature store: Kafka Streams processing, feature transformation, and online store writes

- Integrate RAG retrieval APIs into product flows: vector store queries, document retrieval, context assembly, and response validation pipelines

- Build the infrastructure for agentic workflows: Temporal activity implementations for booking/modification/cancellation tool-use APIs

- Implement Voice AI backend: ASR webhook receivers, intent parsing API, session context management for multi-turn spoken booking flows

Platform Work Across All Six Verticals :

Each vertical pairs an engineering problem with what you'll own and build. For example:

Flights :

- Engineering problem : fare cache staleness under airline API instability

- What you'll own & build : the fare caching layer, including TTL strategies, event-driven invalidation, stampede prevention, and fallback heuristics when live pricing is unavailable

Hotels :

- Engineering problem : real-time availability sync for 500K+ properties

- What you'll own & build : the availability sync pipeline, including webhook ingestion, change-data-capture from supplier APIs, cache warming, and consistency reconciliation at checkout
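The "consistency reconciliation at checkout" step can be sketched as a diff between the cached availability snapshot and the supplier's live answer; mismatched properties are re-quoted before the booking is confirmed. The names here are illustrative only.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AvailabilityReconciler {
    // Returns the property IDs whose cached room count disagrees with the
    // supplier's live response -- candidates for a re-quote at checkout.
    public static Set<String> reconcile(Map<String, Integer> cached, Map<String, Integer> live) {
        Set<String> stale = new HashSet<>();
        for (Map.Entry<String, Integer> e : live.entrySet()) {
            Integer cachedCount = cached.get(e.getKey());
            if (!e.getValue().equals(cachedCount)) stale.add(e.getKey());
        }
        return stale;
    }
}
```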

The Hard Engineering Problems You'll Face :

- Across all six platforms, the engineering challenges are real, non-trivial, and consequential:

Cache Invalidation at Speed :

- Fare data has a 30-second freshness window.

- A stale cache hit in the booking flow means a pricing error, a failed checkout, or a lost trust signal.

- Multi-tier cache design (L1/L2/L3), TTL strategies, event-driven invalidation via Kafka, and cache stampede prevention are all live problems.
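Stampede prevention is commonly done with single-flight loading: on a miss, only the first caller for a key fetches from the origin, and concurrent callers for the same key await that one in-flight result instead of piling onto the supplier API. A minimal sketch, with hypothetical names:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SingleFlightLoader {
    // One in-flight future per key; computeIfAbsent guarantees only the
    // first caller creates it.
    private final Map<String, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();

    public String load(String key, Function<String, String> loader) {
        CompletableFuture<String> f = inFlight.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(() -> loader.apply(k)));
        try {
            return f.join(); // all concurrent callers share this result
        } finally {
            inFlight.remove(key, f); // allow a fresh load once this one completes
        }
    }
}
```

In production the same idea usually lives behind the cache (Caffeine's loading cache does this per key in-process; a Redis lock or lease extends it across instances).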

AI-First Engineering Mandate :

- Platform Engineers at every level are responsible for building systems that AI and ML can run on, and increasingly, systems that are AI themselves.

- ML Serving Infrastructure: your APIs must serve model predictions at p99 < 20ms with graceful fallbacks; you design the latency budget allocation

- Feature Pipeline Engineering: real-time feature computation (Kafka Streams, Flink) feeding the feature store at sub-second freshness

- RAG Backend Systems: vector store integration, embedding generation pipelines, document chunking and indexing for knowledge retrieval

- Agentic Workflow Infrastructure: durable execution systems (Temporal) for multi-step LLM agent workflows with retry and compensation logic

- Voice AI Backend: ASR request routing, low-latency TTS pipelines, spoken intent API design for conversational booking flows

- Recommendation API Design: serving infrastructure for collaborative filtering, session-based models, and personalised ranking endpoints

- Price Intelligence Pipelines: real-time competitive price ingestion, fare change event streaming, lower-price guarantee trigger systems

- A/B Experiment Infrastructure: feature flags, traffic splitting, metric collection, and experiment configuration systems

- MCP Tool Orchestration: building the tool-use APIs that LLM agents call to execute booking, modify, and cancel operations safely
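The "graceful fallback" requirement in the ML serving bullet above can be sketched as a prediction call bounded by its latency budget: serve the model's answer if it lands in time, otherwise degrade to a precomputed default. All names are illustrative, and a production path would use pre-warmed async clients rather than a throwaway executor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class PredictionWithFallback {
    // Runs the model call with a hard deadline; on timeout or error,
    // returns the fallback instead of failing the request.
    public static String predict(Supplier<String> model, String fallback, long budgetMillis) {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            Future<String> f = ex.submit(model::get);
            return f.get(budgetMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            return fallback; // timeout or model error: degrade gracefully
        } finally {
            ex.shutdownNow();
        }
    }
}
```

The interesting engineering is in choosing the fallback (a popularity-ranked default, a cached prior prediction) so that degraded responses remain useful.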

Infrastructure & Scale Context :

The systems you will work on or depend on :

Compute :

- AWS EKS (Kubernetes) : 500+ pods, autoscaling, spot instance optimisation

Cache :

- Multi-tier : L1 in-process (Caffeine), L2 Redis Cluster, L3 CDN edge cache; 10M+ keys/sec

Streaming :

- Apache Kafka : 100+ topics, 5M+ events/sec, consumer lag SLOs < 500ms

Storage :

- Polyglot : DynamoDB (booking state), Aerospike (fare cache), Elasticsearch (search), Aurora (OLTP)

Observability :

- OpenTelemetry traces → Jaeger; Prometheus metrics → Grafana; structured logs → Loki; SLO dashboards

ML Serving :

- Real-time : TorchServe / Triton on GPU nodes, p99 < 20ms

- Batch : Spark on EMR

Feature Store :

- Feast : 300+ features, online (Redis) + offline (S3/Hive), sub-10ms online reads

APIs :

- REST + gRPC; 200+ internal services; 50M+ external API calls/day; contract testing via Pact

Security :

- OAuth2/JWT, Vault for secrets, AWS IAM, zero-trust internal service mesh

Who You Are :

- 7 to 12 years in backend engineering; have owned at least one Tier-1 production system from design through sustained operation

- Deep knowledge of distributed systems: you can reason about consistency, partition tolerance, and failure modes from first principles

- Production experience with Kafka, Redis, gRPC, and cloud-native infrastructure: you know what breaks and why

- ML integration experience: you have built or maintained a real-time ML serving integration in a production API path

- Strong communicator: your design docs are read, your code comments explain intent, your incident post-mortems are actioned

- Tier-I institute preferred (IIT / IIIT / NIT / IISc / BITS; CSE / ISE)

Technology Stack :


Backend : Java, Kotlin, Go, Spring Boot, Ktor

Streaming : Apache Kafka, Kafka Streams, Flink (awareness), Avro/Protobuf

Cache/Storage : Redis Cluster, DynamoDB, Aurora MySQL, Aerospike, Elasticsearch

ML Infra : TorchServe/Triton (integration), Feast (online reads), Temporal, FastAPI

Cloud / Infra : AWS EKS, Terraform, Helm, ArgoCD, AWS RDS/DynamoDB

Observability : OpenTelemetry, Prometheus, Grafana, Jaeger, PagerDuty

