Posted on: 10/04/2026
Description :
- Principal Software Engineer, Platform Engineering
- 10 to 15 years
- Architecture at Scale
- Distributed Systems
- Cross-Platform Technical Strategy
- AI Infrastructure
The Mandate :
- A Principal Engineer's decisions are measured in years, not sprints.
- The architecture you design today will carry millions of bookings in 2028.
- The platform primitives you define will be used by 50 engineers who haven't joined yet.
- The reliability patterns you establish will determine whether the system survives its worst-ever traffic day.
- You are the technical north star for an engineering organisation building national-scale travel infrastructure.
- Your RFC becomes the standard.
- Your code review raises the team.
- Your architecture decision holds for 3 to 5 years.
- You operate across all six platforms, identifying shared patterns, preventing duplicated complexity, and defining the abstractions that make the whole org faster.
- You are expected to code, to lead, to mentor, and to be right more often than anyone else while remaining open to being wrong.
- Every role in this document works across all six verticals.
- You own the intelligence layer, not a single product.
- The problems are distinct per platform but the data, infrastructure, and AI systems are shared.
- You will build models and systems used by Flights AND Hotels AND Bus AND Train AND B2B AND Core simultaneously.
- This is what separates this opportunity from a single-vertical role: cross-domain ML and platform leverage at national scale.
What You Will Own :
- Architectural Strategy
- Define and evolve the distributed systems architecture across all six platforms, establishing shared patterns for event-driven workflows, cache strategies, API contracts, and service communication
- Own cross-cutting architectural concerns: data consistency protocols, distributed transaction patterns (Saga, TCC, outbox), inter-service contract versioning
- Drive technology selection at org level: evaluate trade-offs between competing databases, messaging systems, cache layers, and ML serving infrastructure with long-term operational cost in mind
- Author RFCs and ADRs that define platform-wide standards; lead design reviews for all Tier-1 system changes
- Identify and articulate structural technical debt: quantify the risk, design the migration path, align stakeholders, and oversee execution
- Distributed Systems Excellence
- Design the booking consistency protocol: exactly-once semantics across payment gateway, inventory reservation, and airline/hotel confirmation using idempotency keys and the transactional outbox pattern
- Architect the multi-tier caching strategy: L1 in-process (Caffeine) → L2 Redis cluster → L3 CDN; define TTL policies per data type, invalidation event contracts, and stampede prevention
- Design the Kafka topology: topic partitioning strategy, consumer group isolation, schema evolution with Avro/Protobuf, compaction policies, and dead-letter queue patterns
- Build the observability architecture: distributed tracing (OpenTelemetry), structured logging standards, SLO/SLI definitions, error budget policies, and alerting philosophy
- Define the reliability engineering standards: circuit breaker configurations, bulkhead patterns, chaos engineering schedule, and SLA/SLO framework for external integrations
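A minimal Go sketch of the idempotency-key plus transactional-outbox combination described above. The in-memory `Store` stands in for a database where the booking row and the outbox row commit atomically, and `Drain` plays the relay that ships events to the broker; all names are illustrative, not the actual system.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a domain event staged in the outbox inside the same
// "transaction" that mutates booking state.
type Event struct {
	IdempotencyKey string
	Payload        string
}

// Store simulates one database holding both booking state and the
// outbox table, so the two writes commit or fail together.
type Store struct {
	mu       sync.Mutex
	bookings map[string]string // bookingID -> status
	outbox   []Event
	seen     map[string]bool // idempotency keys already processed
}

func NewStore() *Store {
	return &Store{bookings: map[string]string{}, seen: map[string]bool{}}
}

// ConfirmBooking is idempotent: a retried request with the same key is
// a no-op, so downstream side effects fire exactly once.
func (s *Store) ConfirmBooking(bookingID, idemKey string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[idemKey] {
		return false // duplicate request: nothing staged
	}
	s.seen[idemKey] = true
	s.bookings[bookingID] = "CONFIRMED"
	s.outbox = append(s.outbox, Event{idemKey, "booking.confirmed:" + bookingID})
	return true
}

// Drain plays the relay: it ships outbox rows and clears them.
func (s *Store) Drain() []Event {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := s.outbox
	s.outbox = nil
	return out
}

func main() {
	s := NewStore()
	s.ConfirmBooking("B1", "key-1")
	s.ConfirmBooking("B1", "key-1") // retry: deduplicated
	fmt.Println(len(s.Drain()))     // 1
}
```

Because the state change and the event are staged in one atomic step, a crash between them cannot leave a confirmed booking with no published event.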
- AI/ML Infrastructure Architecture
- Design the ML serving architecture: model registry, A/B serving (shadow + canary), request batching, GPU autoscaling, and fallback heuristics for model unavailability
- Architect the real-time feature pipeline: streaming feature computation (Flink/Kafka Streams), feature store integration (Feast), and latency budget allocation for model inference
- Define the RAG infrastructure standard: vector store selection, embedding pipeline design, document chunking strategy, and retrieval quality evaluation framework
- Design the agentic workflow execution infrastructure: durable execution patterns (Temporal), tool-use API contracts, safety guardrails, and rollback mechanisms
- Architect the Voice AI backend: ASR request routing, TTS pipeline, intent-to-booking-action mapping, and conversational session state management
- Platform Work Across All Six Verticals (for each platform: the engineering problem, then what you'll own & build)
- Flights
- Problem: Sub-100ms distributed search at 500M queries/day
- Own & build: Architect the multi-stage query pipeline: fare retrieval → pricing → ranking → serialisation; define the caching strategy for airline inventory data
- Hotels
- Problem: Multi-supplier price consistency at booking time
- Own & build: Design the rate aggregation architecture; define the event-driven invalidation model; own the consistency protocol between display price and booking price
- Bus
- Problem: Reliability on top of 5,000+ operator chaos
- Own & build: Architect the operator abstraction layer; define the health-scoring model for automatic fallback routing; design the booking state machine for async confirmation flows
- Train
- Problem: Concurrent seat allocation for Tatkal demand spikes
- Own & build: Design the distributed locking model; define the idempotency protocol for IRCTC integration; architect the queue-based fairness system for high-demand booking windows
- B2B
- Problem: Configurable policy engine for 10,000+ enterprise clients
- Own & build: Architect the multi-tenant policy evaluation system; define the rule engine abstraction; design the audit-immutable event log for compliance
- Core
- Problem: Platform primitives adopted by 200+ services
- Own & build: Design the service framework SDK; define the event schema contracts; architect the observability pipeline from OpenTelemetry to Grafana SLO dashboards
The Hard Engineering Problems You'll Face :
- Across all six platforms, the engineering challenges are real, non-trivial, and consequential :
- Cache Invalidation at Speed
- Fare data has a 30-second freshness window.
- A stale cache hit in the booking flow means a pricing error, a failed checkout, or a lost trust signal.
- Multi-tier cache design (L1/L2/L3), TTL strategies, event-driven invalidation via Kafka, and cache stampede prevention are all live problems.
- Distributed Concurrency
- Train Tatkal opening: millions of concurrent writes for 72 berths per coach.
- Optimistic locking, distributed lease management, queue-based fairness, and atomic seat allocation without deadlock under pathological load.
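Atomic seat allocation without deadlock can be sketched with compare-and-swap. This toy Go model (berth layout and passenger IDs are illustrative, not the real allocation system) shows 1,000 racing writers producing exactly one winner per berth, with no lock ever held:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Coach models berths as an array of CAS-able slots: 0 = free, any
// other value = the winning passenger ID.
type Coach struct {
	berths []int64
}

// Claim succeeds for exactly one writer: the CAS from 0 to the
// passenger ID can only win once.
func (c *Coach) Claim(berth int, passenger int64) bool {
	return atomic.CompareAndSwapInt64(&c.berths[berth], 0, passenger)
}

func main() {
	c := &Coach{berths: make([]int64, 72)} // 72 berths per coach
	var winners int64
	var wg sync.WaitGroup
	for p := int64(1); p <= 1000; p++ { // 1000 passengers race for berth 0
		wg.Add(1)
		go func(p int64) {
			defer wg.Done()
			if c.Claim(0, p) {
				atomic.AddInt64(&winners, 1)
			}
		}(p)
	}
	wg.Wait()
	fmt.Println(winners) // 1
}
```

The real system needs distributed CAS (a conditional write in the datastore) plus lease expiry for abandoned holds, but the invariant is the same: one winner, no deadlock.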
- Event Ordering Guarantees
- A booking event must arrive before its payment event.
- But Kafka doesn't guarantee cross-partition ordering.
- Building booking state machines with idempotency, deduplication, and out-of-order event tolerance is a continuous engineering challenge.
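One way to tolerate out-of-order delivery is to park events that arrive too early and replay them when the state catches up. A minimal Go sketch (event names and the two-state transition table are illustrative):

```go
package main

import "fmt"

type event struct{ id, typ string }

// Machine is a booking state machine tolerant of out-of-order and
// duplicate delivery: Kafka only orders within a partition, so a
// payment event can land before the booking-created event it needs.
type Machine struct {
	state   string
	seen    map[string]bool // event IDs already applied (deduplication)
	pending []event         // events that arrived too early
}

func NewMachine() *Machine {
	return &Machine{state: "NEW", seen: map[string]bool{}}
}

// transitions: current state -> event type -> next state.
var transitions = map[string]map[string]string{
	"NEW":             {"booking.created": "PENDING_PAYMENT"},
	"PENDING_PAYMENT": {"payment.captured": "CONFIRMED"},
}

func (m *Machine) Apply(e event) {
	if m.seen[e.id] {
		return // duplicate delivery: drop it
	}
	next, ok := transitions[m.state][e.typ]
	if !ok {
		m.pending = append(m.pending, e) // too early: park it
		return
	}
	m.seen[e.id] = true
	m.state = next
	parked := m.pending // a parked event may now be legal: replay
	m.pending = nil
	for _, p := range parked {
		m.Apply(p)
	}
}

func main() {
	m := NewMachine()
	m.Apply(event{"e2", "payment.captured"}) // arrives first: parked
	m.Apply(event{"e1", "booking.created"})  // unblocks the parked payment
	m.Apply(event{"e2", "payment.captured"}) // redelivery: deduplicated
	fmt.Println(m.state) // CONFIRMED
}
```

Deduplication plus replay means the machine converges to the same state regardless of delivery order or retry count.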
- Latency Budgets
- Search must return in <100ms end to end; each pipeline stage gets a slice of that budget.
- Blowing any stage's budget breaks the SLA.
- External API Unreliability
- IRCTC, GDS providers, and hotel APIs are not your SREs' best friends.
- Circuit breakers, bulkheads, adaptive retries, fallback strategies, and health-scoring for external dependencies are required, not optional.
- Observability Gaps
- 200+ services.
- A booking fails.
- The trace crosses 8 service boundaries.
- Without distributed tracing (OpenTelemetry → Jaeger), structured logging (correlation IDs, trace context propagation), and SLO dashboards (Prometheus + Grafana), debugging is archaeology, not engineering.
- Multi-Tenancy Blast Radius
- A B2B enterprise client's policy engine change must not affect the B2C booking flow.
- Multi-tenant isolation in shared infrastructure (API gateways, Kafka topics, DB schemas, cache namespaces) must be designed from day one.
- AI Model Integration
- Serving a ranking model in the search critical path at p99 <20ms requires GPU node management, model warmup, request batching, async inference patterns, and fallback to heuristic ranking when the model is unavailable.
AI-First Engineering Mandate :
- Platform Engineers at every level are responsible for building systems that AI and ML can run on, and, increasingly, systems that are AI themselves.
- ML Serving Infrastructure: your APIs must serve model predictions at p99 <20ms with graceful fallbacks; you design the latency budget allocation
- Feature Pipeline Engineering: real-time feature computation (Kafka Streams, Flink) feeding the feature store at sub-second freshness
- RAG Backend Systems: vector store integration, embedding generation pipelines, document chunking and indexing for knowledge retrieval
- Agentic Workflow Infrastructure: durable execution systems (Temporal) for multi-step LLM agent workflows with retry and compensation logic
- Voice AI Backend: ASR request routing, low-latency TTS pipelines, spoken intent API design for conversational booking flows
- Recommendation API Design: serving infrastructure for collaborative filtering, session-based models, and personalised ranking endpoints
- Price Intelligence Pipelines: real-time competitive price ingestion, fare change event streaming, lower-price guarantee trigger systems
- A/B Experiment Infrastructure: feature flags, traffic splitting, metric collection, and experiment configuration systems
- MCP Tool Orchestration: building the tool-use APIs that LLM agents call to safely execute booking, modification, and cancellation operations
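Deterministic traffic splitting, the core of the A/B experiment item above, needs no stored state: hash the user and experiment into a bucket. A Go sketch using an FNV hash (the 50% split and names are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bucket maps (userID, experiment) deterministically into [0, 100):
// the same user always lands in the same bucket with no state stored,
// and different experiments hash independently.
func Bucket(userID, experiment string) int {
	h := fnv.New32a()
	h.Write([]byte(userID + ":" + experiment))
	return int(h.Sum32() % 100)
}

// Variant assigns treatment to the first treatmentPct buckets, so the
// rollout percentage can be dialed up without reshuffling users.
func Variant(userID, experiment string, treatmentPct int) string {
	if Bucket(userID, experiment) < treatmentPct {
		return "treatment"
	}
	return "control"
}

func main() {
	// Assignment is stable across calls and across processes.
	fmt.Println(Variant("user-42", "ranking-v2", 50) ==
		Variant("user-42", "ranking-v2", 50)) // true
}
```

Because assignment is a pure function, every service in the request path computes the same variant without coordination; only the experiment config (name, percentage) needs distribution.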
Infrastructure & Scale Context :
- The systems you will work on or depend on :
- Compute
- AWS EKS (Kubernetes): 500+ pods, autoscaling, spot instance optimisation
- Cache
- Multi-tier: L1 in-process (Caffeine) → L2 Redis Cluster → L3 CDN edge cache; 10M+ keys/sec
- Streaming
- Apache Kafka
- Storage
- Polyglot : DynamoDB (booking state)
- Aerospike (fare cache)
- Elasticsearch (search)
- Aurora (OLTP)
- Observability
- OpenTelemetry traces → Jaeger; Prometheus metrics → Grafana; structured logs → Loki; SLO dashboards
- ML Serving
- Real-time: TorchServe / Triton on GPU nodes, p99 < 20ms.
- Batch: Spark on EMR
- Feature Store
- Feast: 300+ features, online (Redis) + offline (S3/Hive), sub-10ms online reads
- APIs
- REST + gRPC; 200+ internal services; 50M+ external API calls/day; contract testing via Pact
- Security
- OAuth2/JWT, Vault for secrets, AWS IAM, zero-trust internal service mesh
Technical Depth Required :
- Distributed systems mastery: CAP/PACELC, consensus (Raft/Paxos awareness), eventual consistency, CRDTs, vector clocks, distributed sagas
- High-throughput API design: backpressure, rate limiting, request hedging, tail latency optimisation, adaptive timeout patterns
- JVM performance: GC tuning, heap sizing, thread pool configuration, profiling (async-profiler, JFR), off-heap optimisation
- Kafka internals: partition leadership, consumer group rebalancing, ISR management, log compaction, exactly-once semantics
- Cache system design: Redis cluster topology, Lua scripting for atomic operations, pub/sub for invalidation, keyspace notifications
- Observability engineering: trace sampling strategies, metric cardinality management, log aggregation at scale, SLO mathematics
- ML serving: GPU scheduling, model parallelism, KV cache for LLM inference, quantisation trade-offs, serving latency profiling
- Security: zero-trust service mesh, mTLS, JWT validation patterns, secrets rotation, OWASP API security at scale
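The "SLO mathematics" above includes error-budget arithmetic; a one-function Go example of the standard calculation:

```go
package main

import "fmt"

// ErrorBudgetMinutes: an availability SLO over a rolling window leaves
// (1 - slo) * window of allowed downtime. For 99.9% over 30 days:
// (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes before the budget is spent.
func ErrorBudgetMinutes(slo float64, windowDays int) float64 {
	return (1 - slo) * float64(windowDays) * 24 * 60
}

func main() {
	fmt.Printf("%.1f\n", ErrorBudgetMinutes(0.999, 30)) // 43.2
}
```

Error budget policies then key off burn rate: how fast current downtime is consuming those 43.2 minutes determines whether to page, throttle releases, or do nothing.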
Who You Are :
- 10 to 15 years in backend/distributed systems engineering; have designed and delivered tier-1 systems handling 100M+ users
- Authored RFCs, ADRs, or design documents that became organisation-wide standards, not just proposals that gathered dust
- Deep expertise in at least three of: distributed systems, event streaming, ML infrastructure, high-throughput APIs, cache architecture, observability engineering
- You code: your PRs are merged, reviewed, and used; you are not purely advisory
- Track record of identifying systemic problems before they become incidents and fixing them without breaking production
- Tier-I institute strongly preferred (IIT / IIIT / NIT / IISc / BITS; CSE / ISE / ECE branches)
Technology Stack :
- Backend: Java, Kotlin, Go, gRPC, REST, GraphQL (awareness)
- Architecture: Microservices, Event-Driven, CQRS/ES, Saga pattern, Outbox pattern
- Streaming: Apache Kafka, Apache Flink, Kafka Streams, Schema Registry (Avro/Protobuf)
- Cache: Redis Cluster, Aerospike, Caffeine (in-process), CDN edge caching
- Storage: DynamoDB, Aurora MySQL/PostgreSQL, Elasticsearch, S3, Delta Lake
- ML Infra: Triton Inference Server, TorchServe, Feast, vLLM, Temporal
- Cloud / Infra: AWS (EKS, EC2, RDS, SQS, SNS, Lambda), Terraform, Helm, ArgoCD
- Observability: OpenTelemetry, Prometheus, Grafana, Jaeger, Loki, PagerDuty
Posted in : DevOps / SRE
Functional Area : DevOps / Cloud
Job Code : 1627613