hirist

Principal Software Engineer - Cluster Management

Recruiting Bond
10 - 15 Years
Bangalore

Posted on: 10/04/2026

Job Description

- Role : Principal Software Engineer - Platform Engineering

- Experience : 10 to 15 years

- Focus areas : Architecture at Scale, Distributed Systems, Cross-Platform Technical Strategy, AI Infrastructure

The Mandate :

- A Principal Engineer's decisions are measured in years, not sprints.

- The architecture you design today will carry millions of bookings in 2028.

- The platform primitives you define will be used by 50 engineers who haven't joined yet.

- The reliability patterns you establish will determine whether the system survives its worst-ever traffic day.

- You are the technical north star for an engineering organisation building national-scale travel infrastructure.

- Your RFC becomes the standard.

- Your code review raises the team.

- Your architecture decision holds for 3-5 years.

- You operate across all six platforms, identifying shared patterns, preventing duplicated complexity, and defining the abstractions that make the whole org faster.

- You are expected to code, to lead, to mentor, and to be right more often than anyone else while remaining open to being wrong.

- Every role in this document works across all six verticals.

- You own the intelligence layer, not a single product.

- The problems are distinct per platform but the data, infrastructure, and AI systems are shared.

- You will build models and systems used by Flights AND Hotels AND Bus AND Train AND B2B AND Core simultaneously.

- This is what separates this opportunity from a single-vertical role: cross-domain ML and platform leverage at national scale.

What You Will Own :

- Architectural Strategy

- Define and evolve the distributed systems architecture across all six platforms, establishing shared patterns for event-driven workflows, cache strategies, API contracts, and service communication

- Own cross-cutting architectural concerns: data consistency protocols, distributed transaction patterns (Saga, TCC, outbox), inter-service contract versioning

- Drive technology selection at org level: evaluate trade-offs between competing databases, messaging systems, cache layers, and ML serving infrastructure with long-term operational cost in mind

- Author RFCs and ADRs that define platform-wide standards; lead design reviews for all Tier-1 system changes

- Identify and articulate structural technical debt: quantify the risk, design the migration path, align stakeholders, and oversee execution

- Distributed Systems Excellence

- Design the booking consistency protocol: exactly-once semantics across payment gateway, inventory reservation, and airline/hotel confirmation using idempotency keys and the transactional outbox pattern

- Architect the multi-tier caching strategy: L1 in-process (Caffeine), L2 Redis cluster, L3 CDN; define TTL policies per data type, invalidation event contracts, and stampede prevention

- Design the Kafka topology: topic partitioning strategy, consumer group isolation, schema evolution with Avro/Protobuf, compaction policies, and dead-letter queue patterns

- Build the observability architecture: distributed tracing (OpenTelemetry), structured logging standards, SLO/SLI definitions, error budget policies, and alerting philosophy

- Define the reliability engineering standards: circuit breaker configurations, bulkhead patterns, chaos engineering schedule, and SLA/SLO framework for external integrations
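
The transactional outbox pattern named above can be shown in miniature. This is an illustrative in-memory sketch, not the production design: the `OutboxSketch` class, its maps, and the relay method are assumptions standing in for real database tables, a relay process, and Kafka.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of the transactional outbox pattern: the booking
// row and its event are written in the same atomic unit, and a relay later
// publishes pending events to the broker.
public class OutboxSketch {
    record OutboxEvent(String eventId, String payload) {}

    final Map<String, String> bookings = new HashMap<>(); // stand-in for a DB table
    final List<OutboxEvent> outbox = new ArrayList<>();   // stand-in for an outbox table
    final List<String> broker = new ArrayList<>();        // stand-in for Kafka

    // One atomic unit: booking state and event row commit together, so an
    // event is never lost even if the relay crashes before publishing.
    public synchronized void createBooking(String bookingId, String payload) {
        bookings.put(bookingId, "CONFIRMED");
        outbox.add(new OutboxEvent(bookingId, payload));
    }

    // Relay: publish pending events, then mark them done. Re-running after a
    // crash may publish duplicates, which is why consumers must be idempotent.
    public synchronized int publishPending() {
        int published = 0;
        for (OutboxEvent e : outbox) {
            broker.add(e.payload());
            published++;
        }
        outbox.clear();
        return published;
    }
}
```

The key property is that there is no window where the booking is committed but its event can be lost; at-least-once delivery from the relay is the trade-off.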

- AI/ML Infrastructure Architecture

- Design the ML serving architecture: model registry, A/B serving (shadow + canary), request batching, GPU autoscaling, and fallback heuristics for model unavailability

- Architect the real-time feature pipeline: streaming feature computation (Flink/Kafka Streams), feature store integration (Feast), and latency budget allocation for model inference

- Define the RAG infrastructure standard: vector store selection, embedding pipeline design, document chunking strategy, and retrieval quality evaluation framework

- Design the agentic workflow execution infrastructure: durable execution patterns (Temporal), tool-use API contracts, safety guardrails, and rollback mechanisms

- Architect the Voice AI backend: ASR request routing, TTS pipeline, intent-to-booking-action mapping, and conversational session state management

- Platform Work Across All Six Verticals

Flights :

- Engineering problem : Sub-100ms distributed search at 500M queries/day

- What you'll own & build : Architect the multi-stage query pipeline: fare retrieval → pricing → ranking → serialisation; define the caching strategy for airline inventory data

Hotels :

- Engineering problem : Multi-supplier price consistency at booking time

- What you'll own & build : Design the rate aggregation architecture; define the event-driven invalidation model; own the consistency protocol between display price and booking price

Bus :

- Engineering problem : Reliability on top of 5,000+ operator chaos

- What you'll own & build : Architect the operator abstraction layer; define the health-scoring model for automatic fallback routing; design the booking state machine for async confirmation flows

Train :

- Engineering problem : Concurrent seat allocation for Tatkal demand spikes

- What you'll own & build : Design the distributed locking model; define the idempotency protocol for IRCTC integration; architect the queue-based fairness system for high-demand booking windows

B2B :

- Engineering problem : Configurable policy engine for 10,000+ enterprise clients

- What you'll own & build : Architect the multi-tenant policy evaluation system; define the rule engine abstraction; design the audit-immutable event log for compliance

Core :

- Engineering problem : Platform primitives adopted by 200+ services

- What you'll own & build : Design the service framework SDK; define the event schema contracts; architect the observability pipeline from OpenTelemetry to Grafana SLO dashboards

The Hard Engineering Problems You'll Face :

- Across all six platforms, the engineering challenges are real, non-trivial, and consequential :

- Cache Invalidation at Speed

- Fare data has a 30-second freshness window.

- A stale cache hit in the booking flow means a pricing error, a failed checkout, or a lost trust signal.

- Multi-tier cache design (L1/L2/L3), TTL strategies, event-driven invalidation via Kafka, and cache stampede prevention are all live problems.
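
One piece of this, cache stampede prevention, can be sketched with a single-flight loader: when a hot fare key expires, only the first caller recomputes it and concurrent callers share that in-flight result. The class and key names here are illustrative, not the production code.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Single-flight cache loader: concurrent misses on the same key share one
// in-flight computation instead of stampeding the origin.
public class SingleFlightCache {
    private final Map<String, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    int loads = 0; // counts origin hits, to show stampede suppression

    public String get(String key, Supplier<String> loader) {
        String cached = cache.get(key);
        if (cached != null) return cached;
        // computeIfAbsent guarantees only one future per key while in flight
        CompletableFuture<String> f = inFlight.computeIfAbsent(key, k ->
            CompletableFuture.supplyAsync(() -> {
                synchronized (this) { loads++; }
                String v = loader.get(); // expensive origin call
                cache.put(k, v);
                return v;
            }));
        try {
            return f.join();
        } finally {
            inFlight.remove(key, f); // clear only our own in-flight entry
        }
    }
}
```

In production the same idea is usually combined with TTL jitter and a soft-expiry window so the old value is served while the refresh runs.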

- Distributed Concurrency

- Train Tatkal opening: millions of concurrent writes for 72 berths per coach.

- Optimistic locking, distributed lease management, queue-based fairness, and atomic seat allocation without deadlock under pathological load.
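
The optimistic-locking piece can be sketched with a versioned compare-and-set: writers never hold a lock, losers simply retry, so deadlock is impossible by construction. This is a single-process illustration of the idea, with names chosen for this sketch; the real system would CAS against a shared store.

```java
import java.util.concurrent.atomic.AtomicReference;

// Optimistic concurrency sketch for Tatkal-style seat allocation: each writer
// reads the current state, computes the next one, and commits only if the
// version has not moved. No locks held, so no deadlock under load.
public class SeatAllocator {
    record CoachState(int version, int seatsLeft) {}

    private final AtomicReference<CoachState> state;

    public SeatAllocator(int totalSeats) {
        state = new AtomicReference<>(new CoachState(0, totalSeats));
    }

    // Returns true if a seat was won; false once the coach is sold out.
    public boolean tryAllocate() {
        while (true) {
            CoachState cur = state.get();
            if (cur.seatsLeft() == 0) return false;
            CoachState next = new CoachState(cur.version() + 1, cur.seatsLeft() - 1);
            if (state.compareAndSet(cur, next)) return true;
            // CAS failed: another writer won; re-read and retry
        }
    }

    public int seatsLeft() { return state.get().seatsLeft(); }
}
```

However many writers race, exactly `totalSeats` allocations succeed and the rest fail fast, which is the invariant the booking window needs.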

- Event Ordering Guarantees

- A booking event must arrive before its payment event.

- But Kafka doesn't guarantee cross-partition ordering.

- Building booking state machines with idempotency, deduplication, and out-of-order event tolerance is a continuous engineering challenge.
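
The shape of such a state machine can be sketched as follows: duplicates are dropped via an event-id dedup set, and a payment event that arrives before its booking event is buffered until the booking exists. Event type names and the two-state model are illustrative simplifications.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of an idempotent, out-of-order-tolerant booking consumer.
public class BookingStateMachine {
    private final Set<String> seenEventIds = new HashSet<>();
    private final Map<String, String> bookingState = new HashMap<>(); // bookingId -> state
    private final Set<String> pendingPayments = new HashSet<>();      // payments awaiting booking

    public void onEvent(String eventId, String type, String bookingId) {
        if (!seenEventIds.add(eventId)) return; // duplicate delivery: ignore
        switch (type) {
            case "BOOKING_CREATED" -> {
                bookingState.put(bookingId, "CREATED");
                if (pendingPayments.remove(bookingId)) {
                    bookingState.put(bookingId, "PAID"); // apply buffered payment
                }
            }
            case "PAYMENT_CAPTURED" -> {
                if (bookingState.containsKey(bookingId)) {
                    bookingState.put(bookingId, "PAID");
                } else {
                    pendingPayments.add(bookingId); // arrived early: buffer it
                }
            }
        }
    }

    public String stateOf(String bookingId) { return bookingState.get(bookingId); }
}
```

The same two mechanisms, dedup on event id and buffering of early arrivals, are what make cross-partition ordering a non-problem for the consumer.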

- Latency Budgets

- Search must return in <100ms end-to-end, so every pipeline stage gets a slice of that budget.

- Blowing any stage's budget breaks the SLA.
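
A common way to enforce this is deadline propagation: the entry point starts the clock and each downstream stage is handed only the remaining time, so no stage can silently blow the end-to-end SLA. The sketch below is illustrative; stage names and the 100ms figure come from the text, the class itself is an assumption.

```java
// Sketch of latency budget propagation for a multi-stage search pipeline.
public class LatencyBudget {
    private final long deadlineNanos;

    public LatencyBudget(long budgetMillis) {
        this.deadlineNanos = System.nanoTime() + budgetMillis * 1_000_000L;
    }

    // Time left before the end-to-end deadline, floored at zero.
    public long remainingMillis() {
        return Math.max(0, (deadlineNanos - System.nanoTime()) / 1_000_000L);
    }

    // A stage must fit inside what is left; otherwise the caller degrades
    // (skip re-ranking, serve cached results) instead of timing out late.
    public boolean tryStage(String name, long estimatedCostMillis) {
        return estimatedCostMillis <= remainingMillis();
    }
}
```

In practice the remaining budget is propagated over the wire (gRPC deadlines do exactly this) rather than recomputed per service.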


- External API Unreliability

- IRCTC, GDS providers, and hotel APIs are not your SREs' best friends.

- Circuit breakers, bulkheads, adaptive retries, fallback strategies, and health-scoring for external dependencies are required, not optional.
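
The circuit breaker piece can be sketched minimally: after N consecutive failures the circuit opens and calls are rejected fast until a cool-down elapses, keeping threads from piling up on a dead dependency. Thresholds are illustrative, and the half-open probe state is simplified away.

```java
// Minimal count-based circuit breaker sketch for a flaky external API.
public class CircuitBreaker {
    enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final long coolDownMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private State state = State.CLOSED;

    public CircuitBreaker(int failureThreshold, long coolDownMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
    }

    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= coolDownMillis) {
                state = State.CLOSED;          // half-open simplified to closed
                consecutiveFailures = 0;
                return true;
            }
            return false;                      // fail fast; caller uses fallback
        }
        return true;
    }

    public synchronized void recordSuccess() { consecutiveFailures = 0; }

    public synchronized void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }
}
```

Production libraries (Resilience4j and similar) add sliding windows, half-open probes, and per-dependency metrics on top of exactly this core.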

- Observability Gaps

- 200+ services.

- A booking fails.

- The trace crosses 8 service boundaries.

- Without distributed tracing (OpenTelemetry → Jaeger), structured logging (correlation IDs, trace context propagation), and SLO dashboards (Prometheus + Grafana), debugging is archaeology, not engineering.

- Multi-Tenancy Blast Radius

- A B2B enterprise client's policy engine change must not affect the B2C booking flow.

- Multi-tenant isolation in shared infrastructure (API gateways, Kafka topics, DB schemas, cache namespaces) must be designed from day one.

- AI Model Integration

- Serving a ranking model in the search critical path at p99 <20ms requires GPU node management, model warmup, request batching, async inference patterns, and fallback to heuristic ranking when the model is unavailable.
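
The fallback half of that pattern can be sketched as a bounded model call with a heuristic backstop: the model gets a hard timeout, and on timeout or error the search degrades to a cheap sort instead of blocking. Class names, the price heuristic, and the timeout are all illustrative.

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch of model serving with a heuristic fallback in the search hot path.
public class RankingWithFallback {
    record Flight(String id, int priceInr) {}

    // Daemon thread so a stuck model call can never block JVM shutdown.
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public List<Flight> rank(List<Flight> candidates,
                             Callable<List<Flight>> model,
                             long timeoutMillis) {
        Future<List<Flight>> f = pool.submit(model);
        try {
            return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {           // timeout, model error, interruption
            f.cancel(true);
            return heuristic(candidates); // degrade, never fail the search
        }
    }

    // Cheap deterministic fallback: rank by price ascending.
    List<Flight> heuristic(List<Flight> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingInt(Flight::priceInr))
                .toList();
    }
}
```

The production version would add request batching and model warmup in front of this, but the contract stays the same: the model is an optimisation, the heuristic is the guarantee.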

AI-First Engineering Mandate :

- Platform Engineers at every level are responsible for building systems that AI and ML can run on, and increasingly, systems that are AI themselves.

- ML Serving Infrastructure: your APIs must serve model predictions at p99 <20ms with graceful fallbacks; you design the latency budget allocation

- Feature Pipeline Engineering: real-time feature computation (Kafka Streams, Flink) feeding the feature store at sub-second freshness

- RAG Backend Systems: vector store integration, embedding generation pipelines, document chunking and indexing for knowledge retrieval

- Agentic Workflow Infrastructure: durable execution systems (Temporal) for multi-step LLM agent workflows with retry and compensation logic

- Voice AI Backend: ASR request routing, low-latency TTS pipelines, spoken intent API design for conversational booking flows

- Recommendation API Design: serving infrastructure for collaborative filtering, session-based models, and personalised ranking endpoints

- Price Intelligence Pipelines: real-time competitive price ingestion, fare change event streaming, lower-price guarantee trigger systems

- A/B Experiment Infrastructure: feature flags, traffic splitting, metric collection, and experiment configuration systems

- MCP Tool Orchestration: building the tool-use APIs that LLM agents call to execute booking, modify, and cancel operations safely

Infrastructure & Scale Context :

- The systems you will work on or depend on :

- Compute

- AWS EKS (Kubernetes): 500+ pods, autoscaling, spot instance optimisation

- Cache

- Multi-tier: L1 in-process (Caffeine), L2 Redis Cluster, L3 CDN edge cache; 10M+ keys/sec

- Streaming

- Apache Kafka

- Storage

- Polyglot : DynamoDB (booking state), Aerospike (fare cache), Elasticsearch (search), Aurora (OLTP)

- Observability

- OpenTelemetry traces → Jaeger; Prometheus metrics → Grafana; structured logs → Loki; SLO dashboards

- ML Serving

- Real-time: TorchServe / Triton on GPU nodes, p99 < 20ms.

- Batch: Spark on EMR

- Feature Store

- Feast: 300+ features, online (Redis) + offline (S3/Hive), sub-10ms online reads

- APIs

- REST + gRPC; 200+ internal services; 50M+ external API calls/day; contract testing via Pact

- Security

- OAuth2/JWT, Vault for secrets, AWS IAM, zero-trust internal service mesh

Technical Depth Required :

- Distributed systems mastery: CAP/PACELC, consensus (Raft/Paxos awareness), eventual consistency, CRDTs, vector clocks, distributed sagas

- High-throughput API design: backpressure, rate limiting, request hedging, tail latency optimisation, adaptive timeout patterns

- JVM performance: GC tuning, heap sizing, thread pool configuration, profiling (async-profiler, JFR), off-heap optimisation

- Kafka internals: partition leadership, consumer group rebalancing, ISR management, log compaction, exactly-once semantics

- Cache system design: Redis cluster topology, Lua scripting for atomic operations, pub/sub for invalidation, keyspace notifications

- Observability engineering: trace sampling strategies, metric cardinality management, log aggregation at scale, SLO mathematics

- ML serving: GPU scheduling, model parallelism, KV cache for LLM inference, quantisation trade-offs, serving latency profiling

- Security: zero-trust service mesh, mTLS, JWT validation patterns, secrets rotation, OWASP API security at scale

Who You Are :

- 10 to 15 years in backend/distributed systems engineering; have designed and delivered tier-1 systems handling 100M+ users

- Authored RFCs, ADRs, or design documents that became organisation-wide standards, not just proposals that gathered dust

- Deep expertise in at least three of: distributed systems, event streaming, ML infrastructure, high-throughput APIs, cache architecture, observability engineering

- You code: your PRs are merged, reviewed, and used; you are not purely advisory

- Track record of identifying systemic problems before they become incidents and fixing them without breaking production

- Tier-I institute strongly preferred (IIT / IIIT / NIT / IISC / BITS CSE / ISE / ECE)

Technology Stack :

- Backend: Java, Kotlin, Go, gRPC, REST, GraphQL (awareness)

- Architecture: Microservices, Event-Driven, CQRS/ES, Saga pattern, Outbox pattern

- Streaming: Apache Kafka, Apache Flink, Kafka Streams, Schema Registry (Avro/Protobuf)

- Cache: Redis Cluster, Aerospike, Caffeine (in-process), CDN edge caching

- Storage: DynamoDB, Aurora MySQL/PostgreSQL, Elasticsearch, S3, Delta Lake

- ML Infra: Triton Inference Server, TorchServe, Feast, vLLM, Temporal

- Cloud / Infra: AWS (EKS, EC2, RDS, SQS, SNS, Lambda), Terraform, Helm, ArgoCD

- Observability: OpenTelemetry, Prometheus, Grafana, Jaeger, Loki, PagerDuty

