Description :
Lead Software Engineer - Platform Engineering
- Experience : 7 to 10 years
Delivery Ownership :
- Hands On Engineering
- API Design
- Team Technical Health
The Role :
- Lead Engineers make delivery happen reliably, with quality, at pace.
- You are the person who takes a complex requirement, breaks it into well-scoped engineering tasks, assigns and reviews the work, makes the design calls for your scope, and ships.
You are hands-on :
- Your code is in production.
You are a multiplier :
- The engineers around you are faster and better because of your presence.
- Own technical delivery for a feature area or service cluster within a platform vertical
- Hands-on : 40 to 60% of your time is spent writing and reviewing production code
- Your API and data model decisions are sound enough that Staff Engineers do not need to revisit them
- The engineers on your immediate team grow through your code reviews, pairing, and feedback
Core Responsibilities :
Technical Delivery :
- Own the technical design for your scope : API contracts, data models, service boundaries, caching strategies, and event schemas documented in ADRs or lightweight design docs
- Write production-grade code in Java/Kotlin/Go : clean, tested, observable, and deployable without heroics
- Lead code reviews for your team: catch logic errors and design issues, and teach better patterns, so that your reviews reduce bugs in future PRs, not just the current one
- Break down complex requirements into well-scoped, estimated engineering tasks with clear acceptance criteria; no ambiguous user stories on your board
- Manage technical risk: identify dependencies, flag blockers early, coordinate with adjacent teams, and update timelines before they slip
Service Reliability :
- Own the reliability of your services: define SLOs with your tech lead, build monitoring dashboards, write runbooks, and participate in on-call rotation
- Conduct performance testing before major launches: load profiles, latency percentiles (p50/p95/p99), error rates under load, and saturation points
- Manage service-level technical debt: identify, quantify, schedule, and deliver tech debt reduction sprints without dropping feature velocity
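As an illustration of the latency percentiles (p50/p95/p99) named above, here is a minimal sketch that uses the nearest-rank method over a sample of request latencies. The class name LatencyPercentiles is hypothetical, and the nearest-rank definition is an assumption; load-testing tools often interpolate or use histograms instead.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over a sample of request latencies in milliseconds. */
final class LatencyPercentiles {
    private final long[] sorted;

    LatencyPercentiles(long[] latenciesMs) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        sorted = latenciesMs.clone();
        Arrays.sort(sorted);
    }

    /** Smallest sample such that at least p percent of all samples are <= it. */
    long percentile(double p) {
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        // Nearest-rank: 1-based rank = ceil(p/100 * n).
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

With samples of 1 ms through 100 ms, percentile(95) is 95 and percentile(99) is 99; the gap between p50 and p99 is what a load profile is meant to expose.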
AI/ML Feature Delivery :
- Deliver AI-powered features end-to-end: integrate ranking model APIs, build recommendation display endpoints, implement LLM response handling with timeout and fallback
- Build the instrumentation for ML model evaluation: log prediction inputs, outputs, and user feedback signals that feed offline model evaluation pipelines
- Implement feature flag controls for ML model rollout: gradual traffic splitting, kill-switch safety, and metric comparison dashboards
- Deliver Voice AI integrations: ASR API integration, spoken intent response handling, voice booking session state management
Platform Work Across All Six Verticals :
What You'll Own & Build :
Flights :
- Pricing API latency under GDS API variability
- Lead the fare pricing service: request fan-out to multiple GDS providers, parallel aggregation, timeout handling, and cached fallback pricing
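The fan-out, timeout, and cached-fallback pattern described for the fare pricing service could be sketched roughly as follows. This is an illustration under assumptions, not the actual service: FareFanOut, the provider map, and the in-memory fallback map are hypothetical, and fares are modelled as plain long values.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

final class FareFanOut {
    // Last good fare per provider and route. Assumption: a real service would keep
    // this in a shared cache and reject entries older than the freshness window.
    private final Map<String, Long> lastGoodFare = new ConcurrentHashMap<>();

    /** Queries every provider in parallel; returns the cheapest live or cached fare. */
    Optional<Long> cheapestFare(String route,
                                Map<String, Function<String, CompletableFuture<Long>>> providers,
                                long timeoutMs) {
        // collect() starts every provider call before any result is awaited.
        List<CompletableFuture<Optional<Long>>> calls = providers.entrySet().stream()
            .map(e -> quoteWithFallback(e.getKey(), e.getValue(), route, timeoutMs))
            .collect(Collectors.toList());

        return calls.stream()
            .map(CompletableFuture::join)
            .flatMap(Optional::stream)
            .min(Long::compare);
    }

    private CompletableFuture<Optional<Long>> quoteWithFallback(
            String provider, Function<String, CompletableFuture<Long>> quote,
            String route, long timeoutMs) {
        String key = provider + "|" + route;
        return quote.apply(route)
            .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
            .thenApply(fare -> {
                lastGoodFare.put(key, fare);   // remember for future fallbacks
                return Optional.of(fare);
            })
            // Timeout or provider error: use the cached fare if one exists.
            .exceptionally(err -> Optional.ofNullable(lastGoodFare.get(key)));
    }
}
```

Because each provider gets its own timeout, one slow GDS delays the response by at most timeoutMs rather than by its own latency.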
Hotels :
- Supplier rate normalisation across 50+ feed formats
- Lead the rate normalisation service: schema mapping, validation, deduplication, and the transformation pipeline feeding the unified rate store
Bus :
- Seat map real-time consistency across aggregators
- Lead the seat availability service: real-time seat map fetching, TTL-based staleness management, partial availability handling, and booking lockout on seat conflict
Train :
- PNR lifecycle state machine reliability
- Lead the PNR management service: booking confirmation, modification, cancellation, and refund state transitions with idempotent IRCTC API calls
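One way to picture a PNR state machine with idempotent calls is an explicit transition table plus a record of outcomes per idempotency key. The sketch below is illustrative only: the states, the allowed transitions, and the PnrStateMachine name are assumptions, and no real IRCTC API is modelled.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class PnrStateMachine {
    enum State { PENDING, CONFIRMED, CANCELLED, REFUNDED }

    // Assumed transition table; CONFIRMED -> CONFIRMED stands in for a modification.
    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.PENDING, EnumSet.of(State.CONFIRMED, State.CANCELLED));
        ALLOWED.put(State.CONFIRMED, EnumSet.of(State.CONFIRMED, State.CANCELLED));
        ALLOWED.put(State.CANCELLED, EnumSet.of(State.REFUNDED));
        ALLOWED.put(State.REFUNDED, EnumSet.noneOf(State.class));
    }

    private State state = State.PENDING;
    // Outcome already returned for each idempotency key.
    private final Map<String, Boolean> outcomes = new HashMap<>();

    /** Applies a transition once per key; a replayed key gets the recorded answer. */
    synchronized boolean apply(String idempotencyKey, State target) {
        Boolean previous = outcomes.get(idempotencyKey);
        if (previous != null) {
            return previous;   // retry of an earlier call: no second side effect
        }
        boolean accepted = ALLOWED.get(state).contains(target);
        if (accepted) {
            state = target;
        }
        outcomes.put(idempotencyKey, accepted);
        return accepted;
    }

    synchronized State state() {
        return state;
    }
}
```

A retried cancellation with the same key returns the original result and leaves the state untouched, which is the property that makes client retries safe.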
B2B :
- Approval workflow SLA under asynchronous delegation
- Lead the approval engine: configurable routing rules, escalation triggers, delegation management, SLA countdown, and notification delivery guarantees
Core :
- Feature flag rollout safety for platform-wide configs
- Lead the feature flag service: gradual rollout controls, audience targeting, kill-switch safety, and experiment integration with the A/B platform
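Gradual rollout with a kill switch is commonly built on deterministic bucketing: each user hashes to a stable bucket, and the flag is on for buckets below the rollout percentage. The sketch below assumes that approach; RolloutFlag is a hypothetical name and CRC32 is used only as a convenient stable hash, not as a recommendation.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class RolloutFlag {
    private final String flagName;
    private volatile int rolloutPercent;   // 0..100
    private volatile boolean killed;

    RolloutFlag(String flagName, int rolloutPercent) {
        this.flagName = flagName;
        setRolloutPercent(rolloutPercent);
    }

    void setRolloutPercent(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]");
        }
        this.rolloutPercent = percent;
    }

    /** Kill switch: turns the flag off for everyone, regardless of percentage. */
    void kill() {
        killed = true;
    }

    boolean isEnabledFor(String userId) {
        return !killed && bucketOf(userId) < rolloutPercent;
    }

    /** Stable bucket in [0, 100); salting with the flag name decorrelates flags. */
    int bucketOf(String userId) {
        CRC32 crc = new CRC32();
        crc.update((flagName + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }
}
```

Because a user's bucket never changes, raising the percentage from 20 to 50 only adds users; nobody who already had the feature loses it mid-rollout.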
The Hard Engineering Problems You'll Face :
Across all six platforms, the engineering challenges are real, non-trivial, and consequential:
Cache Invalidation at Speed :
- Fare data has a 30-second freshness window.
- A stale cache hit in the booking flow means a pricing error, a failed checkout, or a lost trust signal.
- Multi-tier cache design (L1/L2/L3), TTL strategies, event-driven invalidation via Kafka, and cache stampede prevention are all live problems.
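Cache stampede prevention is often handled with a "single-flight" load: when an entry expires, one caller reloads it and concurrent callers wait for that result instead of all hitting the backend. The following is a minimal in-process sketch under that assumption; SingleFlightCache and its injected clock are hypothetical, and it covers only one tier with a fixed TTL.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

final class SingleFlightCache<K, V> {
    private static final class Entry<V> {
        final CompletableFuture<V> value = new CompletableFuture<>();
        final long loadedAtMs;
        Entry(long loadedAtMs) { this.loadedAtMs = loadedAtMs; }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final LongSupplier clockMs;
    private final Function<K, V> loader;

    SingleFlightCache(long ttlMs, LongSupplier clockMs, Function<K, V> loader) {
        this.ttlMs = ttlMs;
        this.clockMs = clockMs;
        this.loader = loader;
    }

    V get(K key) {
        long now = clockMs.getAsLong();
        Entry<V> fresh = new Entry<>(now);
        // Atomically keep a live entry or install ours; only one caller can win.
        Entry<V> winner = entries.compute(key, (k, current) ->
            current != null && now - current.loadedAtMs < ttlMs ? current : fresh);
        if (winner == fresh) {
            // We won: load outside the map lock so other keys are not blocked.
            try {
                fresh.value.complete(loader.apply(key));
            } catch (RuntimeException e) {
                entries.remove(key, fresh);   // let the next caller retry
                fresh.value.completeExceptionally(e);
            }
        }
        // Everyone else waits on the winner's future instead of reloading.
        return winner.value.join();
    }
}
```

With a 30-second TTL, a burst of concurrent lookups for the same route triggers one load; a load that takes longer than the TTL can still be duplicated, which a production design would need to address.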
Distributed Concurrency :
- Train Tatkal opening : millions of concurrent writes for 72 berths per coach.
- Optimistic locking, distributed lease management, queue-based fairness, and atomic seat allocation without deadlock under pathological load.
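Atomic seat allocation without locks can be illustrated with compare-and-set: a berth is claimed only if it is still free at the instant of the write, so two bookings can never hold the same berth and no thread ever waits on a lock. This sketch is single-process and hypothetical (CoachBerths, numeric booking ids); in a distributed system the same idea would map to a conditional write in the datastore.

```java
import java.util.OptionalInt;
import java.util.concurrent.atomic.AtomicLongArray;

final class CoachBerths {
    private static final long FREE = 0L;
    // owner.get(i) is the booking id holding berth i + 1, or FREE.
    private final AtomicLongArray owner;

    CoachBerths(int berthCount) {
        owner = new AtomicLongArray(berthCount);
    }

    /** Claims the first free berth; returns its 1-based number, or empty if full. */
    OptionalInt allocate(long bookingId) {
        if (bookingId == FREE) {
            throw new IllegalArgumentException("booking id 0 is reserved");
        }
        for (int i = 0; i < owner.length(); i++) {
            // Succeeds for exactly one caller per berth, however many race for it.
            if (owner.compareAndSet(i, FREE, bookingId)) {
                return OptionalInt.of(i + 1);
            }
        }
        return OptionalInt.empty();
    }

    /** Frees a berth only if it is still held by the given booking. */
    boolean release(int berthNumber, long bookingId) {
        return owner.compareAndSet(berthNumber - 1, bookingId, FREE);
    }
}
```

With 72 berths and 500 concurrent attempts, exactly 72 succeed with distinct berth numbers. Fairness between callers is not addressed here; that is where the queue-based approach comes in.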
Event Ordering Guarantees :
- A booking event must be processed before its payment event.
- But Kafka guarantees ordering only within a partition, not across partitions.
- Building booking state machines with idempotency, deduplication, and out-of-order event tolerance is a continuous engineering challenge.
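A consumer that tolerates duplicates and out-of-order delivery might look like the sketch below: event ids are remembered for deduplication, and a payment that arrives before its booking is parked until the booking shows up. The class, state names, and in-memory sets are assumptions for illustration; a real consumer would persist this state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class BookingProjector {
    enum State { UNKNOWN, BOOKED, PAID }

    private final Map<String, State> states = new HashMap<>();
    private final Set<String> seenEventIds = new HashSet<>();    // deduplication
    private final Set<String> earlyPayments = new HashSet<>();   // payment seen first

    synchronized State onBookingCreated(String eventId, String bookingId) {
        // add() returns false for a redelivered event, which is then ignored.
        if (seenEventIds.add(eventId) && stateOf(bookingId) == State.UNKNOWN) {
            // If the payment overtook the booking, apply it now.
            boolean paidEarly = earlyPayments.remove(bookingId);
            states.put(bookingId, paidEarly ? State.PAID : State.BOOKED);
        }
        return stateOf(bookingId);
    }

    synchronized State onPaymentCaptured(String eventId, String bookingId) {
        if (seenEventIds.add(eventId)) {
            if (stateOf(bookingId) == State.BOOKED) {
                states.put(bookingId, State.PAID);
            } else if (stateOf(bookingId) == State.UNKNOWN) {
                earlyPayments.add(bookingId);   // park until the booking arrives
            }
        }
        return stateOf(bookingId);
    }

    synchronized State stateOf(String bookingId) {
        return states.getOrDefault(bookingId, State.UNKNOWN);
    }
}
```

Whether the payment event arrives before or after the booking event, and however many times either is redelivered, the booking ends in the same PAID state.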
Observability Gaps :
- 200+ services.
- A booking fails.
- The trace crosses 8 service boundaries.
- Without distributed tracing (OpenTelemetry/Jaeger), structured logging (correlation IDs, trace context propagation), and SLO dashboards (Prometheus + Grafana), debugging is archaeology, not engineering.
Multi-Tenancy Blast Radius :
- A B2B enterprise client's policy engine change must not affect the B2C booking flow.
- Multi-tenant isolation in shared infrastructure (API gateways, Kafka topics, DB schemas, cache namespaces) must be designed from day one.
AI Model Integration :
- Serving a ranking model in the search critical path at p99 <20ms requires GPU node management, model warmup, request batching, async inference patterns, and fallback to heuristic ranking when the model is unavailable.
AI-First Engineering Mandate :
- Platform Engineers at every level are responsible for building systems that AI and ML can run on and, increasingly, systems that are themselves AI.
- ML Serving Infrastructure: your APIs must serve model predictions at p99 <20ms with graceful fallbacks; you design the latency budget allocation
- Feature Pipeline Engineering: real-time feature computation (Kafka Streams, Flink) feeding the feature store at sub-second freshness
- RAG Backend Systems: vector store integration, embedding generation pipelines, document chunking and indexing for knowledge retrieval
- Agentic Workflow Infrastructure: durable execution systems (Temporal) for multi-step LLM agent workflows with retry and compensation logic
- Voice AI Backend: ASR request routing, low-latency TTS pipelines, spoken intent API design for conversational booking flows
- Recommendation API Design: serving infrastructure for collaborative filtering, session-based models, and personalised ranking endpoints
- Price Intelligence Pipelines: real-time competitive price ingestion, fare change event streaming, lower-price guarantee trigger systems
- A/B Experiment Infrastructure: feature flags, traffic splitting, metric collection, and experiment configuration systems
- MCP Tool Orchestration: building the tool-use APIs that LLM agents call to execute booking, modify, and cancel operations safely
Who You Are :
- 7 to 10 years in backend engineering; you have delivered at least one major feature or service from design through production monitoring
- Strong in Java or Kotlin; comfortable with Spring Boot, microservices, REST, and async messaging
- Experience with Kafka, Redis, and at least one cloud platform (AWS preferred)
- Natural collaborator: you align engineers, product managers, and QA without creating overhead or slowing the team
- You write design documents that are read before coding starts, not filed after the fact
- Tier-I institute preferred (IIT / IIIT / NIT / IISc / BITS; CSE / ISE)
Technology Stack :
Backend : Java, Kotlin, Spring Boot, Ktor, Go (exposure)
Messaging : Apache Kafka, gRPC, REST APIs, Avro/Protobuf
Storage : Redis, MySQL/Aurora, DynamoDB, Elasticsearch
AI Integration : REST/gRPC model endpoints, Feature flag SDKs, OpenTelemetry tracing
DevOps : Kubernetes, Docker, Helm, CI/CD (GitHub Actions/Jenkins), Terraform (awareness)
Posted in : Backend Development
Functional Area : Technical / Solution Architect
Job Code : 1627591