Description :

About the role :

The L3 Production Support Engineer is a backend-focused full-stack incident SME responsible for owning complex production incidents, driving root cause analysis, and implementing systemic improvements for the agentic on-call management platform. This role bridges incident command, deep backend engineering, and targeted frontend troubleshooting to ensure platform reliability at scale.

What You'll Do :

Incident Management & Leadership :

- Own Sev-1/Sev-2 incident response as incident commander or lead resolver, driving swift diagnosis and resolution

- Lead post-incident RCAs, identifying systemic issues and driving long-term fixes across backend, infrastructure, and UI

- Establish and refine incident response playbooks, runbooks, and escalation procedures

- Participate in on-call rotation as primary/secondary responder with accountability for critical systems

Backend & Infrastructure Expertise :

- Perform deep production troubleshooting : log analysis, distributed tracing, metric correlation, and profiling under pressure

- Diagnose and fix complex issues across microservices : scheduling engine, LLM orchestration, notification pipeline, and integrations

- Optimize database queries, identify locking issues, and manage migrations in PostgreSQL under production constraints

- Architect and implement Redis caching, rate limiting, and queue-based patterns for reliability and scale

- Work with Kubernetes, container orchestration, and deployment pipelines; manage rollbacks and feature toggles during incidents

Full-Stack Incident Resolution :

- Resolve end-to-end incidents regardless of origin (backend API, database, LLM vendor, or React frontend)

- Debug and ship targeted React fixes when a UI change is the fastest path to incident resolution

- Drive code-level improvements in backend services (Python/FastAPI) to harden agent flows, retry logic, and error handling

- Collaborate closely with dev teams on defects, performance bottlenecks, and architecture-level changes

Observability & Continuous Improvement :

- Design and tune monitoring, alerting, and SLO/SLI frameworks for the platform

- Maintain and evolve critical runbooks, playbooks, and knowledge base entries as patterns emerge

- Mentor L2 engineers on deep troubleshooting, escalation discipline, and incident best practices

- Drive blameless post-mortems and systemic risk reduction across the platform

On your first day, we'll expect you to have :

Backend (Primary Focus) :

- 5 - 8+ years in backend engineering with strong hands-on experience in Python/FastAPI or equivalent

- Deep knowledge of async APIs, background jobs, message queues (Celery, RabbitMQ, or similar), and distributed scheduling

- Production-grade database skills : PostgreSQL query optimization, locking, migrations, and performance tuning

- Redis expertise : caching patterns, rate limiting, streams, and pub/sub for real-time systems

- Strong observability and on-call mindset : designing alerts, understanding SLOs/SLIs, error budgets, and Sev definitions

- Proficiency with Kubernetes, Docker, container orchestration, and CI/CD pipelines (Jenkins, Bitbucket, GitHub Actions)

- Understanding of cloud infrastructure (Azure preferred) and networking fundamentals

LLM & Agentic Systems :

- Solid grasp of LLM orchestration concepts : prompt engineering, tool-calling, context windows, rate limits, and vendor-specific behavior

- Experience with LLM failure modes : hallucinations, token limits, timeout patterns, and cost/latency tradeoffs

- Knowledge of agent frameworks (LangGraph or similar) and how they compose across microservices

- Ability to debug LLM-driven flows : tracing prompts, understanding retry/backoff behavior, and validating tool outputs

Full-Stack (Secondary but Required) :

- 2 - 3+ years hands-on with React and TypeScript in production environments

- Competency reading and modifying existing React code : components, hooks, routing, state management (Redux/Context)

- Browser debugging skills : DevTools, React DevTools, network throttling, and performance profiling

- Ability to implement targeted UI fixes : form validation, error handling, API error display, and minor UX hardening

- Familiarity with frontend build pipelines : Webpack/Vite, environment configs, feature flags, and deployment strategies

Logging, Metrics & Troubleshooting :

- Expert-level log parsing and correlation across services using structured logging (JSON, correlation IDs)

- Proficiency with observability platforms (Prometheus, Grafana, Datadog, New Relic, or similar)

- Ability to construct and execute production queries under incident time pressure

- Strong scripting skills (Bash/Python) for diagnostics, automation, and custom monitoring

Required Soft Skills :

- Incident command maturity : composure under pressure, clear communication, and decisive action during critical outages

- Technical depth with breadth : deep backend knowledge plus sufficient full-stack awareness to own end-to-end incidents

- Mentorship mindset : capable of raising L2 engineers through code review, pairing, and RCA participation

- Documentation discipline : ability to capture runbooks, architecture decisions, and lessons learned clearly

- Cross-functional collaboration : working effectively with dev, SRE, platform, and business teams during incidents

Experience Requirements :

- 6 - 10 years in backend/platform/SRE roles, with at least 3 years in production support, incident response, or on-call engineering

- Proven track record leading Sev-1/Sev-2 incidents in distributed, multi-service systems

- Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools)

- Comfortable with continuous on-call rotation and on-demand availability for critical incidents