Description :

Role : SRE Team Lead

Experience : 8 - 12 years

Domain : Fintech | Microservices | Cloud-native

Role Summary :

We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational excellence of a cloud-native fintech platform built on microservices. This role combines technical leadership, architecture ownership, and deep hands-on execution.

You will lead a small SRE team while remaining actively involved in design, coding, incident response, and reliability engineering.

Key Responsibilities :

Reliability & Architecture :

- Own platform availability, latency, scalability, and resilience across environments

- Define and enforce SLOs, SLIs, error budgets, and operational KPIs

- Design and review resilience patterns : circuit breakers, retries, rate limiting, graceful degradation

- Drive chaos engineering, fault injection, and disaster-recovery readiness

Hands-on Engineering :

- Actively contribute code (Java / Node.js) for :

1. Reliability tooling

2. Platform automation

3. Observability integrations

- Review microservice architecture with engineering teams to eliminate single points of failure

Cloud & DevOps Leadership :

- Own AWS architecture (VPCs, IAM, EKS, RDS, ALB/NLB, autoscaling)

- Drive Kubernetes best practices (resource tuning, HPA, pod disruption budgets)

- Improve CI/CD pipelines for reliability, speed, and safety

Incident & Operations :

- Lead production incident response, root cause analysis (RCA), and postmortems

- Establish a blameless postmortem culture

- Reduce MTTR through automation and better observability

- Participate in escalation/on-call strategy (not firefighting 24/7)

People & Process :

- Mentor SRE DevOps and SRE Full-Stack engineers

- Define operational standards, runbooks, and SRE practices

- Work closely with product, security, and engineering leaders

Required Skills & Experience :

- 8+ years of experience in SRE / Platform / DevOps engineering

- Strong hands-on experience with :

1. AWS (EKS, EC2, RDS, IAM, CloudWatch, ALB)

2. Kubernetes & Docker

3. Microservices architectures

- Strong programming background in Java and/or Node.js

- Deep understanding of :

1. Distributed systems

2. Production debugging

3. Capacity planning

- Experience in fintech or regulated environments is a strong plus

Nice to Have :

- Experience with chaos engineering tools

- Security & compliance exposure (PCI-DSS, SOC 2, ISO)

- Prior experience building or scaling SRE teams