Team Lead - Site Reliability

VENZO TECHNOLOGIES PRIVATE LIMITED
8 - 12 Years
Chennai

Posted on: 11/02/2026

Job Description

Role Summary :


We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational excellence of a cloud-native fintech platform built on microservices. This role combines technical leadership, architecture ownership, and deep hands-on execution. You will lead a small SRE team while remaining actively involved in design, coding, incident response, and reliability engineering.


Reliability & Architecture :


- Own platform availability, latency, scalability, and resilience across environments

- Define and enforce SLOs, SLIs, error budgets, and operational KPIs

- Design and review resilience patterns : circuit breakers, retries, rate limiting, graceful degradation

- Drive chaos engineering, fault-injection, and disaster-recovery readiness


Hands-on Engineering :


- Actively contribute code (Java / Node) for reliability tooling, platform automation, and observability integrations

- Review microservice architecture with engineering teams to eliminate single points of failure


Cloud & DevOps Leadership :


- Own AWS architecture (VPCs, IAM, EKS, RDS, ALB/NLB, autoscaling)

- Drive Kubernetes best practices (resource tuning, HPA, pod disruption budgets)

- Improve CI/CD pipelines for reliability, speed, and safety


Incident & Operations :


- Lead production incident response, root cause analysis (RCA), and postmortems

- Establish blameless postmortem culture

- Reduce MTTR through automation and better observability

- Participate in escalation/on-call strategy (not firefighting 24/7)


People & Process :

- Mentor SRE DevOps and SRE Full-Stack engineers

- Define operational standards, runbooks, and SRE practices

- Work closely with product, security, and engineering leaders



Required Skills & Experience :


- 8+ years of experience in SRE / Platform / DevOps engineering

- Strong hands-on experience with AWS (EKS, EC2, RDS, IAM, CloudWatch, ALB)

- Kubernetes & Docker

- Microservices architectures

- Strong programming background in Java and/or Node.js

- Deep understanding of distributed systems, production debugging, and capacity planning

- Experience in fintech or regulated environments is a strong plus


Nice to Have :


- Experience with chaos engineering tools

- Security & compliance exposure (PCI-DSS, SOC2, ISO)

- Prior experience building or scaling SRE teams

