
Senior Site Reliability Engineer & Support Lead

BNYN SOFTTECH INDIA PRIVATE LIMITED
10 - 19 Years
Chennai

Posted on: 18/02/2026

Job Description

Banyan Software provides the best permanent home for successful enterprise software companies, their employees, and customers. We are on a mission to acquire, build and grow great enterprise software businesses all over the world that have dominant positions in niche vertical markets. In recent years, Banyan was named the #1 fastest-growing private software company in the US on the Inc. 5000 and among the top 10 fastest-growing companies by the Deloitte Technology Fast 500. Founded in 2016 with a permanent capital base set up to preserve the legacy of founders, Banyan focuses on a buy-and-hold-for-life strategy for growing software companies that serve specialized vertical markets.

Role : Senior Site Reliability Engineer (SRE) & Support Lead (Touchstream)

Location : Chennai, India

Reports to : Head of Integrations

Role Type : Hands-on senior individual contributor with support leadership responsibilities

Company & Core Product Snapshot

Touchstream is the OTT Operations Hub : a cloud-native SaaS platform for independent, end-to-end monitoring of streaming video systems (CDNs, origin, delivery chain). We serve some of the world's largest broadcasters, telco/OTT services, and streaming platforms - monitoring tens of thousands of live streams in real time.

Touchstream now unifies its best-selling CDN Monitoring and VirtualNOC products into a single platform delivering :

- Unified data & end-to-end visibility across the streaming workflow


- Best-in-class incident intelligence and RCA tooling (including timestamped evidence packs)

- Operating-model improvements via shared views, collaboration, AI MCP Servers and rich knowledge bases

- Business value and ROI reporting for capacity optimization and performance insights

Role Summary :


As Senior SRE & Support Lead, you will own production health for Touchstream's customer-facing platform and data plane, while also leading the global technical support function as part of your SRE responsibilities.


Your mission is twofold :

1. Reliability ownership :


- Ensure high availability, performance, and change safety across the system (UI/API and ingest, process & query pipelines), with strong SLO discipline and continuous improvement.


2. Support leadership :


- Run and evolve the support operation : triage, escalation, incident response coordination, and tooling, and (over time) build a strong support team in Chennai to deliver world-class customer outcomes.

This is a high-impact role at the intersection of SRE, incident management, observability engineering, and customer-facing support.

Responsibilities :


1) Reliability Ownership (Primary) :


- Define and maintain SLOs, error budgets, and service health reporting.

- Own availability and performance of :

i. Customer-facing system : UI/API

ii. Data plane : ingest, process & query pipelines

- Drive capacity planning for live-event spikes, load testing, and scaling strategies.

- Prevent recurring issues through high-quality RCAs and rigorous follow-through.

2) On-Call & Incident Management (Run the Room) :


- Build and evolve the on-call operating model : severity levels, paging rules, escalation paths, comms templates.


- Lead high-severity incidents end-to-end : triage, mitigation, rollback, "stop the bleeding" decisions, stakeholder comms.

- Track MTTA/MTTR and implement systemic improvements over time.

3) Observability for the Observability Platform (Meta-Observability) :


- Own "who watches the watcher?" - monitoring and alerting for Touchstream's monitoring pipeline itself.

- Standardize telemetry conventions (logs/metrics/traces) across services.

- Build and maintain dashboards for :

i. Ingest health (per customer / per source)

ii. Pipeline lag

iii. Query performance

iv. Alerting health

- Tune alerting to reduce noise : dedupe, routing, symptom vs cause, threshold hygiene.

4) Release Engineering & Change Safety (Bulletproof Change Management) :


- Implement guardrails : feature flags, progressive delivery/canaries, automated rollback triggers.


- Maintain release readiness practices : migration checks, backfills, customer impact assessment, capacity impacts.

- Drive change metrics : deploy frequency, change failure rate, recovery time from deploys.

5) Cost & Efficiency Ownership (Cloud Economics) :


- Monitor and optimize cost per GB ingested/stored/queried.

- Enforce retention policies, tiering, sampling, and query limits without breaking customer value.

- Make explicit capacity vs. cost tradeoffs - especially around large live events and heavy dashboards.

6) Security & Resilience Basics (Small-Team Practicality) :


- Baseline controls : access reviews, secrets management, least privilege, dependency scanning.


- Rate limiting / abuse guardrails, audit logging, security incident response readiness.

- Backup/restore and lightweight-but-real disaster recovery drills.

7) Support Leadership & Operations (Explicitly Part of the Role) :


- Serve as the senior escalation point for critical customer issues and high-impact outages.

- Own the support operating model :

i. Ticket triage, prioritization, SLAs, escalation paths, and shift handovers

ii. Runbooks, playbooks, FAQs, and knowledge base (including formats suitable for AI-assisted support / RAG)

- Establish and monitor support KPIs (SLA compliance, backlog, customer satisfaction, MTTx) and implement process improvements.

Senior Technical Support Manager :


- Partner with Engineering/Product/Integrations to turn support learnings into reliability fixes and product improvements.

- Over time : help build, mentor, and lead a team of support/NOC engineers in Chennai.

8) Customer-Impact Focus (Tenant Health & Trust) :


- Maintain per-tenant customer health views : SLO compliance, noisy sources, top offenders, recurring incident patterns.

- Collaborate with Product on operator workflows : service health panels, incident summaries, status updates.

Required Qualifications & Skills :


Technical / SRE Foundation :


- 8+ years in SRE, production operations, technical support for SaaS, or NOC/ops roles with strong reliability ownership.


- Strong Linux fundamentals; comfort with debugging distributed systems.

- Strong understanding of cloud infrastructure (AWS and/or GCP) and service operations.

- Experience with monitoring/alerting/logging stacks, incident management, and RCA practices.

- Ability to automate operational work (Python and/or shell scripting); comfort with APIs and CLI tooling.

Streaming / OTT Domain (Nice to Have) :


- Strong understanding of video streaming and delivery concepts : HLS, DASH, CMAF, ABR, CDNs, origin, HTTP, caching, DNS, SSL/TLS.


- Familiarity with AWS Media Services is a big plus.


Support Leadership & Customer Communication :


- Proven ability to run escalations and communicate clearly in high-pressure incidents.

- Experience designing support workflows, SLAs, escalation paths, and operational KPIs.

- Strong written and verbal English; confidence presenting incident status and RCAs to customers.

Working Style :


- Comfortable with flexible hours to support global customers (overlap with Europe/US time zones as needed).

- Bias for action, continuous improvement mindset, and strong ownership.

Desired / Nice-to-Have :


- Prior experience supporting high-scale, always-on streaming events and live operations.

- Experience with progressive delivery, canarying, feature-flag platforms, and release automation.

- Familiarity with IT service management frameworks (e.g., ITIL).

- Security operations exposure (secrets management, vulnerability management, audit logging).

What You'll Gain & Why Join :


- A senior, high-ownership role shaping reliability and support for a mission-critical observability platform in OTT streaming.

- Direct impact on global broadcasters and streaming services - improving viewer experience at scale.

- Opportunity to build the SRE/support operating model and grow the Chennai support function over time.

- Collaboration with a globally distributed team across engineering, integrations, operations, and product.

