Posted on: 18/02/2026
Description :
Banyan Software provides the best permanent home for successful enterprise software companies, their employees, and customers. We are on a mission to acquire, build and grow great enterprise software businesses all over the world that have dominant positions in niche vertical markets. In recent years, Banyan was named the #1 fastest-growing private software company in the US on the Inc. 5000 and amongst the top 10 fastest-growing companies by the Deloitte Technology Fast 500. Founded in 2016 with a permanent capital base set up to preserve the legacy of founders, Banyan focuses on a buy-and-hold-for-life strategy for growing software companies that serve specialized vertical markets.
Role : Senior Site Reliability Engineer (SRE) & Support Lead (Touchstream)
Location : Chennai, India
Reports to : Head of Integrations
Role Type : Hands-on senior individual contributor with support leadership responsibilities
Company & Core Product Snapshot
Touchstream is the OTT Operations Hub : a cloud-native SaaS platform for independent, end-to-end monitoring of streaming video systems (CDNs, origin, delivery chain). We serve some of the world's largest broadcasters, telco/OTT services, and streaming platforms - monitoring tens of thousands of live streams in real time.
Touchstream now unifies its best-selling CDN Monitoring and VirtualNOC into a single platform delivering :
- Unified data & end-to-end visibility across the streaming workflow
- Best-in-class incident intelligence and RCA tooling (including timestamped evidence packs)
- Operating-model improvements via shared views, collaboration, AI MCP Servers and rich knowledge bases
- Business value and ROI reporting for capacity optimization and performance insights
Role Summary :
As Senior SRE & Support Lead, you will own production health for Touchstream's customer-facing platform and data plane, while also leading the global technical support function as part of your SRE responsibilities.
Your mission is twofold :
1. Reliability ownership :
- Ensure high availability, performance, and change safety across the system (UI/API and ingest, process & query pipelines), with strong SLO discipline and continuous improvement.
2. Support leadership :
- Run and evolve the support operation : triage, escalation, incident response coordination, tooling, and (over time) building a strong support team in Chennai to deliver world-class customer outcomes.
- This is a highly impactful role at the intersection of SRE, incident management, observability engineering, and customer-facing support.
Responsibilities :
1) Reliability Ownership :
- Define and maintain SLOs, error budgets, and service health reporting.
- Own availability and performance of :
i. Customer-facing system : UI/API
ii. Data plane : ingest, process & query pipelines
- Drive capacity planning for live-event spikes, load testing, and scaling strategies.
- Prevent recurring issues through high-quality RCAs and rigorous follow-through.
2) On-Call & Incident Management (Run the Room) :
- Build and evolve the on-call operating model : severity levels, paging rules, escalation paths, comms templates.
- Track MTTA/MTTR and implement systemic improvements over time.
3) Observability for the Observability Platform (Meta-Observability) :
- Own "who watches the watcher?" - monitoring and alerting for Touchstream's monitoring pipeline itself.
- Standardize telemetry conventions (logs/metrics/traces) across services.
- Build and maintain dashboards for :
i. Ingest health (per customer / per source)
ii. Pipeline lag
iii. Query performance
iv. Alerting health
- Tune alerting to reduce noise : dedupe, routing, symptom vs cause, threshold hygiene.
4) Release Engineering & Change Safety (Bulletproof Change Management) :
- Implement guardrails : feature flags, progressive delivery/canaries, automated rollback triggers.
- Maintain release readiness practices : migration checks, backfills, customer impact assessment, capacity impacts.
- Drive change metrics : deploy frequency, change failure rate, recovery time from deploys.
5) Cost & Efficiency Ownership (Cloud Economics) :
- Monitor and optimize cost per GB ingested/stored/queried.
- Enforce retention policies, tiering, sampling, and query limits without breaking customer value.
- Make explicit capacity vs. cost tradeoffs - especially around large live events and heavy dashboards.
6) Security & Resilience Basics (Small-Team Practicality) :
- Baseline controls : access reviews, secrets management, least privilege, dependency scanning.
- Backup/restore and lightweight-but-real disaster recovery drills.
7) Support Leadership & Operations (Explicitly Part of the Role) :
- Serve as the senior escalation point for critical customer issues and high-impact outages.
- Own the support operating model :
i. Ticket triage, prioritization, SLAs, escalation paths, and shift handovers
ii. Runbooks, playbooks, FAQs, and knowledge base (including formats suitable for AI-assisted support / RAG)
- Establish and monitor support KPIs (SLA compliance, backlog, customer satisfaction, MTTx) and implement process improvements.
Senior Technical Support Manager :
- Partner with Engineering/Product/Integrations to turn support learnings into reliability fixes and product improvements.
- Over time : help build, mentor, and lead a team of support/NOC engineers in Chennai.
8) Customer-Impact Focus (Tenant Health & Trust) :
- Maintain per-tenant customer health views : SLO compliance, noisy sources, top offenders, recurring incident patterns.
- Collaborate with Product on operator workflows : service health panels, incident summaries, status updates.
Required Qualifications & Skills :
Technical / SRE Foundation :
- 8+ years in SRE, production operations, technical support for SaaS, or NOC/ops roles with strong reliability ownership.
- Strong understanding of cloud infrastructure (AWS and/or GCP) and service operations.
- Experience with monitoring/alerting/logging stacks, incident management, and RCA practices.
- Ability to automate operational work (Python and/or shell scripting); comfort with APIs and CLI tooling.
Streaming / OTT Domain (Nice to Have) :
- Strong understanding of video streaming and delivery concepts : HLS, DASH, CMAF, ABR, CDNs, origin, HTTP, caching, DNS, SSL/TLS.
- Familiarity with AWS Media Services is a big plus.
Support Leadership & Customer Communication :
- Proven ability to run escalations and communicate clearly in high-pressure incidents.
- Experience designing support workflows, SLAs, escalation paths, and operational KPIs.
- Strong written and verbal English; confidence presenting incident status and RCAs to customers.
Working Style :
- Comfortable with flexible hours to support global customers (overlap with Europe/US time zones as needed).
- Bias for action, continuous improvement mindset, and strong ownership.
Desired / Nice-to-Have :
- Prior experience supporting high-scale, always-on streaming events and live operations.
- Experience with progressive delivery, canarying, feature-flag platforms, and release automation.
- Familiarity with IT service management frameworks (e.g., ITIL).
- Security operations exposure (secrets management, vulnerability management, audit logging).
What You'll Gain & Why Join :
- A senior, high-ownership role shaping reliability + support for a mission-critical observability platform in OTT streaming.
- Direct impact on global broadcasters and streaming services - improving viewer experience at scale.
- Opportunity to build the SRE/support operating model and grow the Chennai support function over time.
- Collaboration with a globally distributed team across engineering, integrations, operations, and product.
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1613925