BT Global - Site Reliability Engineer

BT E SERV INDIA PRIVATE LIMITED
7 - 15 Years
Bangalore

Posted on: 05/03/2026

Job Description

Women Candidates Preferred


About the job :


BT Digital is looking for a hands-on SRE Specialist who thrives in high-scale, cloud-native, automation-driven environments. If reliability engineering, AI-assisted operations, and next-gen observability excite you, this role is for you.

What you'll do :

- Build and operate CI/CD, IaC, GitOps, and containerized automation following strong engineering and security practices.

- Partner with delivery/QA to plan and execute efficient, high-quality testing.

- Automate repetitive tasks and reduce MTTR through scripts, runbooks, and safeguarded workflows.

- Handle Tier-2/3 incidents: diagnose, recover, document, and drive preventive fixes.

- Improve observability: refine metrics, logs, traces, dashboards, and alerts for better signal and less noise.

- Apply SRE principles: define SLIs/SLOs, manage error budgets, and recommend data-driven reliability improvements.

- Strengthen release reliability: maintain stable pipelines and eliminate configuration drift.

- Support on-call readiness with solid runbooks, rollback safety, DR drills, and game-day participation.

- Identify and escalate reliability risks across systems and environments.

- Collaborate across engineering teams and document clearly for shared learning.

Skills required for the job :

Core SRE & Engineering Skills :

- Expertise in end-to-end observability/monitoring tools (e.g., Dynatrace) to assess system health and performance.

- Proficiency in programming (Java, Python) for building automation and tooling.

- Hands-on experience with cloud platforms (AWS/Azure/GCP) and managing distributed systems.

- Strong knowledge of software architecture, design patterns, and microservices.

- Practical experience with CI/CD, DevOps practices, and continuous testing.

- Skilled in infrastructure-as-code and pipeline automation for scalable, secure deployments.

Reliability, Operations & Continuous Improvement :


- Ability to apply SRE principles: automation, toil reduction, incident learning, and reliability improvements.


- Experience diagnosing issues in complex, distributed systems.

- Comfortable supporting 24x7 operational environments and managing high-priority incidents.

- Strong analytical and reporting skills to communicate insights and risks.

- Focus on process improvement using data and automation.

- Adaptability to new technologies with a continuous learning mindset.

AI-Driven Observability & AIOps :

- Understanding of AIOps fundamentals: telemetry ingestion, event correlation, topology modeling, and automated remediation.

- Experience with AI-assisted observability for anomaly detection and faster incident resolution.

- Ability to design AI-driven, context-aware alerting to reduce noise and improve prioritization.

Experience you would be expected to have :

Must have a minimum of 7 years of work experience :

- Incident Response & Operations: Handle Tier 2/3 incidents, drive quick recovery, perform RCA, and enhance runbooks.

- SRE Fundamentals: Apply SLIs/SLOs, reduce toil through automation, support on-call, and highlight reliability risks early.

- Observability Expertise: Use metrics/logs/traces, optimise dashboards/alerts, and ensure end-to-end monitoring.

- Automation & CI/CD: Maintain reliable pipelines, use IaC to prevent drift, and create automation to speed deployments and recovery.

- Resilience, Governance & Collaboration: Contribute to DR/game days, follow safe change practices, measure reliability outcomes, and collaborate across teams while continuously learning.


The job is for:

Women candidates preferred
For women returning to the workforce