Posted on: 05/03/2026
Women Candidates Preferred
About the job :
- Partner with delivery/QA to plan and execute efficient, high-quality testing.
- Automate repetitive tasks and reduce MTTR through scripts, runbooks, and safeguarded workflows.
- Handle Tier-2/3 incidents: diagnose, recover, document, and drive preventive fixes.
- Improve observability: refine metrics, logs, traces, dashboards, and alerts for better signal and less noise.
- Apply SRE principles: define SLIs/SLOs, manage error budgets, and recommend data-driven reliability improvements.
- Strengthen release reliability: maintain stable pipelines and eliminate configuration drift.
- Support on-call readiness with solid runbooks, rollback safety, DR drills, and game-day participation.
- Identify and escalate reliability risks across systems and environments.
- Collaborate across engineering teams and document clearly for shared learning.
Skills required for the job :
Core SRE & Engineering Skills :
- Expertise in end-to-end observability/monitoring tools (e.g., Dynatrace) to assess system health and performance.
- Proficiency in programming (Java, Python) for building automation and tooling.
- Hands-on experience with cloud platforms (AWS/Azure/GCP) and managing distributed systems.
- Strong knowledge of software architecture, design patterns, and microservices.
- Practical experience with CI/CD, DevOps practices, and continuous testing.
- Skilled in infrastructure-as-code and pipeline automation for scalable, secure deployments.
Reliability, Operations & Continuous Improvement :
- Ability to apply SRE principles: automation, toil reduction, incident learning, and reliability improvements.
- Experience diagnosing issues in complex, distributed systems.
- Comfortable supporting 24x7 operational environments and managing high-priority incidents.
- Strong analytical and reporting skills to communicate insights and risks.
- Focus on process improvement using data and automation.
- Adaptability to new technologies with a continuous learning mindset.
AI-Driven Observability & AIOps :
- Understanding of AIOps fundamentals: telemetry ingestion, event correlation, topology modeling, and automated remediation.
- Experience with AI-assisted observability for anomaly detection and faster incident resolution.
- Ability to design AI-driven, context-aware alerting to reduce noise and improve prioritization.
Experience you would be expected to have :
Must have a minimum of 7 years of work experience :
- Incident Response & Operations: Handle Tier 2/3 incidents, drive quick recovery, perform RCA, and enhance runbooks.
- SRE Fundamentals: Apply SLIs/SLOs, reduce toil through automation, support on-call, and highlight reliability risks early.
- Observability Expertise: Use metrics/logs/traces, optimise dashboards/alerts, and ensure end-to-end monitoring.
- Automation & CI/CD: Maintain reliable pipelines, use IaC to prevent drift, and create automation to speed deployments and recovery.
- Resilience, Governance & Collaboration: Contribute to DR/game days, follow safe change practices, measure reliability outcomes, and collaborate across teams while continuously learning.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1618264