Posted on: 20/01/2026
DevOps & Site Reliability:
- Strong understanding of DevOps and SRE principles, with a focus on reliability, scalability, and automation.
- Hands-on experience with CI/CD pipelines (build, test, deploy, rollback) using modern tooling.
- Expertise in cloud platforms (AWS / Azure / GCP) and cloud-native architectures.
- Proficiency in infrastructure as code (Terraform, CloudFormation, ARM templates).
- Experience managing containerized workloads using Docker and orchestration platforms like Kubernetes.
- Deep knowledge of system monitoring, alerting, and observability (metrics, logs, traces).
- Ability to design and maintain high-availability and fault-tolerant systems.
- Strong understanding of Linux systems, networking, and security best practices.
- Experience with incident management, root cause analysis (RCA), and postmortems.
- Ability to collaborate closely with engineering, product, and security teams.
Key Result Areas (KRAs):
1. Platform Reliability & Availability
- Own and improve system uptime, SLA/SLO adherence, and service health across environments.
- Proactively identify reliability risks and implement preventive measures.
- Reduce P95/P99 latency, error rates, and system bottlenecks.
2. Automation & Efficiency
- Automate infrastructure provisioning, deployments, scaling, and recovery processes.
- Minimize manual intervention through self-healing systems and automation.
- Continuously increase deployment frequency while reducing deployment failure rates.
3. Monitoring, Alerting & Observability
- Build and maintain effective monitoring and alerting frameworks.
- Ensure alerts are actionable, with low noise and high signal.
- Give teams dashboards and insights to understand system behavior.
4. Incident Management & Response
- Lead or support production incident response, ensuring rapid mitigation.
- Drive structured root cause analysis and ensure learnings translate into system improvements.
- Maintain and improve incident runbooks and on-call readiness.
5. Scalability & Performance
- Design systems that scale efficiently with business growth.
- Conduct load testing and capacity planning to support peak traffic.
- Continuously optimize infrastructure cost without compromising reliability.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1603653