Almabase - Tech Lead - DevOps & Site Reliability

Almabase
Others
3 - 5 Years
Rating: 3.8 (11+ Reviews)

Posted on: 20/01/2026

Job Description

DevOps & Site Reliability:


- Strong understanding of DevOps and SRE principles, with a focus on reliability, scalability, and automation.


- Hands-on experience with CI/CD pipelines (build, test, deploy, rollback) using modern tooling.


- Expertise in cloud platforms (AWS / Azure / GCP) and cloud-native architectures.


- Proficiency in infrastructure as code (Terraform, CloudFormation, ARM templates).


- Experience managing containerized workloads using Docker and orchestration platforms like Kubernetes.


- Deep knowledge of system monitoring, alerting, and observability (metrics, logs, traces).


- Ability to design and maintain high-availability and fault-tolerant systems.


- Strong understanding of Linux systems, networking, and security best practices.


- Experience with incident management, root cause analysis (RCA), and postmortems.


- Ability to collaborate closely with engineering, product, and security teams.


Key Result Areas (KRAs):


1. Platform Reliability & Availability


- Own and improve system uptime, SLA/SLO adherence, and service health across environments.


- Proactively identify reliability risks and implement preventive measures.


- Reduce P95/P99 latency, error rates, and system bottlenecks.
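For candidates less familiar with the metrics named in this KRA, here is a minimal sketch of how P95/P99 latency and SLO error budget are commonly computed. The function names, the nearest-rank percentile method, and the sample numbers are illustrative assumptions and do not describe Almabase's actual tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent in a measurement window.

    slo_target is a success ratio such as 0.999 (hypothetical value).
    A negative result means the SLO was breached for this window.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds: 1, 2, ..., 100.
    latencies_ms = list(range(1, 101))
    print("P95:", percentile(latencies_ms, 95))
    print("P99:", percentile(latencies_ms, 99))
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print("Budget left:", error_budget_remaining(0.999, 1_000_000, 250))
```

In practice these figures usually come from a monitoring system rather than hand-rolled code; the sketch only shows what the numbers mean.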


2. Automation & Efficiency


- Automate infrastructure provisioning, deployments, scaling, and recovery processes.


- Minimize manual intervention through self-healing systems and automation.


- Continuously improve deployment frequency while reducing failure rates.


3. Monitoring, Alerting & Observability


- Build and maintain effective monitoring and alerting frameworks.


- Ensure actionable alerts with low noise and high signal.


- Enable teams with dashboards and insights to understand system behavior.


4. Incident Management & Response


- Lead or support production incident response, ensuring rapid mitigation.


- Drive structured root cause analysis and ensure learnings translate into system improvements.


- Maintain and improve incident runbooks and on-call readiness.


5. Scalability & Performance


- Design systems that scale efficiently with business growth.


- Conduct load testing and capacity planning to support peak traffic.


- Continuously optimize infrastructure cost without compromising reliability.
