
Job Description

Role Overview :


We are seeking a Site Reliability Engineer who blends software engineering with systems, infrastructure, and observability expertise. You will own availability, performance, scalability, and production readiness across services, driving automation, reducing toil, and enabling fast, safe delivery. Your work will directly impact public safety by supporting a mission-critical platform.

Key Responsibilities :

- Define and implement SLIs/SLOs; operationalize error budgets and trigger corrective actions.

- Engineer end-to-end observability using Datadog to accelerate detection and root cause analysis.

- Automate infrastructure (Terraform), deployment workflows, and self-healing mechanisms.

- Lead the incident lifecycle: detection, triage, mitigation, and post-incident reviews.

- Build and optimize CI/CD pipelines for reliability and safe rollbacks.

- Partner with development teams on architectural reviews, production readiness, and security.

- Champion reliability patterns, performance tuning, and proactive failure analysis.

Preferred Qualifications :

- 5+ years of experience in SRE, Production Engineering, or DevOps.

- Proficient in Go, Python, TypeScript/Node.js, or Ruby.

- Strong knowledge of Linux internals and networking fundamentals, plus cloud infrastructure experience (AWS).

- Hands-on experience with Terraform, GitOps, containers, orchestration, and observability tools (Datadog, Prometheus, etc.).

- Experience with distributed systems, CI/CD, incident management, and chaos/fault injection.

What We Offer :


- Shape quality and reliability strategy for a mission-driven platform.

- Work with modern infrastructure, observability tooling, distributed systems, and AI-driven systems.

- Collaborate with teams modernizing architecture (microservices, serverless, event streaming).

- Flexible and collaborative work culture focused on learning, autonomy, and impact.

