About the Client :

The organization is a mission-driven technology company focused on creating safer communities through innovative school bus safety solutions across North America.

By leveraging AI-powered cameras and cloud-connected systems, it helps school districts monitor and enforce traffic laws around school buses, protecting students as they travel to and from school.

The company partners closely with local governments, law enforcement, and transportation agencies to deliver end-to-end programs that include installation, monitoring, violation processing, and public education, all at no cost to taxpayers.

About the role :

The client is scaling a mission-critical safety and automation platform as it evolves from a monolith into distributed, event-driven, microservice-based systems. Reliability, latency, and operational efficiency are foundational, not afterthoughts.

We're seeking a Site Reliability Engineer (SRE) who blends software engineering discipline with systems, infrastructure, and observability expertise. You'll own availability, performance, scalability, and production readiness across services, driving automation, reducing toil, and enabling fast, safe delivery.

This role is for someone who wants to shape a modern reliability culture while protecting a platform that directly advances road safety through real-time data, analytics, and AI.

Key Responsibilities :

- Define, implement, and iterate SLIs/SLOs (latency, availability, errors, saturation); operationalize error budgets and trigger corrective action.

- Engineer end-to-end observability (metrics, logs, traces, events) leveraging Datadog to accelerate detection and root cause analysis.

- Automate infrastructure (Terraform), deployment workflows, self-healing mechanisms, and progressive delivery (canary / blue-green).

- Lead the incident lifecycle: detection, triage, mitigation, coordination, communication, and high-quality post-incident reviews that drive systemic fixes.

- Build and optimize CI/CD pipelines (GitHub Actions or equivalent) with reliability, rollback safety, and change quality controls.

- Perform capacity & performance engineering: load modeling, autoscaling policies, cost/efficiency tuning.

- Reduce toil via tooling, runbooks, proactive failure analysis, chaos / fault injection (AWS FIS or similar).

- Partner with development teams on architectural reviews and production readiness (operability, resilience, security, observability).

- Enforce least-privilege access, secrets management, and infrastructure security; integrate policy as code.

- Improve alert quality (noise reduction, actionable context) to lower MTTR and alert fatigue.

- Champion reliability patterns: backpressure, graceful degradation, and circuit breaking.

- Support distributed systems debugging (timeouts, partial failures, consistency anomalies) with emphasis on AI.

- Contribute to governance of change management, deployment health gates, and release safety.

- Document playbooks, escalation paths, and evolving reliability standards.

- Treat reliability as a product: roadmap, KPIs, stakeholder alignment, continuous improvement.

Preferred Qualifications :

- 3+ years in SRE / Production Engineering / DevOps.

- Proficient in one or more: Go, Python, TypeScript/Node.js, or Ruby for automation, tooling, and services.

- Strong Linux internals and networking fundamentals (DNS, TLS, HTTP, routing, load balancing).

- Hands-on Infrastructure as Code (Terraform) and GitOps workflows.

- Containers & orchestration (AWS ECS) including resource tuning & scaling strategies.

- Production-grade observability: Prometheus, Grafana, OpenTelemetry, ELK, Datadog (preferred).

- CI/CD design (pipelines, promotion strategies, automated verification, rollout / rollback).

- Full incident management lifecycle & quantitative postmortem practices.

- Experience with distributed systems failure modes (latency spikes, retry storms, thundering herds).

- Chaos / fault injection frameworks (AWS FIS preferred).

- Performance / load testing (k6, Locust, Gatling) and profiling for bottleneck isolation.

- BS/MS in Computer Science, Engineering, or equivalent practical expertise.

Mindset & Behaviors :

- Bias for automation and measurable reliability outcomes.

- Calm, clear communicator under pressure; drives clarity during ambiguity.

- Sees reliability as a product with customers, SLAs, and iteration cycles.

- Data-driven; prefers leading indicators over reactive firefighting.

- Raises the bar for operational excellence and shared ownership.

Why Join Us :

- Shape quality & reliability strategy for a modern, mission-driven safety platform.

- Direct impact: your work protects communities and improves public safety outcomes.

- Work across observability, distributed systems, infrastructure automation, and high-velocity delivery.

- Influence engineering culture: shift-left reliability, proactive resilience, sustainable on-call.

- Collaborate with teams modernizing architecture (microservices, event streaming, serverless, edge).

- Leverage advanced tooling (Datadog, Terraform, progressive delivery frameworks).

- Join a culture focused on learning loops, autonomy, and meaningful impact.