
Status Neo - Senior Observability Engineer - Kubernetes

Posted on: 07/11/2025

Job Description

Title : Senior Observability Engineer

Experience : 7+ years

Location : Remote

Type : Full-time

About the role :

We want one observability standard across the stack. Today : some services emit metrics, some don't; the frontend isn't fully measured; and infra, GitOps, and app alerts all land in the same place. You will design and roll out the observability blueprint : browser → frontend (React/Next.js/RUM) → API/GraphQL → Kubernetes → alert routing, so we can see real user performance, track P95/P99, keep SLOs for critical envs, and send the right alerts to the right teams.

Must-have :

- 8+ years in Observability / SRE / Platform / Backend with production systems.

- Strong Kubernetes experience (agents/DaemonSets, metadata enrichment).

- Hands-on OpenTelemetry experience for Go, Node.js, and the browser : knows what to add to code, not just how to turn on APM.

- RUM experience in Datadog (preferred), Grafana, or Dynatrace, and able to set up FE → BE correlation.

- Solid Prometheus/Grafana or Datadog skills (histograms, recording rules, alerting rules).

- Proven track record of reducing alert noise and standardizing SLOs across teams.

What you will do :

- Enable RUM & frontend metrics for React/Next.js (Web Vitals, SPA nav, network timing, JS errors) and link them to backend traces.

- Instrument API/GraphQL calls so user actions can be traced down to slow endpoints/services.

- Define SLO/SLI templates (latency, availability, error rate) per environment/client.

- Design alert strategy : severity levels (P1/P2/P3), who gets what (infra vs app vs GitOps), Slack/Teams routing, and escalation.

- Standardize observability across the stack :

  - Standard labels/tags (service, env, version, tenant)

  - Standard dashboards per service (traffic, errors, latency, saturation)

  - Standard alert rules (burn-rate, error spike, high latency)

  - Standard OTel/Alloy collector config checked into Git

- Run OTel / Grafana Alloy pipelines for metrics, logs, traces with sampling and cardinality controls.

- Ship golden dashboards for : Frontend UX, API performance, Service/K8s health.

- Keep it GitOps : dashboards, alert rules, and collector configs as code (Helm/Kustomize/Terraform).

