Posted on: 07/11/2025
Description :
Title : Senior Observability Engineer
Experience : 7+ years
Location : Remote
Type : Full-time
Job Description :
About the role :
We want one observability standard across the stack. Today : some services emit metrics, some don't; the frontend isn't fully measured; infra, GitOps, and app alerts all land in the same place. You will design and roll out the observability blueprint, from browser → frontend (React/Next.js/RUM) → API/GraphQL → Kubernetes → alert routing, so we can see real user performance, track P95/P99 latency, keep SLOs for critical environments, and send the right alerts to the right teams.
Must-have :
- 8+ years in Observability / SRE / Platform / Backend with production systems.
- Strong Kubernetes experience (agents/DaemonSets, metadata enrichment).
- Hands-on OpenTelemetry experience for Go, Node.js, and the browser; knows what to add to code, not just how to turn on APM.
- RUM experience in Datadog (preferred), Grafana, or Dynatrace, and can set up frontend-to-backend correlation.
- Solid Prometheus/Grafana or Datadog skills (histograms, recording rules, alerting rules).
- Proven track record of reducing alert noise and standardizing SLOs across teams.
What you will do :
- Enable RUM & frontend metrics for React/Next.js (Web Vitals, SPA nav, network timing, JS errors) and link them to backend traces.
- Instrument API/GraphQL calls so user actions can be traced down to slow endpoints/services.
- Define SLO/SLI templates (latency, availability, error rate) per environment/client.
- Design alert strategy : severity levels (P1/P2/P3), who gets what (infra vs app vs GitOps), Slack/Teams routing, and escalation.
- Standardize observability across the stack :
  - Standard labels/tags (service, env, version, tenant)
  - Standard dashboards per service (traffic, errors, latency, saturation)
  - Standard alert rules (burn-rate, error spike, high latency)
  - Standard OTel/Alloy collector config checked into Git
- Run OTel / Grafana Alloy pipelines for metrics, logs, and traces, with sampling and cardinality controls.
- Ship golden dashboards for : Frontend UX, API performance, Service/K8s health.
- Keep it GitOps : dashboards, alert rules, and collector configs as code (Helm/Kustomize/Terraform).
Posted By
Mohammed Rawoof
Sr. Talent Analyst at StatusNeo Technology Consulting Pvt. Ltd
Last Active: 4 Dec 2025
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1570379