Description:
We are seeking a Senior Observability Engineer with strong expertise in designing, implementing, and optimising observability solutions. In this role, you will shape the future of observability at Cognite: assessing existing observability frameworks, identifying gaps, and building robust capabilities for log aggregation, event correlation, noise reduction, and telemetry analysis that enable proactive operational excellence and reliability for our services.
Responsibilities:
- Conduct assessments of existing observability architectures to identify gaps and improvement opportunities.
- Design and implement scalable log aggregation pipelines for centralised and efficient data collection.
- Apply noise-reduction techniques to filter irrelevant or false-positive alerts, enhancing focus on actionable issues.
- Develop and maintain monitoring dashboards that deliver actionable insights across applications and infrastructure.
- Lead the migration from Lightstep to Honeycomb, ensuring seamless data pipeline transitions, OpenTelemetry alignment, and stakeholder adoption.
- Collaborate with infrastructure and product teams to integrate observability tooling into CI/CD workflows and cloud environments.
- Analyse telemetry data (metrics, logs, traces) to troubleshoot complex system behaviours and recommend improvements.
- Participate in production debugging and incident troubleshooting using telemetry data.
- Mentor junior engineers on log management, event correlation, distributed tracing, and alert management.
- Stay current on observability innovations and recommend adoption strategies aligned with organisational goals.
- Support post-incident reviews and continuous improvement through data-driven root cause analysis.
- Drive continuous improvement in reliability and operational excellence through proactive observability initiatives.
Requirements:
- 6+ years of experience in software or systems engineering, with at least 3 years focused on observability or SRE practices.
- Hands-on experience with observability tools such as Honeycomb, VictoriaMetrics, Lightstep, Prometheus, Grafana, OpenTelemetry, Splunk, Datadog, or New Relic.
- Strong knowledge of OpenTelemetry instrumentation (metrics, traces, logs) and SLIs/SLOs for reliability tracking.
- Experience with distributed tracing, event correlation, and noise-reduction frameworks.
- Proficiency in one or more programming/scripting languages such as Python, Java, Kotlin, Go, or Shell.
- Working knowledge of Infrastructure as Code (Terraform) and CI/CD pipelines such as Jenkins or GitHub Actions.
- Familiarity with cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Strong analytical, troubleshooting, and communication skills with the ability to work effectively across teams.
- Experience conducting observability gap assessments and defining improvement plans.
- Experience working in complex or multi-cloud environments is preferred.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1621727