Posted on: 17/12/2025
Description:
- Develop and maintain observability standards, NMS tooling, dashboards, alerting frameworks, and SLOs in collaboration with product and platform teams.
- Champion best practices in instrumentation, monitoring, and incident response across engineering teams.
- Integrate and optimise observability tools (e.g., OpenTelemetry, Prometheus, Grafana, Splunk, Elastic) within the NPS ecosystem.
- Collaborate cross-functionally to ensure observability is embedded into the SDLC and CI/CD pipelines.
- Drive adoption of observability platforms through enablement, documentation, and training.
- Continuously evaluate emerging technologies and practices to evolve our observability capabilities.
Required Skills and Experience:
- Proven experience in observability, SRE, or platform engineering roles within complex, distributed environments.
- Strong hands-on expertise with telemetry tools such as OpenTelemetry, Prometheus, Grafana, Splunk, Elastic, Loki, Jaeger, or similar.
- Proficiency in at least one programming language (e.g., Python, Go, Java) and infrastructure-as-code tools (e.g., Terraform, Helm).
- Deep understanding of cloud-native architectures (Kubernetes, microservices, service meshes).
- Experience defining and managing SLOs, SLIs, and alerting strategies.
- Strong problem-solving skills and a passion for improving system reliability and developer experience.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1591752