Posted on: 07/03/2026
Description :
Role Overview :
We are looking for an experienced Observability/Platform Engineer with a strong software engineering background to design, build, and operate scalable observability platforms. The ideal candidate will have hands-on experience working with distributed systems and modern observability tools to monitor, troubleshoot, and optimize production systems.
This role requires deep expertise in metrics, logs, traces, and telemetry-based debugging, along with experience working in cloud-native and Kubernetes environments.
Key Responsibilities :
- Design, build, and maintain observability services and platforms that provide visibility into distributed systems.
- Develop and maintain software components using Golang, Java, Python, or C#.
- Implement monitoring solutions for metrics, logs, traces, and events across large-scale systems.
- Build and operate observability pipelines using tools such as Prometheus, Grafana, OpenSearch/Elasticsearch, Jaeger, Tempo, and Datadog.
- Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure system reliability and performance.
- Troubleshoot production issues using telemetry data and observability tools.
- Optimize system performance by analyzing latency, throughput, concurrency, and memory usage.
- Collaborate with platform, infrastructure, and application teams to ensure observability is embedded across systems.
- Operate and manage services in Kubernetes and cloud environments.
- Ensure systems are scalable, resilient, and highly available in production.
Required Skills & Expertise :
Programming :
- Production experience in Golang, Java, Python, or C#
- Strong software engineering fundamentals
Observability & Monitoring :
- Understanding of metrics, logs, traces, and events
- Experience defining and managing SLIs and SLOs
- Experience with observability tools such as:
  - Prometheus
  - Grafana
  - OpenSearch / Elasticsearch
  - Jaeger
  - Tempo
  - Datadog
Systems & Infrastructure :
- Experience building or operating distributed systems
- Hands-on experience working with Kubernetes
- Experience with cloud environments
Performance & Reliability :
- Strong understanding of performance optimization
- Knowledge of concurrency and memory management
- Ability to debug and resolve production issues using telemetry
Experience & Qualifications :
- Minimum 5 years of hands-on experience building or maintaining observability services.
- Strong understanding of distributed system architecture and reliability engineering.
- Experience working in production environments with large-scale systems.
- Strong problem-solving, debugging, and troubleshooting capabilities.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1618735