Location : Bangalore, Gurgaon, Pune, Mumbai, Delhi, Chennai, Hyderabad, Noida

Experience : 6 to 9 Years

Notice Period : Immediate to 15 Days

Role Overview :

We are seeking an experienced Observability Engineer to design, build, and operate the observability foundation for complex, distributed systems. This role focuses on enabling engineering teams to understand, troubleshoot, and optimize systems using high-quality metrics, logs, traces, and insights.

As an Observability Engineer, you will build the nervous system of the platform: developing scalable telemetry pipelines, defining standards, and empowering teams with actionable visibility. You will work across application, platform, SRE, and infrastructure teams to ensure systems are reliable, performant, cost-efficient, and debuggable at scale.

Key Roles & Responsibilities :

Observability Strategy & Architecture :

- Define and drive the organization's observability strategy, standards, and roadmap.

- Design comprehensive telemetry architectures for distributed and microservices-based systems.

- Establish best practices and guidelines for metrics, logging, and tracing.

- Evaluate, select, and standardize observability tools and platforms.

- Create reference architectures for instrumentation across multiple technology stacks.

- Partner with engineering teams to define SLIs, SLOs, and error budgets.

Instrumentation & Telemetry Engineering :

- Instrument applications and services with metrics, logs, and distributed traces.

- Implement end-to-end distributed tracing across microservices architectures.

- Deploy and configure telemetry agents, sidecars, and collectors.

- Implement OpenTelemetry standards, SDKs, and Collector pipelines.

- Build custom instrumentation libraries and SDKs across multiple languages.

- Create auto-instrumentation frameworks to reduce developer effort.

- Ensure semantic consistency and data quality across all telemetry signals.

Observability Platforms & Tooling :

- Deploy, manage, and optimize metrics platforms such as :

  - Prometheus, Grafana, Datadog, New Relic, Dynatrace, AppDynamics

  - Cloud-native platforms (AWS CloudWatch, Azure Monitor, GCP Monitoring)

  - Long-term storage solutions (Thanos, Mimir, VictoriaMetrics)

- Deploy and manage logging platforms including :

  - ELK Stack, Splunk, Loki, Fluentd, Sumo Logic

  - Cloud-native logging (CloudWatch Logs, Azure Log Analytics, GCP Logging)

- Deploy and manage distributed tracing tools such as :

  - Jaeger, Zipkin, Datadog APM, New Relic APM, Dynatrace, Lightstep

- Optimize observability platforms for performance, scalability, and cost.

Dashboards, Alerting & Incident Enablement :

- Design and build comprehensive dashboards :

  - Service-level dashboards with Golden Signals (latency, traffic, errors, saturation)

  - Executive dashboards for SLO compliance and business KPIs

  - Real-time operational and on-call dashboards

- Design intelligent alerting strategies to reduce alert fatigue.

- Implement multi-signal alert correlation, anomaly detection, and adaptive thresholds.

- Integrate with incident management tools (PagerDuty, Opsgenie, VictorOps).

- Configure alert routing, escalation policies, suppression, and maintenance windows.

- Enable self-healing automation triggered by alerts.

Logging & Trace Engineering :

- Design and implement centralized logging architectures.

- Build log ingestion, parsing, enrichment, and normalization pipelines.

- Define structured logging standards (JSON, key-value).

- Implement log sampling and retention strategies for high-volume systems.

- Create log-based metrics and alerts.

- Ensure data privacy, compliance, and retention policies are enforced.

- Implement trace sampling strategies to balance cost and visibility.

Performance Analysis & Optimization :

- Conduct deep-dive performance investigations using telemetry data.

- Identify bottlenecks, latency contributors, and error propagation paths.

- Build capacity planning models using observability data.

- Analyze resource utilization (CPU, memory, disk, network).

- Create cost attribution and optimization insights from telemetry.

- Map service dependencies and request flows across distributed systems.

Telemetry Pipelines & Cost Optimization :

- Build and optimize telemetry data pipelines (filtering, routing, transformation).

- Manage cardinality, storage costs, and data volumes effectively.

- Implement sampling, aggregation, and retention strategies.

- Ensure high data quality and completeness.

- Build export pipelines for analytics, compliance, and archival use cases.

Enablement, Automation & DevEx :

- Build self-service observability frameworks and tooling.

- Integrate observability into CI/CD pipelines (Observability-as-Code).

- Automate dashboard and alert provisioning.

- Develop APIs, plugins, and extensions for observability platforms.

- Create documentation, tutorials, templates, and best-practice guides.

- Conduct training sessions and provide observability consulting to teams.

- Participate in code reviews to validate instrumentation quality.

Required Skills & Experience :

Core Observability Expertise :

- Strong understanding of metrics types (counters, gauges, histograms, summaries).

- Deep expertise in PromQL and time-series data modeling.

- Strong knowledge of logging pipelines, parsing (Grok/Regex/JSON), and Splunk Search Processing Language (SPL).

- Deep understanding of distributed tracing concepts, context propagation, and sampling.

- Hands-on experience with OpenTelemetry specifications and implementations.

Programming & Platforms :

- Strong proficiency in Python, Go, Java, or Node.js.

- Ability to instrument and read code across multiple languages.

- Experience building custom instrumentation libraries and APIs.

- Familiarity with Kafka, Fluentd, Logstash, or similar data pipelines.

- Experience with AWS, Azure, or GCP environments.

- Strong understanding of Kubernetes and container observability.

Professional Experience :

- 6 to 9 years of experience in observability, SRE, platform engineering, or performance engineering.

- Proven experience building observability platforms at scale.

- Experience managing high-cardinality data and observability cost optimization.

- Strong troubleshooting background in complex distributed systems.

Soft Skills & Mindset :

- Strong analytical and problem-solving skills.

- Ability to explain complex observability concepts to engineers and leadership.

- Empathy for developer experience and operational pain points.

- Strong documentation, training, and enablement capabilities.

- High attention to detail and data quality.

- Curiosity-driven mindset with passion for system internals and reliability.

Certifications (Preferred) :

- Prometheus Certified Associate (PCA)

- Datadog / New Relic Observability Certifications

- AWS / Azure / GCP Observability Certifications

- Certified Kubernetes Administrator (CKA)

- OpenTelemetry certifications (when available)

Nice-to-Have Experience :

- Real User Monitoring (RUM) and frontend observability.

- Continuous profiling (Pyroscope, Google Cloud Profiler).

- Chaos engineering and observability correlation.

- ML-driven anomaly detection and predictive analytics.

- FinOps and observability cost optimization.

- eBPF-based observability tools (Pixie, Cilium).

- Contributions to open-source observability projects.

Education : Bachelor's degree in Computer Science, Engineering, or a related field.

The job is for:

Women candidates preferred

Differently-abled candidates preferred

Ex-defence personnel preferred

Women returning to the workforce