Description :

ROLE : Senior Engineer/Lead - Scale & Performance, AI-Powered Observability Platform

LOCATION : Bangalore (Onsite)

EXPERIENCE : 6-12 years

FUNCTION : Engineering - Platform / DevOps / SRE

About the Role :

We are looking for a Scale & Performance Engineer to drive performance, scalability, and reliability initiatives for our Observability Platform, which ingests, processes, and visualizes large volumes of telemetry data (metrics, logs, and traces).

You will work closely with platform, SRE, frontend, and backend teams to benchmark system performance, resolve bottlenecks, and ensure the platform scales efficiently to meet growing customer and data demands.

This role is ideal for engineers who are passionate about high-performance distributed systems, Kubernetes, and observability stack performance (Prometheus, OpenTelemetry, Loki, Grafana, etc.).

Key Responsibilities :

- Own performance and scalability benchmarking for the Observability Platform components: ingestion pipeline, data storage, and query services.

- Design and execute load, stress, soak, and capacity tests for microservices, agents, and ingestion layers.

- Identify performance bottlenecks across infrastructure (CPU/memory/IO) and application layers (API latency, throughput, GC behavior, etc.).

- Develop and maintain performance test frameworks and benchmarking environments (preferably Kubernetes-based).

- Partner with DevOps and SRE teams to tune system configurations (Kubernetes, Postgres/TimescaleDB, ClickHouse, Kafka, etc.) for scale.

- Instrument services with OpenTelemetry metrics and traces to measure system health and latency distribution (p50/p95/p99).

- Contribute to capacity planning, horizontal/vertical scaling strategy, and resource optimization.

- Analyze production incidents for scale-related root causes and drive permanent fixes.

- Collaborate with engineering teams to design scalable architecture patterns and define SLIs/SLOs for system performance.

- Document performance baselines, tuning guides, and scalability best practices.

Skills & Qualifications :

- Strong background in performance engineering for large-scale distributed systems or SaaS platforms.

- Expertise in Kubernetes, container runtimes (containerd/Docker), and resource profiling in containerized environments.

- Solid knowledge of Linux internals, CPU/memory profiling, and network stack tuning.

- Hands-on experience with observability tools such as Prometheus, Grafana, OpenTelemetry, Jaeger, Loki, and Tempo.

- Familiarity with datastores used in observability platforms, e.g., ClickHouse, PostgreSQL/TimescaleDB, Elasticsearch, or Cassandra.

- Experience building and executing performance benchmarking frameworks using tools like k6, Locust, JMeter, or custom Golang/Python scripts.

- Ability to interpret metrics (CPU usage, memory, GC, latency) and correlate them across systems.

- Strong analytical and problem-solving skills with an automation-first mindset.

Good-to-Have :

- Experience with agent benchmarking (OpenTelemetry Collector, custom data shippers).

- Exposure to streaming systems like Kafka, NATS, or Pulsar.

- Experience with CI/CD pipelines for performance testing and regression tracking.

- Understanding of cost optimization and capacity forecasting in cloud environments (AWS/GCP/Azure).

- Proficiency in Go, Python, or Bash scripting for automation and data analysis.

Posted By

Gautham Kumar

Senior Manager - Talent Acquisition at Vunet Systems

Posted in

Platform Engineering / SAP/Oracle

Functional Area

DevOps / Cloud

Job Code

1565083
