Posted on: 27/10/2025
Description :
ROLE : Senior Engineer/Lead - Scale & Performance, AI-Powered Observability Platform
LOCATION : Bangalore (Onsite)
EXPERIENCE : 6-12 years
FUNCTION : Engineering Platform / DevOps / SRE
About the Role :
- Design and execute load, stress, soak, and capacity tests for microservices, agents, and ingestion layers.
- Identify performance bottlenecks across infrastructure (CPU/memory/IO) and application layers (API latency, throughput, GC behavior, etc.).
- Develop and maintain performance test frameworks and benchmarking environments (preferably Kubernetes-based).
- Partner with DevOps and SRE teams to tune system configurations (Kubernetes, Postgres/TimescaleDB, ClickHouse, Kafka, etc.) for scale.
- Instrument services with OpenTelemetry metrics and traces to measure system health and latency distribution (p50/p95/p99).
- Contribute to capacity planning, horizontal/vertical scaling strategy, and resource optimization.
- Analyze production incidents for scale-related root causes and drive permanent fixes.
- Collaborate with engineering teams to design scalable architecture patterns and define SLIs/SLOs for system performance.
- Document performance baselines, tuning guides, and scalability best practices.
Skills & Qualifications :
- Strong background in performance engineering for large-scale distributed systems or SaaS platforms.
- Expertise in Kubernetes, container runtimes (containerd/Docker), and resource profiling in containerized environments.
- Solid knowledge of Linux internals, CPU/memory profiling, and network stack tuning.
- Hands-on experience with observability tools: Prometheus, Grafana, OpenTelemetry, Jaeger, Loki, Tempo, etc.
- Familiarity with datastores used in observability platforms, e.g., ClickHouse, PostgreSQL/TimescaleDB, Elasticsearch, or Cassandra.
- Experience building and executing performance benchmarking frameworks using tools like k6, Locust, JMeter, or custom Golang/Python scripts.
- Ability to interpret metrics (CPU usage, memory, GC, latency) and correlate across systems.
- Strong analytical and problem-solving skills with an automation-first mindset.
Good-to-Have :
- Experience with agent benchmarking (OpenTelemetry Collector, custom data shippers).
- Exposure to streaming systems like Kafka, NATS, or Pulsar.
- Experience with CI/CD pipelines for performance testing and regression tracking.
- Understanding of cost optimization and capacity forecasting in cloud environments (AWS/GCP/Azure).
- Proficiency in Go, Python, or Bash scripting for automation and data analysis.
Posted in : Platform Engineering / SAP/Oracle
Functional Area : DevOps / Cloud
Job Code : 1565083