Reliability Operations Engineer - ELK Stack

Samporna People Network
2 - 5 Years
Mumbai

Posted on: 27/02/2026

Job Description

Role :


We are seeking a proactive and detail-oriented Reliability Operations Engineer to support the uptime, monitoring, and operational stability of our critical internet-facing web applications and transaction workflows.

This role is responsible for real-time production monitoring, early anomaly detection, and first-level incident response to ensure seamless application performance and near 100% availability.

The ideal candidate will have hands-on experience with monitoring tools (ELK Stack, Grafana), a strong understanding of API workflows, and the ability to analyze logs and system metrics to identify and escalate issues promptly.

Responsibilities :

Production Monitoring and Uptime Management :

- Perform 24x7 monitoring of critical internet-facing web applications and system workflows

- Ensure high system availability and proactively detect service disruptions

- Monitor API performance, latency, error rates, and transaction flows

- Track server health parameters including CPU, memory, disk utilization, and network stability

- Monitor Elasticsearch cluster health, index growth, and log ingestion pipelines

Incident Detection & Escalation :

- Identify incidents (P1 / P2 / P3) and raise timely alerts to application and infrastructure teams

- Perform first-level triage to determine whether issues are application, database, infrastructure, or external dependency related

- Maintain incident logs and support Root Cause Analysis (RCA) documentation

- Follow the defined escalation matrix and SLA guidelines

Application Workflow Validation :

- Understand end-to-end application workflows and business transactions

- Review API request/response payloads and validate error conditions

- Distinguish between 4xx and 5xx errors and identify recurring failure patterns

- Support validation of integration points with external systems and APIs

Observability & Dashboard Management :

- Create and maintain real-time dashboards using Kibana and Grafana

- Monitor Filebeat and Logstash log ingestion processes

- Configure alerts and threshold-based monitoring for proactive incident detection

- Analyze system logs and application metrics to identify trends and anomalies

Continuous Monitoring Improvements :

- Suggest enhancements to monitoring metrics and dashboards

- Reduce false positives and improve alert quality

- Contribute to automation of monitoring tasks where feasible

Experience :

- 2-5 years of experience in Production Support, NOC, SRE Support, or Application Monitoring roles

- Hands-on experience working in 24x7 monitoring environments

- Experience supporting high-availability internet-based applications preferred

Skills :

Application and Workflow Understanding :

- Strong understanding of HTTP/REST APIs and JSON payload structures

- Knowledge of API status codes and error classification

- Ability to trace application workflows across services

Monitoring and Observability :


- Hands-on experience with ELK Stack (Elasticsearch, Logstash, Filebeat, Kibana)

- Experience monitoring Elasticsearch cluster health and managing index size

- Experience creating dashboards in Kibana and/or Grafana

- Ability to configure monitoring alerts and thresholds

Infrastructure and Systems Knowledge :

- Basic Linux command-line proficiency (top, df -h, grep, tail, curl, netstat, etc.)

- Understanding of server health metrics and system performance indicators

- Basic knowledge of databases and log analysis

Incident Management :

- Ability to quickly assess and classify production issues

- Strong analytical and troubleshooting skills

- Familiarity with ticketing systems such as Jira or ServiceNow

Soft Skills :


- Strong analytical and problem-solving capabilities

- Ability to remain calm under pressure during high-severity incidents

- Clear verbal and written communication skills

- Strong attention to detail

- Willingness to work rotational shifts including nights and weekends

Qualifications :

- Bachelor's degree in Computer Science, Information Technology, or a related field

- Certifications in Linux, Cloud, or Monitoring tools are an added advantage

