Posted on: 27/02/2026
Description :
Role :
We are seeking a proactive and detail-oriented Reliability Operations Engineer to support the uptime, monitoring, and operational stability of our critical internet-facing web applications and transaction workflows.
This role is responsible for real-time production monitoring, early anomaly detection, and first-level incident response to ensure seamless application performance and near 100% availability.
The ideal candidate will have hands-on experience with monitoring tools (ELK Stack, Grafana), strong understanding of API workflows, and the ability to analyze logs and system metrics to identify and escalate issues promptly.
Responsibilities :
Production Monitoring and Uptime Management :
- Perform 24x7 monitoring of critical internet-facing web applications and system workflows
- Ensure high system availability and proactively detect service disruptions
- Monitor API performance, latency, error rates, and transaction flows
- Track server health parameters including CPU, memory, disk utilization, and network stability
- Monitor Elasticsearch cluster health, index growth, and log ingestion pipelines
Incident Detection & Escalation :
- Identify incidents (P1 / P2 / P3) and raise timely alerts to application and infrastructure teams
- Perform first-level triage to determine whether issues are application, database, infrastructure, or external dependency related
- Maintain incident logs and support Root Cause Analysis (RCA) documentation
- Follow the defined escalation matrix and SLA guidelines
Application Workflow Validation :
- Understand end-to-end application workflows and business transactions
- Review API request/response payloads and validate error conditions
- Distinguish between 4xx and 5xx errors and identify recurring failure patterns
- Support validation of integration points with external systems and APIs
Observability & Dashboard Management :
- Create and maintain real-time dashboards using Kibana and Grafana
- Monitor Filebeat and Logstash log ingestion processes
- Configure alerts and threshold-based monitoring for proactive incident detection
- Analyze system logs and application metrics to identify trends and anomalies
Continuous Monitoring Improvements :
- Suggest enhancements to monitoring metrics and dashboards
- Reduce false positives and improve alert quality
- Contribute to automation of monitoring tasks where feasible
Experience :
- 2-5 years of experience in Production Support, NOC, SRE Support, or Application Monitoring roles
- Hands-on experience working in 24x7 monitoring environments
- Experience supporting high-availability internet-based applications preferred
Skills :
Application and Workflow Understanding :
- Strong understanding of HTTP/REST APIs and JSON payload structures
- Knowledge of API status codes and error classification
- Ability to trace application workflows across services
Monitoring and Observability :
- Hands-on experience with ELK Stack (Elasticsearch, Logstash, Filebeat, Kibana)
- Experience monitoring Elasticsearch cluster health and managing index size
- Experience creating dashboards in Kibana and/or Grafana
- Ability to configure monitoring alerts and thresholds
Infrastructure and Systems Knowledge :
- Basic Linux command-line proficiency (top, df -h, grep, tail, curl, netstat, etc.)
- Understanding of server health metrics and system performance indicators
- Basic knowledge of databases and log analysis
Incident Management :
- Ability to quickly assess and classify production issues
- Strong analytical and troubleshooting skills
- Familiarity with ticketing systems such as Jira or ServiceNow
Soft Skills :
- Strong analytical and problem-solving capabilities
- Ability to remain calm under pressure during high-severity incidents
- Clear verbal and written communication skills
- Strong attention to detail
- Willingness to work rotational shifts including nights and weekends
Qualifications :
- Bachelor's degree in Computer Science, Information Technology, or related field
- Certifications in Linux, Cloud, or Monitoring tools are an added advantage
Posted in: DevOps / SRE
Functional Area: Other Software Development
Job Code: 1616829