Equisoft - Senior Site Reliability Support Engineer

Equisoft
Hyderabad
6 - 9 Years

Posted on: 12/01/2026

Job Description

Key Responsibilities :


Site Reliability Engineering (SRE) :


- Design, implement, and maintain highly available, scalable, and fault-tolerant systems.


- Define and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).


- Develop and maintain monitoring, alerting, logging, and observability solutions.


- Automate operational tasks to reduce toil using scripts, tools, and CI/CD pipelines.


- Participate in capacity planning, performance tuning, and system optimization.


- Lead post-incident reviews and root cause analysis (RCA), and drive corrective and preventive actions.


- Champion reliability-first design and continuous improvement initiatives.


Production Support & Incident Management :


- Provide L2/L3 production support for business-critical applications and platforms.


- Act as an escalation point during major incidents, outages, and performance degradations.


- Lead incident response, troubleshooting, and recovery in a 24x7 production environment.


- Coordinate with cross-functional teams during incidents to ensure rapid resolution.


- Maintain runbooks, SOPs, and knowledge base documentation.


- Analyze recurring issues and implement long-term fixes to prevent recurrence.


DevOps & Cloud Operations :


- Manage and support cloud infrastructure (AWS / Azure / GCP).


- Work with containerization and orchestration platforms such as Docker and Kubernetes.


- Support and enhance CI/CD pipelines for reliable and repeatable deployments.


- Implement Infrastructure as Code (IaC) using tools like Terraform, CloudFormation, or ARM templates.


- Ensure backup, disaster recovery, and business continuity strategies are in place.


Security, Compliance & Governance :


- Implement and enforce security best practices across infrastructure and applications.


- Support vulnerability management, patching, and access control.


- Ensure systems comply with organizational and regulatory standards.


- Participate in audits and compliance-related activities as required.


Required Skills & Qualifications :


Technical Skills :


- Strong experience in Site Reliability Engineering, Production Support, or DevOps roles.


- Proficiency in Linux/Unix system administration.


- Strong scripting skills in Python, Bash or other Unix shells, or PowerShell.


- Hands-on experience with monitoring tools (Prometheus, Grafana, ELK, Splunk, Datadog, New Relic).


- Experience with incident management tools (PagerDuty, Opsgenie, ServiceNow).


- Solid understanding of networking concepts (TCP/IP, DNS, Load Balancers).


- Experience with cloud platforms (AWS / Azure / GCP).


- Familiarity with databases (SQL/NoSQL) and caching systems.


Soft Skills :


- Strong problem-solving and analytical skills.


- Excellent communication and stakeholder management abilities.


- Ability to perform under pressure in high-severity production incidents.


- Mentorship mindset and ability to guide junior engineers.


- Strong ownership and accountability for system reliability.

