Posted on: 12/01/2026
Description :
Key Responsibilities :
Site Reliability Engineering (SRE) :
- Design, implement, and maintain highly available, scalable, and fault-tolerant systems.
- Define and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
- Develop and maintain monitoring, alerting, logging, and observability solutions.
- Automate operational tasks to reduce toil using scripts, tools, and CI/CD pipelines.
- Participate in capacity planning, performance tuning, and system optimization.
- Lead post-incident reviews and root cause analysis (RCA), and drive corrective and preventive actions.
- Champion reliability-first design and continuous improvement initiatives.
Production Support & Incident Management :
- Provide L2/L3 production support for business-critical applications and platforms.
- Act as an escalation point during major incidents, outages, and performance degradations.
- Lead incident response, troubleshooting, and recovery in a 24x7 production environment.
- Coordinate with cross-functional teams during incidents to ensure rapid resolution.
- Maintain runbooks, SOPs, and knowledge base documentation.
- Analyze recurring issues and implement long-term fixes to prevent recurrence.
DevOps & Cloud Operations :
- Manage and support cloud infrastructure (AWS / Azure / GCP).
- Work with containerization and orchestration platforms such as Docker and Kubernetes.
- Support and enhance CI/CD pipelines for reliable and repeatable deployments.
- Implement Infrastructure as Code (IaC) using tools like Terraform, CloudFormation, or ARM templates.
- Ensure backup, disaster recovery, and business continuity strategies are in place.
Security, Compliance & Governance :
- Implement and enforce security best practices across infrastructure and applications.
- Support vulnerability management, patching, and access control.
- Ensure systems comply with organizational and regulatory standards.
- Participate in audits and compliance-related activities as required.
Required Skills & Qualifications :
Technical Skills :
- Strong experience in Site Reliability Engineering, Production Support, or DevOps roles.
- Proficiency in Linux/Unix system administration.
- Strong scripting skills in Python, Bash/shell, or PowerShell.
- Hands-on experience with monitoring tools (Prometheus, Grafana, ELK, Splunk, Datadog, New Relic).
- Experience with incident management tools (PagerDuty, Opsgenie, ServiceNow).
- Solid understanding of networking concepts (TCP/IP, DNS, Load Balancers).
- Experience with cloud platforms (AWS / Azure / GCP).
- Familiarity with databases (SQL/NoSQL) and caching systems.
Soft Skills :
- Strong problem-solving and analytical skills.
- Excellent communication and stakeholder management abilities.
- Ability to perform under pressure in high-severity production incidents.
- Mentorship mindset and ability to guide junior engineers.
- Strong ownership and accountability for system reliability.
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1600227