Posted on: 11/08/2025
Job Title : Site Reliability Engineer
Location : Gurgaon, India
Experience : 6 to 9 years
Employment Type : Full-Time
About the Role :
Key Responsibilities :
1. Incident & Alert Management :
- Act as the first point of escalation for production incidents and critical system issues.
- Drive rapid resolution of major incidents to restore services as quickly as possible.
- Coordinate with cross-functional teams, vendors, and service providers to resolve outstanding incidents, following defined escalation procedures.
2. Monitoring & Observability :
- Ensure robust logging, metrics, and distributed tracing practices are in place to provide full observability into system performance.
- Regularly review and refine monitoring configurations to align with evolving system needs.
3. Automation & Reliability Engineering :
- Automate deployment, scaling, and operational tasks using tools like Ansible, Kubernetes, and CI/CD frameworks.
- Implement proofs of concept (POCs) for new tools and technologies with the aim of integrating them into production environments.
4. Root Cause Analysis & Continuous Improvement :
- Identify trends and recurring issues to proactively improve system stability.
- Contribute to post-incident reviews and recommend preventive measures.
5. Collaboration & Knowledge Sharing :
- Seek expertise from domain specialists and share knowledge with peers.
- Provide technical guidance to junior engineers.
Requirements & Qualifications :
Technical Skills :
- Monitoring & Observability Tools : Hands-on experience with OpenSearch, ELK, Grafana, Prometheus, PagerDuty, Pingdom, Datadog, and Splunk.
- Programming/Scripting : Proficiency in at least two of the following: Python, Shell, Ansible (Golang is a plus).
- Cloud & Infrastructure : Strong experience with AWS services, containerized applications, Kubernetes orchestration, and infrastructure automation.
- CI/CD & Developer Tools : Experience with GitLab, Jenkins, and modern CI/CD pipelines.
- System Architecture : Understanding of distributed systems, networking fundamentals, and high-availability architecture.
Soft Skills :
- Excellent communication and documentation skills.
- Ability to work effectively in high-pressure situations and under tight deadlines.
- Strong organizational skills with the ability to manage multiple priorities.
Preferred Qualifications :
- Familiarity with Agile methodologies and DevOps practices.
- Prior experience driving POCs for production-scale technology adoption.
Posted in : DevOps / SRE
Functional Area : DevOps / Cloud
Job Code : 1527882