We are seeking an experienced Site Reliability Engineer (SRE) to join our high-performance infrastructure and operations team. As an SRE, you will be responsible for ensuring the availability, scalability, performance, and reliability of our production systems. You will work closely with engineering, product, and platform teams to build robust monitoring systems, automate operational tasks, and drive incident management and root cause analysis.

Key Responsibilities :

1. Incident & Alert Management :

- Monitor production systems and handle alerts to ensure minimal service disruption.

- Act as the first point of escalation for production incidents and critical system issues.

- Drive rapid resolution of major incidents to restore services as quickly as possible.

- Coordinate with cross-functional teams, vendors, and service providers to resolve outstanding incidents, following defined escalation procedures.

2. Monitoring & Observability :

- Design, implement, and maintain application and infrastructure monitoring using tools such as OpenSearch, ELK, Grafana, Prometheus, PagerDuty, Pingdom, Datadog, and Splunk.

- Ensure robust logging, metrics, and distributed tracing practices are in place to provide full observability into system performance.

- Regularly review and refine monitoring configurations to align with evolving system needs.

3. Automation & Reliability Engineering :

- Collaborate with product and platform engineering teams to develop Standard Operating Procedures (SOPs) for operational excellence.

- Automate deployment, scaling, and operational tasks using tools like Ansible, Kubernetes, and CI/CD frameworks.

- Build proofs of concept (POCs) for new tools and technologies, with the aim of integrating them into production environments.

4. Root Cause Analysis & Continuous Improvement :

- Perform detailed root cause analysis for service-impacting events.

- Identify trends and recurring issues to proactively improve system stability.

- Contribute to post-incident reviews and recommend preventive measures.

5. Collaboration & Knowledge Sharing :

- Work in a collaborative, Agile environment, actively participating in sprint planning, retrospectives, and technical discussions.

- Seek expertise from domain specialists and share knowledge with peers.

- Provide technical guidance to junior engineers.

Requirements & Qualifications :

Technical Skills :

- Monitoring & Observability Tools : Hands-on experience with OpenSearch, ELK, Grafana, Prometheus, PagerDuty, Pingdom, Datadog, and Splunk.

- Programming/Scripting : Proficiency in at least two of the following: Python, Shell, Ansible (Golang is a plus).

- Cloud & Infrastructure : Strong experience with AWS services, containerized applications, Kubernetes orchestration, and infrastructure automation.

- CI/CD & Developer Tools : Experience with GitLab, Jenkins, and modern CI/CD pipelines.

- System Architecture : Understanding of distributed systems, networking fundamentals, and high-availability architecture.

Soft Skills :

- Strong problem-solving and analytical abilities.

- Excellent communication and documentation skills.

- Ability to work effectively in high-pressure situations and under tight deadlines.

- Strong organizational skills with the ability to manage multiple priorities.

Preferred Qualifications :

- Experience with large-scale, mission-critical production systems.

- Familiarity with Agile methodologies and DevOps practices.

- Prior experience driving POCs for production-scale technology adoption.