- Proven experience as a Site Reliability Engineer, Senior DevOps Engineer, or in a similar role.

- 5 to 7 years of relevant experience, including at least 2 years with Microsoft Azure. Experience with AWS and GCP is good to have.

- Experience setting up and managing OpenTelemetry (OTel) with Loki, Tempo, Prometheus, Grafana, Alloy, etc.

- Experience creating CI/CD pipelines using Azure DevOps, Jenkins, Spinnaker, Terraform, Ansible, Docker, Kubernetes, etc.

Key Responsibilities:

Monitoring and Incident Response:

- Proactively monitor system performance and availability using OpenTelemetry.

- Manage incidents and troubleshoot issues in real time.

- Implement and improve incident management processes.

Automation and Efficiency:

- Develop and maintain automation scripts and tools to enhance operational efficiency.

- Automate repetitive tasks to reduce manual intervention.

- Ensure that continuous integration and continuous delivery (CI/CD) pipelines are robust and efficient.

Performance and Capacity Management:

- Conduct performance tuning, optimization, and capacity planning.

- Perform root cause analysis and conduct post-mortem discussions for incidents.

- Implement solutions to improve system reliability and performance.

Collaboration and Communication:

- Work closely with development teams to ensure systems are designed with reliability and scalability in mind.

- Communicate effectively with stakeholders to provide updates and insights on system health.