We are looking for a dedicated DevOps/Site Reliability Engineer to enhance the reliability, speed, and security of our software delivery pipeline and production infrastructure.

You will champion automation, ensure system uptime, and drive a culture of continuous improvement across development and operations teams.

Key Responsibilities:

- Manage, scale, and optimize our core production infrastructure built on Kubernetes clusters hosted on AWS or Azure.

- Develop and maintain Infrastructure as Code (IaC) using Terraform or Ansible to automate provisioning and configuration.

- Design, implement, and optimize robust CI/CD pipelines using Jenkins or GitHub Actions for automated build, test, and deployment.

- Configure and manage comprehensive monitoring, logging, and alerting systems using either the ELK Stack (Elasticsearch, Logstash, Kibana) or Prometheus with Grafana.

- Participate in incident management, perform in-depth root cause analysis (RCA), and implement preventative measures to meet stringent SLOs/SLAs.

Technical Skills Required:

- 5+ years of experience in DevOps, SRE, or a related role.

- In-depth, hands-on experience with Kubernetes and Docker.

- Expertise with at least one major IaC tool: Terraform or Ansible.

- Strong scripting skills in Python and Bash.

- Proven experience administering and troubleshooting Linux operating systems.

- Proficiency with cloud services (AWS, Azure, or GCP).