Description:

- Design, implement, and maintain monitoring, logging, and alerting systems across all environments

- Lead incident response and root cause analysis, and drive post-mortem improvements

- Develop disaster recovery strategies and run regular DR drills

- Work with engineering teams to define and maintain SLAs/SLOs

- Optimize cloud infrastructure for reliability, performance, and cost

- Build automation for deployments, scaling, and recovery

- Manage infrastructure as code with Terraform, alongside GitLab CI/CD pipelines and Kubernetes

- Participate in on-call rotations and respond to incidents

Required Skills & Experience:

- 4+ years in SRE, DevOps, or similar roles

- Strong scripting skills: Python and Bash/shell

- Experience with Chef (cookbooks/recipes) and Ansible (tasks/playbooks)

- Hands-on experience with AWS services (Cognito, EC2, EKS, RDS, CloudWatch, etc.)

- Strong Kubernetes administration experience in production

- Proficiency in Terraform or CloudFormation

- Excellent understanding of observability tools: Prometheus, Grafana, ELK, and distributed tracing

- Experience provisioning metrics, dashboards, queries, and alert rules

- Knowledge of PostgreSQL (including replication)

- Strong understanding of networking, load balancing, and security best practices

- Experience working with CI/CD and GitOps workflows