Posted on: 18/11/2025
Description:
- Design, implement, and maintain monitoring, logging, and alerting systems across all environments
- Lead incident response, root cause analysis, and drive post-mortem improvements
- Develop disaster recovery strategies and run regular DR drills
- Work with engineering teams to define and maintain SLAs/SLOs
- Optimize cloud infrastructure for reliability, performance, and cost
- Build automation for deployments, scaling, and recovery
- Manage infrastructure as code using Terraform, GitLab CI/CD, and Kubernetes
- Participate in on-call rotations and respond to incidents
Required Skills & Experience:
- 4+ years in SRE, DevOps, or similar roles
- Strong scripting skills: Python and Bash/shell
- Experience with Chef (cookbooks/recipes) and Ansible (tasks/playbooks)
- Hands-on experience with AWS services (Cognito, EC2, EKS, RDS, CloudWatch, etc.)
- Strong Kubernetes administration experience in production
- Proficiency in Terraform or CloudFormation
- Excellent understanding of observability tooling: Prometheus, Grafana, ELK, and tracing
- Experience provisioning metrics, dashboards, queries, and alert rules
- Knowledge of PostgreSQL (including replication)
- Strong understanding of networking, load balancing & security best practices
- Experience working with CI/CD and GitOps workflows
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1576165