Posted on: 11/11/2025
Key Responsibilities:
- Design, implement, and maintain comprehensive monitoring, logging, and alerting solutions across our production and other environments
- Lead incident response and post-mortem analyses, establishing best practices for problem resolution
- Design and implement disaster recovery strategies and ensure regular testing
- Collaborate with development teams and other stakeholders to implement SLAs for critical services
- Optimize cloud infrastructure for performance, reliability, and cost efficiency
- Develop and maintain automation for deployment, scaling, and recovery procedures
- Run and maintain our infrastructure with cookbooks using Terraform, GitLab CI/CD, and Kubernetes
- Respond to on-call incidents
Required Skills & Experience:
- 4+ years of experience in SRE, DevOps, or similar roles
- Working knowledge of a variety of languages and tools: Shell, Python, Chef (recipes, cookbooks), and Ansible (basic syntax, tasks, playbooks)
- Strong experience with AWS services: Cognito, EC2, EKS, RDS, CloudWatch, etc.
- Proficient in Kubernetes administration and operations in production environments
- Experience with infrastructure as code using tools like Terraform or CloudFormation
- Strong scripting skills with Python, Bash, or similar languages
- Deep understanding of observability tools such as Prometheus, Grafana, ELK stack, and distributed tracing systems
- Experience provisioning and setting up metrics and alerts in Prometheus and Grafana, and setting up logs and queries to answer general questions
- Experience with PostgreSQL or similar database systems, including replication strategies
- Knowledge of network protocols, load balancing, and security best practices
- Experience with CI/CD pipelines and GitOps workflows
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1573154