Job Description

Key Responsibilities:

- Design, implement, and maintain comprehensive monitoring, logging, and alerting solutions across our production and other environments

- Lead incident response and post-mortem analyses, establishing best practices for problem resolution

- Design and implement disaster recovery strategies and ensure they are tested regularly

- Collaborate with development teams and other stakeholders to implement SLAs for critical services

- Optimize cloud infrastructure for performance, reliability, and cost efficiency

- Develop and maintain automation for deployment, scaling, and recovery procedures

- Run and maintain our infrastructure with cookbooks using Terraform, GitLab CI/CD, and Kubernetes

- Respond to on-call incidents


Required Skills & Experience:

- 4+ years of experience in SRE, DevOps, or similar roles

- Ability to work in a variety of languages and tools: Shell, Python, Chef (recipes, cookbooks), and Ansible (basic syntax, tasks, playbooks)

- Strong experience with AWS services: Cognito, EC2, EKS, RDS, CloudWatch, etc.

- Proficient in Kubernetes administration and operations in production environments

- Experience with infrastructure as code using tools like Terraform or CloudFormation

- Strong scripting skills with Python, Bash, or similar languages

- Deep understanding of observability tools such as Prometheus, Grafana, ELK stack, and distributed tracing systems

- Provisioning and setup of metrics and alerts in Prometheus and Grafana, and of logs and queries for answering general questions

- Experience with PostgreSQL or similar database systems, including replication strategies

- Knowledge of network protocols, load balancing, and security best practices

- Experience with CI/CD pipelines and GitOps workflows

