Key Responsibilities:

- Design, implement, and maintain comprehensive monitoring, logging, and alerting solutions across our production and other environments

- Lead incident response and post-mortem analyses, establishing best practices for problem resolution

- Design and implement disaster recovery strategies and ensure regular testing

- Collaborate with development teams and other stakeholders to implement SLAs for critical services

- Optimize cloud infrastructure for performance, reliability, and cost efficiency

- Develop and maintain automation for deployment, scaling, and recovery procedures

- Run and maintain our infrastructure using Chef cookbooks, Terraform, GitLab CI/CD, and Kubernetes

- Respond to on-call incidents

Required Skills & Experience:

- 4+ years of experience in SRE, DevOps, or similar roles

- Ability to work in a variety of languages and tools: Shell, Python, Chef (recipes, cookbooks), and Ansible (basic syntax, tasks, playbooks)

- Strong experience with AWS services: Cognito, EC2, EKS, RDS, CloudWatch, etc.

- Proficient in Kubernetes administration and operations in production environments

- Experience with infrastructure as code using tools like Terraform or CloudFormation

- Strong scripting skills with Python, Bash, or similar languages

- Deep understanding of observability tools such as Prometheus, Grafana, ELK stack, and distributed tracing systems

- Experience provisioning and setting up metrics and alerts in Prometheus and Grafana, and setting up logs and queries to answer general questions

- Experience with PostgreSQL or similar database systems, including replication strategies

- Knowledge of network protocols, load balancing, and security best practices

- Experience with CI/CD pipelines and GitOps workflows

Posted by

Tania

IT Recruiter at Wits Innovation Lab

Posted in

DevOps / SRE

Functional Area

Site Reliability Engineering

Job Code

1573154
