The Sr. SRE will lead the implementation and management of the observability stack across cloud infrastructure, ensuring reliability, scalability, performance, and cost-efficiency. The role spans Kubernetes, AWS, automation, incident response, and platform reliability.

Key Responsibilities:

- Build and maintain monitoring, logging, and alerting solutions.

- Lead incident response & post-mortem best practices.

- Design & test disaster recovery strategies.

- Collaborate with dev teams to define SLAs.

- Optimize cloud infra (AWS) for cost and performance.

- Automate deployments, scaling & recovery using Terraform, GitLab CI/CD, and Kubernetes.

- Handle on-call support.

Required Skills & Experience:

- 4+ years in SRE/DevOps.

- Proficiency in Shell, Chef, Ansible, Python.

- Strong AWS services experience (EC2, EKS, RDS, CloudWatch, Cognito, etc.).

- Kubernetes administration in production.

- IaC: Terraform / CloudFormation.

- Observability tools: Prometheus, Grafana, ELK, tracing systems.

- PostgreSQL (including replication).

- Networking, load balancing, security best practices.

- CI/CD pipelines & GitOps workflows.

- Ability to handle high-pressure incidents.

- Exposure to Splunk, Datadog, or Dynatrace is a plus.

Preferred:

- AWS Certified Solutions Architect / DevOps Engineer.

- Certified Kubernetes Administrator (CKA).

Posted By

Kanan Uppal

IT Recruiter at Wits Innovation Lab

Posted in

DevOps / SRE

Functional Area

DevOps / Cloud

Job Code

1532947