Posted on: 21/04/2026
Job Description:
As a Senior DevOps Engineer (SRE) at Wits Innovation Lab, you will play a crucial role in ensuring the reliability, scalability, and performance of our cloud-based infrastructure and applications. You will collaborate closely with development, operations, and security teams to automate processes, implement monitoring solutions, and proactively identify and resolve potential issues. Your work will directly impact the stability and efficiency of our platforms, enabling us to deliver seamless services to our clients in the financial sector.
Key Responsibilities:
- Design, implement, and maintain robust monitoring and alerting systems to proactively identify and address performance bottlenecks and system failures, ensuring high availability for critical financial applications.
- Automate infrastructure provisioning, configuration management, and application deployments using Infrastructure as Code (IaC) principles, reducing manual effort and improving deployment speed and consistency.
- Collaborate with development teams to integrate security best practices into the CI/CD pipeline, ensuring secure and compliant software releases.
- Troubleshoot and resolve complex system issues, performing root cause analysis and implementing preventive measures to reduce recurrence and improve overall system stability.
- Develop and maintain comprehensive documentation of infrastructure, processes, and procedures, enabling knowledge sharing and efficient onboarding of new team members.
- Participate in on-call rotations to provide timely support for production systems, ensuring minimal disruption to business operations.
- Optimize system performance and resource utilization through capacity planning, performance tuning, and infrastructure optimization, reducing operational costs and improving efficiency.
Required Skillset:
- Demonstrated expertise in Linux system administration, including scripting (e.g., Bash, Python) and automation tools (e.g., Ansible, Chef, Puppet).
- Proven ability to design, implement, and manage cloud infrastructure on AWS, including services such as EC2, S3, VPC, IAM, and CloudWatch.
- Strong understanding of networking principles and protocols, including TCP/IP, DNS, routing, and load balancing.
- Experience with monitoring tools such as Prometheus, Grafana, the ELK stack, or similar technologies.
- Solid understanding of Site Reliability Engineering (SRE) principles and practices.
- Excellent communication and collaboration skills, with the ability to explain technical concepts to both technical and non-technical audiences.
- Ability to work independently and as part of a team in a fast-paced, dynamic environment.
- Bachelor's degree in Computer Science or a related field.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1629889