Posted on: 06/04/2026
Company Overview:
Wits Innovation Lab is a rapidly growing technology company that builds AI-powered solutions for the financial services industry. We develop and deploy algorithms and platforms that enable our clients to optimize trading strategies, manage risk effectively, and enhance customer experiences. Leading financial institutions worldwide use our solutions, which process billions of transactions daily.
Role Overview:
As a Site Reliability Engineer (SRE) at Wits Innovation Lab, you will be instrumental in ensuring the reliability, availability, and performance of our critical AI-driven financial platforms. You will collaborate closely with development, operations, and security teams to design, implement, and maintain robust infrastructure and automation solutions. Your expertise will directly impact the stability and scalability of our services, enabling us to deliver exceptional value to our clients.
Key Responsibilities:
- Design and implement scalable and resilient infrastructure solutions on AWS to support our AI platforms.
- Automate infrastructure provisioning, configuration management, and application deployments using tools like Ansible, Terraform, and Kubernetes to improve efficiency and reduce manual effort.
- Monitor system performance, identify bottlenecks, and implement proactive measures to prevent outages and ensure optimal performance for our financial applications.
- Develop and maintain CI/CD pipelines to enable rapid and reliable software releases, ensuring continuous delivery of new features and bug fixes.
- Troubleshoot and resolve complex production issues, collaborating with cross-functional teams to minimize downtime and restore services quickly.
- Participate in on-call rotations to provide 24/7 support for critical systems, ensuring business continuity and minimizing the impact on our clients.
- Implement and maintain security best practices to protect our infrastructure and data, ensuring compliance with industry regulations and standards.
- Contribute to the development of comprehensive documentation and knowledge base articles to facilitate knowledge sharing and improve team efficiency.
Required Skillset:
- Proven ability to design, implement, and manage infrastructure on AWS, including EC2, S3, VPC, and other relevant services.
- Deep understanding of Linux system administration, including performance tuning, security hardening, and troubleshooting.
- Extensive experience with configuration management tools like Ansible, Chef, or Puppet.
- Strong proficiency in scripting languages such as Python, Bash, or Go.
- Solid understanding of containerization technologies like Docker and orchestration platforms like Kubernetes.
- Hands-on experience with CI/CD pipelines and related tools like Jenkins, GitLab CI, or CircleCI.
- Demonstrated ability to troubleshoot complex production issues and implement effective solutions.
- Excellent communication and collaboration skills, with the ability to work effectively in a fast-paced, agile environment.
- Experience with monitoring tools like Prometheus, Grafana, or the ELK stack.
- Familiarity with Redis or other in-memory data stores.
- Bachelor's degree in Computer Science or a related field.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1626313