Posted on: 23/04/2026
Description :
Responsibilities :
- System Design and Architecture : Collaborate with cross-functional teams to design and implement scalable, reliable, and efficient systems. Participate in system architecture discussions, provide recommendations, and drive improvements to meet business objectives.
- System Monitoring and Performance : Develop and implement robust monitoring systems to proactively identify and resolve performance bottlenecks, service disruptions, and other issues affecting system reliability. Continuously monitor system performance metrics and optimize resource utilization.
- Incident Response and Troubleshooting : Respond to and resolve production incidents promptly, applying strong troubleshooting skills and collaborating with other teams. Conduct root cause analysis and implement corrective actions to prevent recurrence.
- Automation and Tooling : Develop automation tools and scripts to streamline deployment, configuration, and monitoring processes. Implement and maintain CI/CD pipelines to ensure efficient and reliable software delivery.
- Capacity Planning and Scalability : Work closely with development teams to forecast system capacity requirements and plan for scalability. Conduct performance testing and capacity analysis to ensure systems can handle increased loads and peak traffic.
- Security and Compliance : Implement and maintain security measures and best practices to protect our infrastructure and data. Stay up to date with the latest security vulnerabilities and apply necessary patches and upgrades.
- Collaboration and Documentation : Foster strong collaboration with cross-functional teams, including developers, operations, and QA. Document system configurations, processes, and procedures to facilitate knowledge sharing and ensure a smooth handover of responsibilities.
Qualifications and Skills :
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- Strong experience in a Site Reliability Engineering role or a similar capacity, managing large-scale, highly available production systems.
- Proficiency in programming and scripting languages (e.g., Python, Bash, Ruby).
- Deep understanding of Linux/Unix systems and networking concepts.
- Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
- Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible) and configuration management tools (e.g., Chef, Puppet).
- Knowledge of monitoring and logging tools (e.g., Prometheus, ELK stack) and incident management systems (e.g., PagerDuty).
- Strong problem-solving and analytical skills, with the ability to quickly identify and resolve complex technical issues.
- Excellent communication and collaboration skills, with the ability to work effectively in a team-oriented environment.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1630755