Lead Site Reliability Engineer - AWS/Azure

Posted on: 08/12/2025

Job Description

Responsibilities:



- Manage and mentor a team of SREs, assigning tasks, providing technical guidance, and fostering a culture of collaboration and continuous learning.



- Lead the implementation of reliable, scalable, and fault-tolerant systems, including infrastructure, monitoring, and alerting.



- Manage incident response processes, including root cause analysis, post-mortem reviews, and proactive mitigation strategies, to minimise system downtime and its impact.



- Develop and maintain comprehensive monitoring systems to identify potential issues early, set appropriate alerting thresholds, and optimise system performance.



- Drive automation initiatives to streamline operational tasks, including deployments, scaling, and configuration management, utilising relevant tools and technologies.



- Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under load.



- Analyse system metrics to identify bottlenecks, implement performance improvements, and optimise resource utilisation.



- Work closely with development teams, product managers, and other stakeholders to ensure alignment on reliability goals and smooth integration of new features.



- Develop and implement the SRE roadmap, including technology adoption, standards, and best practices to maintain a high level of system reliability.



Requirements:



- Strong proficiency in system administration, cloud computing (AWS, Azure), networking, distributed systems, and containerisation technologies (Docker, Kubernetes).



- Expertise in scripting languages (Python, Bash) and the ability to develop automation tools.



- A basic understanding of Java is good to have.



- Deep understanding of monitoring systems (Prometheus, Grafana), alerting configurations, and log analysis.



- Proven experience in managing critical incidents, performing root cause analysis, and coordinating response efforts.



- Excellent communication skills to convey technical concepts to both technical and non-technical audiences, and the ability to lead and motivate a team.



- Strong analytical and troubleshooting skills to identify and resolve complex technical issues.


