Posted on: 25/09/2025
Role Overview :
We are seeking a highly experienced and technically proficient Site Reliability Engineer (SRE) to join our team in support of our client, Qincline. The ideal candidate will have 7 or more years of dedicated experience in Site Reliability Engineering or a closely related discipline. This pivotal role requires a strong focus on ensuring the reliability, scalability, performance, and operational efficiency of large-scale, complex production systems. You will be instrumental in bridging the gap between development and operations by applying engineering principles to operational challenges.
Key Responsibilities :
Reliability & Performance Engineering :
- System Reliability : Design, build, and maintain robust, fault-tolerant production systems and infrastructure to meet stringent Service Level Objectives (SLOs).
- Performance Tuning : Proactively identify and resolve performance bottlenecks across the entire application stack, from infrastructure to application code.
- Automation : Develop and implement automation for operational tasks, infrastructure provisioning, deployment, and monitoring to eliminate manual toil.
- Capacity Planning : Collaborate with development teams on capacity planning, forecasting demand, and ensuring the infrastructure can scale efficiently to meet future business needs.
Operations & Incident Management :
- Monitoring & Alerting : Establish and maintain comprehensive monitoring, logging, and alerting systems to gain deep visibility into system health and performance (e.g., using Prometheus, Grafana, or the ELK Stack).
- Incident Response : Serve as a key responder during critical incidents, performing rapid triage, mitigation, and recovery.
- Post-Mortems & RCA : Lead detailed Post-Mortem and Root Cause Analysis (RCA) processes for all significant incidents, ensuring that permanent fixes and preventative measures are implemented so that incidents do not recur.
- On-Call : Participate in a periodic on-call rotation to provide 24/7 support for critical production systems.
Tooling & Infrastructure :
- CI/CD & DevOps : Enhance and manage CI/CD pipelines to facilitate fast, reliable, and automated software releases.
- Containerization & Orchestration : Manage and optimize containerized environments using Docker and Kubernetes.
- Infrastructure as Code (IaC) : Utilize IaC tools (e.g., Terraform, Ansible) to provision and manage infrastructure in a repeatable and documented manner.
Required Skills & Experience :
Core Experience (7+ Years) :
- Minimum 7 years of hands-on experience in a Site Reliability Engineer, DevOps Engineer, or Production Engineer role supporting high-availability, mission-critical production environments.
- Deep expertise in establishing and improving system monitoring, logging, alerting, and telemetry practices.
- Demonstrated experience with formal Incident Management processes and leading thorough Root Cause Analysis (RCA).
Technical Expertise :
- Cloud Platforms : Extensive, hands-on experience with at least one major cloud provider (e.g., AWS, Azure, or GCP). This includes managing compute, networking, storage, and managed services.
- Scripting & Programming : Strong proficiency in scripting and programming languages, with mandatory expertise in Python and Shell scripting for automation and tooling.
- DevOps Tooling : Proven experience with CI/CD pipeline tools (e.g., Jenkins, GitLab CI, Azure DevOps), Git, and artifact repositories.
- Containerization : Expert-level knowledge of Docker and robust experience with orchestrating large-scale deployments using Kubernetes.
- Operating Systems : Strong command of Linux/Unix operating systems and networking fundamentals (TCP/IP, DNS, Load Balancing).
Desired Qualifications (Good to Have) :
- Experience with configuration management tools (e.g., Ansible, Chef, Puppet).
- Familiarity with service mesh technologies (e.g., Istio, Linkerd).
- Knowledge of database administration and performance tuning (SQL/NoSQL).
- Certifications related to SRE, Cloud (e.g., AWS Certified DevOps Engineer), or Kubernetes (CKA, CKAD).
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1552568