We are seeking a highly experienced and technically proficient Site Reliability Engineer (SRE) to join our team in support of our client, Qincline. The ideal candidate will have 7 or more years of dedicated experience in Site Reliability Engineering or a closely related discipline. This pivotal role focuses on ensuring the reliability, scalability, performance, and operational efficiency of large-scale, complex production systems. You will bridge the gap between development and operations by applying engineering principles to operational challenges.

Key Responsibilities :

Reliability & Performance Engineering :

- System Reliability : Design, build, and maintain robust, fault-tolerant production systems and infrastructure to meet stringent Service Level Objectives (SLOs).

- Performance Tuning : Proactively identify and resolve performance bottlenecks across the entire application stack, from infrastructure to application code.

- Automation : Develop and implement automation for operational tasks, infrastructure provisioning, deployment, and monitoring to eliminate manual toil.

- Capacity Planning : Collaborate with development teams to forecast demand and ensure the infrastructure can scale efficiently to meet future business needs.

Operations & Incident Management :

- Monitoring & Alerting : Establish and maintain comprehensive monitoring, logging, and alerting systems to gain deep visibility into system health and performance (e.g., using Prometheus, Grafana, or the ELK Stack).

- Incident Response : Serve as a key responder during critical incidents, performing rapid triage, mitigation, and recovery.

- Post-Mortems & RCA : Lead detailed Post-Mortem and Root Cause Analysis (RCA) processes for all significant incidents, ensuring that permanent fixes and preventative measures are implemented.

- On-Call : Participate in a periodic on-call rotation to provide 24/7 support for critical production systems.

Tooling & Infrastructure :

- CI/CD & DevOps : Enhance and manage CI/CD pipelines to facilitate fast, reliable, and automated software releases.

- Containerization & Orchestration : Manage and optimize containerized environments using Docker and Kubernetes.

- Infrastructure as Code (IaC) : Use IaC tools (e.g., Terraform, Ansible) to provision and manage infrastructure in a repeatable and documented manner.

Required Skills & Experience :

Core Experience (7+ Years) :

- Minimum 7 years of hands-on experience in a Site Reliability Engineer, DevOps Engineer, or Production Engineer role supporting high-availability, mission-critical production environments.

- Deep expertise in establishing and improving system monitoring, logging, alerting, and telemetry practices.

- Demonstrated experience with formal Incident Management processes and leading thorough Root Cause Analysis (RCA).

Technical Expertise :

- Cloud Platforms : Extensive, hands-on experience with at least one major cloud provider (e.g., AWS, Azure, or GCP). This includes managing compute, networking, storage, and managed services.

- Scripting & Programming : Strong proficiency in scripting and programming, with mandatory expertise in Python and shell scripting for automation and tooling.

- DevOps Tooling : Proven experience with CI/CD pipeline tools (e.g., Jenkins, GitLab CI, Azure DevOps), Git, and artifact repositories.

- Containerization : Expert-level knowledge of Docker and substantial experience orchestrating large-scale deployments with Kubernetes.

- Operating Systems : Strong command of Linux/Unix operating systems and networking fundamentals (TCP/IP, DNS, load balancing).

Desired Qualifications (Good to Have) :

- Experience with configuration management tools (e.g., Ansible, Chef, Puppet).

- Familiarity with service mesh technologies (e.g., Istio, Linkerd).

- Knowledge of database administration and performance tuning (SQL/NoSQL).

- Certifications related to SRE, cloud platforms (e.g., AWS Certified DevOps Engineer), or Kubernetes (e.g., CKA, CKAD).