Description :

Job Summary :

We are seeking a seasoned Site Reliability Engineer (SRE) to join our growing team. This is a critical role in ensuring the reliability, scalability, and performance of our cloud infrastructure on AWS. You will apply your expertise in automation, infrastructure management, and cost optimization to build and maintain resilient systems that support our business objectives. This role requires a proactive, results-oriented individual who takes ownership of system reliability.

Responsibilities :

- Design, deploy, and manage highly available and scalable infrastructure on AWS.

- Automate infrastructure provisioning and configuration using tools such as Terraform and Ansible.

- Develop and implement monitoring and alerting systems to detect incidents early and support troubleshooting.

- Optimize infrastructure costs on AWS through effective resource management and utilization analysis.

- Collaborate with development teams to implement DevOps practices and ensure smooth deployments.

- Participate in on-call rotations and respond diligently to incidents to minimize downtime.

- Continuously improve infrastructure reliability and performance through automation and best practices.

- Stay up to date with the latest trends and technologies in cloud computing and SRE principles.

Qualifications :

- 4+ years of experience in Site Reliability Engineering or a related field such as DevOps.

- Proven expertise in deploying and managing infrastructure on AWS (EC2, S3, VPC, etc.).

- Strong experience with Linux operating systems is required; prior experience as a Linux administrator is a plus.

- Strong understanding of networking fundamentals is required.

- Solid knowledge of infrastructure automation tools such as Terraform and Ansible.

- Experience with DevOps methodologies and CI/CD pipelines.

- A strong understanding of AWS cost optimization principles.

- Excellent problem-solving and analytical skills.

- Ability to work independently as well as part of a cross-functional team.

- A diligent and proactive approach to incident response.

- Willingness to participate in on-call rotations.

Good to Have :

- Experience with compliance frameworks and regulations such as SOC 2 and HIPAA.

- Experience with container orchestration tools such as Kubernetes.