We are looking for a Site Reliability Engineer (SRE) Developer to ensure the reliability, scalability, and performance of our systems. The ideal candidate will combine software engineering, DevOps, and system administration expertise to automate infrastructure, monitor services, and maintain high uptime for critical applications.

This role involves collaborating with development and operations teams to improve reliability and streamline deployment processes.

Key Responsibilities:

- Design, implement, and maintain highly available and scalable infrastructure.

- Automate deployment, monitoring, and operational processes using scripting and IaC tools.

- Build and maintain CI/CD pipelines, ensuring smooth application deployment.

- Monitor system performance, troubleshoot incidents, and conduct root cause analysis.

- Ensure disaster recovery readiness, reliable backups, and security compliance across the infrastructure.

- Collaborate with development teams to improve system reliability, scalability, and performance.

- Manage cloud-based infrastructure (AWS, GCP, Azure) and containerized environments (Docker, Kubernetes).

- Document operational procedures and best practices for ongoing team reference.

- Stay current with emerging technologies and reliability engineering practices to continuously improve systems.

Required Skills & Qualifications:

- Strong experience with cloud platforms (AWS, GCP, or Azure).

- Proficiency in scripting languages such as Python, Bash, or Go.

- Hands-on experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI).

- Solid understanding of Linux/Unix systems, networking, and system administration.

- Experience with containerization and orchestration (Docker, Kubernetes).

- Knowledge of monitoring and observability tools (Prometheus, Grafana, ELK, CloudWatch).

- Strong problem-solving, analytical, and debugging skills.

- Excellent collaboration and communication skills.