Description:

- Design, deploy, and manage highly available and scalable infrastructure on AWS Cloud.

- Implement and maintain CI/CD pipelines using GitHub Actions.

- Manage and optimize Kubernetes clusters (EKS) for containerized workloads.

- Implement monitoring, logging, and observability solutions using Prometheus, Grafana, Loki, Promtail, Coralogix

- Ensure high availability, reliability, and performance of production systems.

- Plan, implement, and execute Disaster Recovery (DR) strategies, including DR drills and failover testing.

- Automate infrastructure provisioning, deployment, and configuration management.

- Troubleshoot production issues, perform root cause analysis, and provide permanent fixes.

- Collaborate with development, QA, and security teams to streamline DevOps workflows.

- Maintain documentation for infrastructure, deployment, and DR processes.

- Ensure best practices in security, compliance, and cost optimization.

Required Skills & Qualifications:

Core Technical Skills:

- AWS Cloud (Expert level): EC2, S3, IAM, VPC, RDS, ELB, Auto Scaling, CloudWatch, Route 53, Lambda.

- Kubernetes (Expert level): Cluster setup, management, scaling, upgrades, and troubleshooting.

- CI/CD: GitHub Actions

- Monitoring & Logging: Prometheus, Grafana, Loki, Promtail, Coralogix

- Disaster Recovery (DR): DR strategy, backup, failover, testing, and documentation.

Additional Skills (Good to Have):

- Infrastructure as Code (Terraform / CloudFormation)

- Docker & containerization

- Linux system administration & scripting (Bash / Python)

- Security best practices, IAM policies, and secrets management