Posted on: 29/10/2025
Job Description :
1. AWS Cloud Infrastructure :
- Design, deploy, and manage scalable, secure, and highly available systems on AWS.
- Optimize cloud costs, enforce tagging, and implement security best practices (IAM, VPC, GuardDuty, etc.).
- Automate infrastructure provisioning using Terraform or AWS CDK.
- Ensure backup, disaster recovery, and high availability (HA) strategies are in place.
2. Kubernetes (EKS preferred) :
- Manage and scale Kubernetes clusters (preferably Amazon EKS).
- Implement CI/CD pipelines with GitOps (e.g., ArgoCD or Flux) or traditional tools (e.g., Jenkins, GitLab).
- Enforce RBAC policies, namespace isolation, and pod security policies.
- Monitor cluster health and optimize pod scheduling, autoscaling, and resource requests/limits.
3. Monitoring and Observability (Datadog) :
- Build and maintain Datadog dashboards for real-time visibility across systems and services.
- Set up alerting policies, SLOs, SLIs, and incident response workflows.
- Integrate Datadog with AWS, Kubernetes, and applications for full-stack observability.
- Conduct post-incident reviews using Datadog analytics to reduce MTTR.
4. Automation and DevOps :
- Automate manual processes (e.g., server setup, patching, scaling) using Python, Bash, or Ansible.
- Maintain and improve CI/CD pipelines (Jenkins) for faster and more reliable deployments.
- Drive Infrastructure-as-Code (IaC) practices using Terraform to manage cloud resources.
- Promote GitOps and version-controlled deployments.
5. Linux Systems Administration :
- Administer Linux servers (Ubuntu, RHEL, Amazon Linux) for stability and performance.
- Harden OS security, configure SELinux and firewalls, and ensure timely patching.
- Troubleshoot system-level issues: disk, memory, network, and processes.
- Diagnose and optimize system performance using tools such as top, htop, iotop, and netstat.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1566818