Posted on: 10/12/2025
Description :
About the Role :
We are looking for an experienced Senior DevOps Engineer to architect, automate, and manage our multi-region, high-availability cloud infrastructure.
This is a high-ownership role where you will lead DevOps strategy, design scalable systems, and work closely with AI/ML teams to support GPU-driven workloads.
Key Responsibilities :
- Design and manage multi-cluster EKS Kubernetes environments with autoscaling and spot/on-demand node balancing.
- Architect scalable CI/CD pipelines for backend, frontend, and AI workloads.
- Lead infrastructure automation using Terraform / AWS CDK, ensuring deterministic and repeatable deployments.
- Manage Docker image build farms, GitOps workflows, and microservices deployments.
- Implement GPU scheduling, batch jobs, and AI model inference infrastructure.
- Architect and maintain multi-region systems, RDS Multi-AZ, backups, and disaster recovery (DR) strategies.
- Own observability : Prometheus, Grafana, Loki, OpenTelemetry, Elasticsearch/OpenSearch.
- Implement cloud security best practices : IAM, VPC, KMS, WAF, Shield, penetration testing workflows.
- Drive incident response, root cause analysis, and long-term stability improvements.
- Evaluate and adopt modern DevOps tooling; run PoCs and recommend improvements.
- Mentor junior engineers and guide them in cloud, Kubernetes, and automation practices.
Must-Have Skills :
- 3+ years of hands-on experience in AWS cloud infrastructure and Kubernetes (EKS).
- Strong expertise in Docker, GitHub Actions, ArgoCD, CodePipeline.
- Solid experience with Terraform / AWS CDK and Infrastructure as Code practices.
- Deep understanding of multi-region systems, Multi-AZ architectures, DR patterns.
- Proficiency with monitoring & logging stacks (Prometheus, Grafana, Loki, OpenSearch).
- Strong understanding of VPC networking, IAM, KMS, Secrets Manager.
- Strong troubleshooting, performance tuning, and debugging skills.
Nice-to-Have Skills :
- Experience with Kafka/MSK, Redis, RabbitMQ, SQS.
- Experience with GPU-heavy AI/ML workloads and Kubernetes GPU operators.
- Background in SOC 2, GDPR, or DPDP compliance readiness.
Who You Are :
- A self-driven engineer who takes ownership of complex cloud systems.
- A strong communicator and collaborator across cross-functional teams.
- Capable of leading decisions, mentoring juniors, and driving technical excellence.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1588397