Posted on: 25/08/2025
We are seeking an experienced Kubernetes Expert to design, implement, and manage large-scale Kubernetes clusters, with a strong focus on performance, security, and reliability.
Key Responsibilities:
- Design, deploy, and manage highly available Kubernetes clusters across multi-cloud and on-prem environments.
- Implement security best practices, role-based access control (RBAC), and compliance policies.
- Ensure smooth scaling, monitoring, and troubleshooting of clusters to meet enterprise-grade requirements.
- Integrate GPU support within Kubernetes clusters to optimize performance for AI/ML workloads.
- Collaborate with data science and engineering teams to ensure seamless execution of GPU-intensive applications.
- Develop and implement metering and monitoring solutions to track cloud resource consumption.
- Optimize resource allocation and provide insights for cost optimization and efficiency.
- Provide expertise on integrating Kubernetes with OpenStack environments.
- Manage and optimize hybrid cloud deployments leveraging both Kubernetes and OpenStack.
- Work closely with DevOps, Cloud, and Infrastructure teams to implement best practices.
- Prepare detailed documentation, runbooks, and guidelines for cluster operations.
Required Expertise & Skills:
- Proven experience in designing, deploying, and managing Kubernetes clusters at scale.
- Hands-on experience in enabling GPU support in Kubernetes for AI/ML workloads.
- Strong knowledge of containerization technologies (Docker, CRI-O, containerd, etc.).
- Experience with monitoring and metering solutions (Prometheus, Grafana, custom tooling, etc.) for cloud resource utilization.
- Understanding of networking concepts within Kubernetes (CNI plugins, ingress, service mesh, etc.).
- Good knowledge of OpenStack services and experience with Kubernetes-OpenStack integration (preferred).
- Strong problem-solving, debugging, and performance-tuning skills.
- Familiarity with CI/CD pipelines and automation tools (Helm, Ansible, Terraform, Argo CD, etc.).
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1535401