Posted on: 03/12/2025
Description :
Role Overview :
We are looking for an experienced Kubernetes engineer with strong expertise in Kubernetes clusters, cloud-native technologies, storage integration, and performance optimisation. The ideal candidate has hands-on experience designing, deploying, and managing large-scale Kubernetes environments across on-prem and cloud platforms, and troubleshooting complex containerised workloads.
Key Responsibilities :
Cluster Management & Deployment :
- Provision and manage Kubernetes clusters using kubeadm, RKE2, and Cluster API across cloud platforms (AWS, Azure, GCP, OpenStack).
- Deploy, scale, and upgrade applications using Kubernetes best practices (rolling updates, probes, HPA, VPA).
- Configure node scheduling strategies using taints, tolerations, and affinity rules.
Application Deployment & Troubleshooting :
- Debug CrashLoopBackOff errors and other pod failures using kubectl logs, events, and resource monitoring.
- Troubleshoot networking, persistent volumes, and service exposure issues (ClusterIP, NodePort, LoadBalancer, Ingress).
- Debug application routing, including multi-path routing, across APISIX and NGINX ingress.
- Handle application scaling and high-traffic scenarios using autoscalers.
Storage & Data Management :
- Integrate Ceph storage with Kubernetes via CSI drivers for block and filesystem provisioning.
- Troubleshoot PersistentVolume (PV) and PersistentVolumeClaim (PVC) issues.
Observability & Performance :
- Deploy and configure monitoring solutions such as Prometheus and Metrics Server.
- Benchmark cluster and workload performance (CPU, memory, networking).
- Enable log collection and analysis for multi-container pods.
Security & Networking :
- Manage authentication and RBAC policies within Kubernetes.
- Configure isolation for virtual Kubernetes clusters (vcluster).
- Handle registry authentication (AWS ECR, private registries) using image pull secrets.
Specialized Workloads :
- Deploy and manage GPU workloads using NVIDIA GPU Operator.
- Enable GPU scheduling and resource allocation for AI/ML workloads.
Operations & Maintenance :
- Troubleshoot faulty nodes (on-prem / cloud) including CPU, memory, disk, and kubelet health.
- Work on service routing and ingress configurations, and debug cloud load balancer and firewall issues.
- Perform rolling upgrades and ensure zero-downtime deployments.
Required Skills :
- Strong expertise in Kubernetes administration and cloud-native deployments.
- Hands-on experience with kubeadm, RKE2, Cluster API, and Terraform for cluster provisioning.
- Knowledge of storage integration with Ceph and CSI drivers.
- Experience with monitoring and observability tools (Prometheus, Grafana, Metrics Server).
- Strong debugging skills for pod crashes, networking issues, and persistent storage problems.
- Knowledge of NGINX ingress, APISIX, and traffic routing.
- Understanding of RBAC, security groups, and IAM policies in Kubernetes and cloud environments.
- Experience with GPU workloads in Kubernetes.
- Familiarity with CI/CD pipelines for Kubernetes deployments is a plus.
Preferred Qualifications :
- 4+ years of hands-on experience in Kubernetes roles.
- Experience in both managed (EKS, AKS, GKE) and on-prem Kubernetes clusters.
- Strong scripting skills (Bash, Python, Go preferred).
- Prior experience with infrastructure-as-code and automation tools such as Terraform, Helm, and Ansible.
- Exposure to multi-cluster and multi-tenant environments.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1584171