Job Description


Role Overview :


We are looking for an experienced Kubernetes engineer with strong expertise in Kubernetes clusters, cloud-native technologies, storage integration, and performance optimisation. The ideal candidate has hands-on experience designing, deploying, and managing large-scale Kubernetes environments across on-prem and cloud platforms, and troubleshooting complex containerised workloads.


Key Responsibilities :


Cluster Management & Deployment :


- Provision and manage Kubernetes clusters using kubeadm, RKE2, and Cluster API across cloud platforms (AWS, Azure, GCP, OpenStack).


- Deploy, scale, and upgrade applications using Kubernetes best practices (rolling updates, probes, HPA, VPA).


- Configure node scheduling strategies using taints, tolerations, and affinity rules.


Application Deployment & Troubleshooting :


- Debug CrashLoopBackOff and pod failures using kubectl logs, events, and resource monitoring.


- Troubleshoot networking, persistent volumes, and service exposure issues (ClusterIP, NodePort, LoadBalancer, Ingress).


- Debug application routing using APISIX, NGINX ingress, and multi-path routing.


- Handle application scaling and high-traffic scenarios using autoscalers.


Storage & Data Management :


- Integrate Ceph storage with Kubernetes via CSI drivers for block and filesystem provisioning.


- Troubleshoot PersistentVolume (PV) and PersistentVolumeClaim (PVC) issues.


Observability & Performance :


- Deploy and configure monitoring solutions such as Prometheus and Metrics Server.


- Benchmark cluster and workload performance (CPU, memory, networking).


- Enable log collection and analysis for multi-container pods.


Security & Networking :


- Manage authentication and RBAC policies within Kubernetes.


- Configure isolation for virtual Kubernetes clusters (vcluster).


- Handle registry authentication (AWS ECR, private registries) using image pull secrets.


Specialized Workloads :


- Deploy and manage GPU workloads using NVIDIA GPU Operator.


- Enable GPU scheduling and resource allocation for AI/ML workloads.


Operations & Maintenance :


- Troubleshoot faulty nodes (on-prem / cloud) including CPU, memory, disk, and kubelet health.


- Work on service routing and ingress configurations, and debug cloud load balancer and firewall issues.


- Perform rolling upgrades and ensure zero-downtime deployments.


Required Skills :


- Strong expertise in Kubernetes administration and cloud-native deployments.


- Hands-on experience with kubeadm, RKE2, Cluster API, and Terraform for cluster provisioning.


- Knowledge of storage integration with Ceph and CSI drivers.


- Experience with monitoring and observability tools (Prometheus, Grafana, Metrics Server).


- Strong debugging skills for pod crashes, networking issues, and persistent storage problems.


- Knowledge of NGINX ingress, APISIX, and traffic routing.


- Understanding of RBAC, security groups, and IAM policies in Kubernetes and cloud environments.


- Experience with GPU workloads in Kubernetes.


- Familiarity with CI/CD pipelines for Kubernetes deployments is a plus.


Preferred Qualifications :


- 4+ years of hands-on experience in Kubernetes roles.


- Experience in both managed (EKS, AKS, GKE) and on-prem Kubernetes clusters.


- Strong scripting skills (Bash, Python, Go preferred).


- Prior experience with infrastructure-as-code tools like Terraform, Helm, and Ansible.


- Exposure to multi-cluster and multi-tenant environments.

