Posted on: 30/04/2026
Job Description :
We are looking for an experienced Kubernetes Platform Engineer with strong expertise in Gardener to manage and support large-scale Kubernetes environments. This role involves troubleshooting complex cluster issues, optimizing platform configurations, and ensuring high availability and performance across distributed systems.
Key Responsibilities :
- Diagnose and resolve issues across Gardener control planes and managed clusters
- Handle incidents related to provisioning, scaling, and cluster upgrades
- Perform deep root cause analysis and document findings for critical incidents
- Manage the lifecycle of Kubernetes clusters, including deployment, upgrades, and maintenance
- Work with shoot and seed clusters within the Gardener ecosystem
- Ensure cluster stability, performance, and scalability
- Review and improve platform configurations for better efficiency and reliability
- Identify bottlenecks and implement performance improvements
- Support integration between Gardener components, the underlying OS (Garden Linux), and virtualization layers (KVM-based environments)
- Collaborate with infrastructure teams to maintain seamless platform operations
- Use monitoring tools such as Prometheus and Perses to track system health and performance
- Analyze logs and metrics to proactively identify issues
- Prepare detailed incident reports and root cause analysis (RCA) documents
- Create best practice guidelines and operational documentation
- Conduct knowledge transfer sessions within the team
Required Skills & Expertise :
- Strong understanding of Kubernetes internals (control plane, scheduling, networking)
- Hands-on experience managing production-grade Kubernetes clusters
- Experience working with Gardener architecture
- Knowledge of shoot and seed cluster operations
- Familiarity with cluster lifecycle management and troubleshooting
- Experience with monitoring tools such as Prometheus and Perses
- Ability to analyze metrics and logs for troubleshooting
- Understanding of Linux-based systems
- Experience with virtualization technologies (KVM preferred)
- Strong debugging and analytical skills
- Experience conducting root cause analysis and post-incident reviews
Key Deliverables :
- Detailed incident resolution reports
- Root Cause Analysis (RCA) documentation for major issues
- Configuration optimization recommendations
- Best practices and operational documentation
Candidate Profile :
- Strong ownership and accountability in handling critical incidents
- Ability to work in distributed/remote environments
- Good communication and collaboration skills
- Proactive approach to system stability and performance
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1632498