Posted on: 30/12/2025
Description :
Role : Principal / Senior Cloud Engineer - Kubernetes (AI Infrastructure focus)
Experience : 6 to 8 Years
Role Summary :
The Principal / Senior Cloud Engineer is a critical technical leadership role responsible for architecting the high-availability infrastructure that powers mission-critical software and AI applications.
This role demands a mastery of the Kubernetes (K8s) ecosystem, specifically the development of custom K8s Operators and Custom Resource Definitions (CRDs) to automate complex lifecycle management.
You will bridge the gap between core infrastructure and the emerging Agentic AI landscape, ensuring that autonomous agent clusters and RAG pipelines are scalable, resilient, and performant.
Leveraging Golang, you will build self-healing systems capable of automatic failover, rolling upgrades, and vertical/horizontal scaling, while optimizing the underlying Linux environment for peak OS-level performance.
Responsibilities :
- Kubernetes Control Plane & Operators : Design and build custom Kubernetes Operators using Golang to manage the lifecycle of Product CRDs, ensuring automated deployment and state management.
- Lifecycle Management : Orchestrate seamless rolling upgrades, automated backup/restore procedures, and robust failover mechanisms to maintain zero-downtime service availability.
- Scaling & Observability : Implement advanced horizontal and vertical scaling strategies, coupled with comprehensive monitoring and alerting systems to ensure cluster health.
- AI Infrastructure Orchestration : Architect and deploy Agentic AI systems and multi-step reasoning frameworks (LangGraph, AutoGen) within containerized environments.
- Data & Retrieval Pipelines : Support the infrastructure for Retrieval-Augmented Generation (RAG) pipelines and Vector Databases, ensuring low-latency data retrieval for LLM context.
- Integration & Communication : Build and deploy conversational chat applications for enterprise platforms like MS Teams or Slack, utilizing MCP server implementations for agent-tool communication.
- Distributed Systems Engineering : Manage and tune distributed components such as Kafka, ZooKeeper, etcd, or Consul to support high-scale event-driven architectures.
- Performance Tuning : Perform deep-dive OS-level performance tuning and apply multi-threaded programming techniques to optimize resource utilization across the K8s cluster.
- Secure Automation : Develop and maintain Linux shell scripts for system automation, ensuring compliance with secure containerization and orchestration best practices.
Technical Requirements :
- Overall Experience : 6 to 8 years of professional engineering experience, with a deep focus on cloud-native infrastructure and distributed systems.
- Core Languages : Mandatory proficiency in Golang for backend and operator development, along with expertise in Python for AI library integration.
- Kubernetes Mastery : Hands-on experience building K8s Operators, managing Docker containers, and configuring complex K8s clusters.
- GenAI & Agentic AI : 2+ years of exposure to Generative AI models, prompt engineering, and frameworks like LangChain, LangGraph, or CrewAI.
- Cloud & Networking : Strong understanding of cloud computing (AWS/GCP/Azure) and high-performance protocols such as gRPC.
- Database & Storage : Hands-on experience with SQL, NoSQL, and Vector Databases (e.g., Pinecone, Milvus) for AI context engineering.
- Linux Expertise : Expert-level knowledge of Linux internals, shell scripting, and system-level performance optimization.
Preferred Skills :
- Agent Orchestration : Practical exposure to Model Context Protocol (MCP) server implementation for standardized agent-tool interaction.
- Reinforcement Learning : Familiarity with RLHF (Reinforcement Learning from Human Feedback) and LLM fine-tuning workflows.
- Distributed Coordination : Experience managing etcd or Consul for distributed configuration and service discovery.
- Agile Leadership : Proven experience leading technical initiatives within an Agile/Scrum framework.
- Conversational AI : Experience building bot-based tool orchestration for enterprise productivity suites.
- Concurrency : Advanced knowledge of multi-threaded programming and concurrency patterns in Go.
Posted in : DevOps / SRE
Functional Area : DevOps / Cloud
Job Code : 1595462