- Design and develop agentic AI systems that observe platform state, reason over signals, and take controlled actions
- Build AI-driven agents for:
1. GPU resource optimization and scheduling
2. Database performance analysis and anomaly detection
3. Platform health monitoring and automated remediation
- Integrate LLM-based agents with platform APIs, orchestration systems, and observability data
- Define safety guardrails, approval workflows, and rollback mechanisms for agent actions
- Continuously improve agent behavior using feedback loops and real-world signals
This role focuses on agentic AI operating on real infrastructure and data platforms, not chatbot or prompt-only applications.
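To make the guardrail, approval, and rollback responsibilities above concrete, here is a minimal Python sketch of one controlled agent action. It is an illustration under stated assumptions, not a description of the team's actual system: `Action`, `make_scale_action`, `run_action`, the `gpu_replicas` key, and the `approve` and `healthy` callbacks are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Action:
    """A hypothetical agent action paired with its own rollback."""
    name: str
    risk: str  # "low" or "high"; high-risk actions need human approval
    apply: Callable[[State], None]
    rollback: Callable[[State], None]


def make_scale_action(delta: int, risk: str) -> Action:
    """Build an action that changes a (hypothetical) GPU replica count."""
    def apply(state: State) -> None:
        state["gpu_replicas"] += delta

    def rollback(state: State) -> None:
        state["gpu_replicas"] -= delta

    return Action(name=f"scale_by_{delta}", risk=risk,
                  apply=apply, rollback=rollback)


def run_action(state: State,
               action: Action,
               approve: Callable[[Action], bool],
               healthy: Callable[[State], bool],
               audit_log: List[str]) -> str:
    """Apply an action behind an approval gate, rolling back if unhealthy."""
    # Guardrail: high-risk actions are blocked unless a human approves.
    if action.risk == "high" and not approve(action):
        audit_log.append(f"{action.name}: denied")
        return "denied"

    action.apply(state)

    # Post-check: undo the change if the platform is no longer healthy.
    if not healthy(state):
        action.rollback(state)
        audit_log.append(f"{action.name}: rolled_back")
        return "rolled_back"

    audit_log.append(f"{action.name}: applied")
    return "applied"


if __name__ == "__main__":
    state = {"gpu_replicas": 4}
    log: List[str] = []
    within_quota = lambda s: 0 < s["gpu_replicas"] <= 8  # assumed limit
    never_approve = lambda a: False

    print(run_action(state, make_scale_action(2, "low"),
                     never_approve, within_quota, log))
    print(run_action(state, make_scale_action(4, "low"),
                     never_approve, within_quota, log))
    print(run_action(state, make_scale_action(1, "high"),
                     never_approve, within_quota, log))
    print(state, log)
```

In this sketch every action carries its own rollback and every outcome is written to an audit log, so a failed health check leaves the state unchanged and a denied action never touches the platform at all.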
Required Skills & Experience:
- 8+ years of experience in infrastructure, platform, or systems engineering
- Experience with Kubernetes and containerized workloads
- Proficiency in automation and scripting (Python, Bash, Go, or similar)
- Experience building automation or autonomous systems that interact with real infrastructure
Preferred Skills:
- Experience with Go or Python, and with building microservices for multi-tenant systems
- Understanding of AI safety, guardrails, and human-in-the-loop designs
This role combines GPU-as-a-Service (GPUaaS), advanced database performance engineering, and agentic AI systems that observe, reason, and act on real infrastructure. You will help build platforms that are not only scalable and reliable, but also self-optimizing and increasingly autonomous.