Posted on: 06/11/2025
Description :
About the Role :
We are hiring a Kubernetes AI/ML Ops Observability Engineer to build, monitor, and optimize AI infrastructure. This role combines expertise in Kubernetes, observability stacks, and AI/ML orchestration and tracing tools such as LangFuse, LangServe, and LangGraph.
Key Responsibilities :
- Manage Kubernetes-based AI/ML infrastructure, ensuring reliability and scalability.
- Implement observability solutions using Prometheus, Grafana, ELK Stack, and tracing tools.
- Monitor system health and automate alerts using Datadog and PagerDuty.
- Support deployment and monitoring of AI/ML pipelines integrated with LangServe, LangGraph, and LangFuse.
- Develop automation scripts using Python and manage infrastructure via Terraform.
- Collaborate with DevOps, ML engineers, and data teams to maintain system uptime and performance.
Required Skills :
- Strong expertise in Kubernetes, Linux, and Python scripting.
- Experience with observability tools (Prometheus, Grafana, ELK, Datadog, PagerDuty).
- Familiarity with LangServe, LangGraph, LangFuse, and modern MLOps ecosystems.
- Hands-on experience with DevOps practices, distributed tracing, and performance monitoring of distributed systems.