Posted on: 17/04/2026
Description :
We are seeking a passionate and hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy.
This role will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency.
You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve detection, diagnosis, and prevention of operational issues across distributed systems.
Key Responsibilities :
- Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.
- Build and integrate AI-enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.
- Develop AI Agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.
- Implement self-healing automation using AI-driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.
- Engineer and maintain real-time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.
- Implement and manage OpenTelemetry-based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.
- Build asynchronous Python APIs and services for model inferencing and operational integration.
- Enhance observability intelligence with AI-powered capabilities such as root-cause acceleration, chatbot/search enablement, and automated insights.
- Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.
- Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.
Required Skills & Qualifications :
Core Technical Skills :
- Strong proficiency in Python and data science/ML libraries : NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.
- Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.
- Expertise in developing and deploying ML models in production (batch & streaming).
- Strong understanding of statistics, time series modeling, and anomaly detection.
Observability & Telemetry :
- Experience with OpenTelemetry for logs, metrics, traces, and spans.
- Familiarity with Observability concepts : Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.
- Experience with Observability tools such as Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, NetBrain.
Cloud, Data & Platform :
- Hands-on experience with AWS (SageMaker, Bedrock), Snowflake ML, Snowflake/Openflow, Snowflake AI Observability tooling.
- Experience building Snowflake data pipelines (streams, tasks, UDFs) and working with Cortex features.
- Strong understanding of distributed systems and microservices telemetry requirements.
Automation & Engineering Quality :
- Experience with automation pipelines, CI/CD, and infrastructure-as-code patterns supporting Observability adoption.
- Ability to build asynchronous Python APIs or services for model inference and operational integration.
Preferred Qualifications :
- Experience developing agentic AI systems that analyze telemetry, generate action recommendations, or execute automated operational responses.
- Experience building self-healing patterns, including automated rollback, service restarts, configuration corrections, and predictive maintenance.
- Experience in Snowflake ML workflows, Snowflake Cortex Agents, and data pipeline automation.
- Exposure to AI-enabled alerting, RCA automation, and operational self-healing concepts.
- Experience with large-scale operational telemetry and multi-cloud ecosystems.
Soft Skills :
- Strong analytical thinking and problem-solving.
- Excellent communication skills for cross-functional collaboration with infrastructure, SRE, engineering, business, and leadership teams.
- Curiosity, continuous learning mindset, and passion for applied AI and Observability.