
Job Description

We are seeking a passionate and hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy.

This role will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency.

You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve detection, diagnosis, and prevention of operational issues across distributed systems.

Key Responsibilities :

- Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.

- Build and integrate AI-enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.

- Develop AI agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.

- Implement self-healing automation using AI-driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.

- Engineer and maintain real-time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.

- Implement and manage OpenTelemetry-based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.

- Build asynchronous Python APIs and services for model inferencing and operational integration.

- Enhance observability intelligence with AI-powered capabilities such as root-cause acceleration, chatbot/search enablement, and automated insights.

- Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.

- Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.

Required Skills & Qualifications :

Core Technical Skills :

- Strong proficiency in Python and data science/ML libraries : NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.

- Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.

- Expertise in developing and deploying ML models in production (batch & streaming).

- Strong understanding of statistics, time series modeling, and anomaly detection.

Observability & Telemetry :

- Experience with OpenTelemetry for logs, metrics, traces, and spans.

- Familiarity with Observability concepts : Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.

- Experience with Observability tools such as Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, NetBrain.

Cloud, Data & Platform :

- Hands-on experience with AWS (SageMaker, Bedrock), Snowflake ML, Snowflake Openflow, and Snowflake AI Observability tooling.

- Experience building Snowflake data pipelines (streams, tasks, UDFs) and working with Cortex features.

- Strong understanding of distributed systems and microservices telemetry requirements.

Automation & Engineering Quality :

- Experience with automation pipelines, CI/CD, and infrastructure-as-code patterns supporting Observability adoption.

- Ability to build asynchronous Python APIs or services for model inference and operational integration.

Preferred Qualifications :

- Experience developing agentic AI systems that analyze telemetry, generate action recommendations, or execute automated operational responses.

- Experience building self-healing patterns, including automated rollback, service restarts, configuration corrections, and predictive maintenance.

- Experience in Snowflake ML workflows, Snowflake Cortex Agents, and data pipeline automation.

- Exposure to AI-enabled alerting, RCA automation, and operational self-healing concepts.

- Experience with large-scale operational telemetry and multi-cloud ecosystems.

Soft Skills :

- Strong analytical thinking and problem solving.

- Excellent communication skills for cross-functional collaboration with infrastructure, SRE, engineering, business, and leadership teams.

- Curiosity, continuous learning mindset, and passion for applied AI and Observability.

