Posted on: 19/11/2025
Description :
Role Summary :
Build and operate the production backbone that takes models from Applied Sciences (AS) and delivers reliable, low-latency ML services across DMS, CRM, Digital Retail, Service, Payments, and enterprise products. You'll own pipelines, microservices, CI/CD, observability, and runtime reliability, working hand-in-hand with Applied Sciences and Product to turn ideas into measurable dealer and consumer impact.
Why this Role Matters :
- Accelerate the rollout of LLM-powered and agent-driven features across products.
- Enable agentic workflows that automate, reason, and interact on behalf of users and internal stakeholders.
- Operationalize secure, compliant, and explainable LLM and agentic services at scale.
- Convert Applied Sciences models into scalable, compliant, cost-efficient production services.
- Standardize how models are trained, validated, deployed, and monitored across products.
- Power real-time, context-aware experiences by integrating batch/stream features, graph context, and online inference.
What You'll Do :
- Turn Applied Sciences prototype models (tabular, NLP/LLM, recommendation, forecasting) into fast, reliable services with well-defined API contracts.
- Integrate with the LLM Gateway/MCP, including prompt/config versioning.
- Build and orchestrate CI/CD pipelines.
- Review data science models; refactor and optimize code; containerize, deploy, version, and monitor for quality.
- Collaborate with data scientists, data engineers, product managers, and architects to design enterprise systems.
- Monitor, detect, and mitigate risks unique to LLMs and agentic systems.
- Implement prompt management : versioning, A/B testing, guardrails, and dynamic orchestration based on feedback and metrics.
- Design batch/stream pipelines (Airflow/Kubeflow, Spark/Flink, Kafka) and online features linked to our domain graph.
- Build inference microservices (REST/gRPC) with schema versioning, structured outputs, and stringent p95 latency targets.
- Manage the model/feature lifecycle : feature store strategy, model/agent registry, versioning, and lineage.
- Instrument deep observability : traces/logs/metrics, data/feature drift, model performance, safety signals, and cost tracking.
- Ensure real-time reliability : autoscaling, caching, circuit breakers, retries/fallbacks, and graceful degradation.
- Develop templates/SDKs/CLIs, sandbox datasets, and documentation that make shipping ML the default path.
Desired Skills and Experience :
- 5 - 14 years in ML engineering/MLOps or backend/platform engineering with production ML.
- Experience with LLMs, retrieval systems, vector stores, and graph/knowledge stores.
- Strong software engineering fundamentals : Python plus one of Java/Go/Scala; API design; concurrency; testing.
- Hands-on with orchestration frameworks and libraries (LangChain, LlamaIndex, OpenAI Function Calling, AgentKit, etc.).
- Knowledge of agent architectures (reactive, planning, retrieval-augmented agents), and safe execution patterns.
- Pipelines and data : Airflow/Kubeflow or similar; Spark/Flink; Kafka/Kinesis; strong data quality practices.
- Microservices and runtime : Docker/Kubernetes, service meshes, REST/gRPC; performance and reliability engineering.
- Model ops : experiment tracking, registries (e.g., MLflow), feature stores, A/B and shadow testing, drift detection.
- Observability : OpenTelemetry/Prometheus/Grafana; debugging latency, tail behavior, and memory/CPU hotspots.
- Cloud : AWS preferred (IAM, ECS/EKS, S3, RDS/DynamoDB, Step Functions/Lambda), with cost optimization experience.
- Security/compliance : secrets management, RBAC/ABAC, PII handling, auditability.
Preferred Mindset :
- Product-oriented : You measure success by dealer and consumer outcomes, not just technical metrics.
- Reliability- and safety-first : You move fast with guardrails, rollbacks, and clear SLOs.
- Systems thinker : You design for multi-tenant scale, portability, and cost efficiency.
- Collaborative : You translate between Applied Sciences, Product, and the Data & AI Platform; you document and teach.
- Pragmatic : You automate the 80% case and leave room for rapid experimentation.