Posted on: 19/11/2025
Description :
Role Summary :
Build and operate the production backbone that takes models from Applied Sciences (AS) and delivers reliable, low-latency ML services across DMS, CRM, Digital Retail, Service, Payments, and enterprise products. You'll own pipelines, microservices, CI/CD, observability, and runtime reliability, working hand-in-hand with Applied Sciences and Product to turn ideas into measurable dealer and consumer impact.
Why this Role Matters :
- Accelerate the rollout of LLM-powered and agent-driven features across products.
- Enable agentic workflows that automate, reason, and interact on behalf of users and internal stakeholders.
- Operationalize secure, compliant, and explainable LLM and agentic services at scale.
- Convert Applied Sciences models into scalable, compliant, cost-efficient production services.
- Standardize how models are trained, validated, deployed, and monitored across products.
- Power real-time, context-aware experiences by integrating batch/stream features, graph context, and online inference.
What You'll Do :
- Turn Applied Sciences prototype models (tabular, NLP/LLM, recommendation, forecasting) into fast, reliable services with well-defined API contracts.
- Integrate with the LLM Gateway/MCP, including prompt/config versioning.
- Build and orchestrate CI/CD pipelines.
- Review data science models; refactor and optimize code; containerize, deploy, version, and monitor for quality.
- Collaborate with data scientists, data engineers, product managers, and architects to design enterprise systems.
- Monitor, detect, and mitigate risks unique to LLMs and agentic systems.
- Implement prompt management : versioning, A/B testing, guardrails, and dynamic orchestration based on feedback and metrics.
- Design batch/stream pipelines (Airflow/Kubeflow, Spark/Flink, Kafka) and online features linked to our domain graph.
- Build inference microservices (REST/gRPC) with schema versioning, structured outputs, and stringent p95 latency targets.
- Manage the model/feature lifecycle : feature store strategy, model/agent registry, versioning, and lineage.
- Instrument deep observability : traces/logs/metrics, data/feature drift, model performance, safety signals, and cost tracking.
- Ensure real-time reliability : autoscaling, caching, circuit breakers, retries/fallbacks, and graceful degradation.
- Develop templates/SDKs/CLIs, sandbox datasets, and documentation that make shipping ML the default path.
Desired Skills and Experience :
- 5 - 14 years in ML engineering/MLOps or backend/platform engineering with production ML.
- Experience with LLMs, retrieval systems, vector stores, and graph/knowledge stores.
- Strong software engineering fundamentals : Python plus one of Java/Go/Scala; API design; concurrency; testing.
- Hands-on with orchestration frameworks and libraries (LangChain, LlamaIndex, OpenAI Function Calling, AgentKit, etc.).
- Knowledge of agent architectures (reactive, planning, retrieval-augmented agents), and safe execution patterns.
- Pipelines and data : Airflow/Kubeflow or similar; Spark/Flink; Kafka/Kinesis; strong data quality practices.
- Microservices and runtime : Docker/Kubernetes, service meshes, REST/gRPC; performance and reliability engineering.
- Model ops : experiment tracking, registries (e.g., MLflow), feature stores, A/B and shadow testing, drift detection.
- Observability : OpenTelemetry/Prometheus/Grafana; debugging latency, tail behavior, and memory/CPU hotspots.
- Cloud : AWS preferred (IAM, ECS/EKS, S3, RDS/DynamoDB, Step Functions/Lambda), with cost optimization experience.
- Security/compliance : secrets management, RBAC/ABAC, PII handling, auditability.
Preferred Mindset :
- Product-oriented : You measure success by dealer and consumer outcomes, not just technical metrics.
- Reliability- and safety-first : You move fast with guardrails, rollbacks, and clear SLOs.
- Systems thinker : You design for multi-tenant scale, portability, and cost efficiency.
- Collaborative : You translate between Applied Sciences, Product, and the Data & AI Platform; you document and teach.
- Pragmatic : You automate the 80% case and leave room for rapid experimentation.