Posted on: 27/02/2026
Description:
Key Responsibilities:
- Vector & Graph ETL: Design and maintain pipelines that transform unstructured data (PDFs, emails, logs, chats) into optimized embeddings for vector databases (Pinecone, Weaviate, Milvus).
- Semantic Data Modeling: Engineer data structures optimized for Retrieval-Augmented Generation (RAG), ensuring agents find the "needle in the haystack" in milliseconds.
- Knowledge Graph Construction: Build and scale Knowledge Graphs (Neo4j) to represent complex relationships in our trading and support data that standard vector search misses.
- Automated Data Labeling & Synthetic Data: Implement pipelines that use LLMs to auto-label datasets or generate synthetic edge cases for agent training and evaluation.
- Stream Processing for Agents: Build real-time data "listeners" (Kafka/Flink) that feed live context to agents, allowing them to react to market or support events as they happen.
- Data Reliability & "Drift" Detection: Build monitoring for embedding drift, identifying when the statistical distribution of incoming data changes and the agent's "knowledge" becomes stale.
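One way to picture the drift-detection responsibility above: compare the centroid of a fresh batch of embeddings against a baseline batch and alert when they diverge. This is a minimal sketch with made-up thresholds and synthetic data, not a production monitor (real pipelines would also track per-dimension statistics and recall on a held-out query set).

```python
import numpy as np

def drift_score(baseline: np.ndarray, current: np.ndarray) -> float:
    """Cosine distance between the centroids of two embedding batches.

    A rising score suggests the incoming data's distribution has moved
    away from the corpus the vector index was built on (embedding drift).
    """
    b = baseline.mean(axis=0)
    c = current.mean(axis=0)
    cos = np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c))
    return 1.0 - float(cos)

# Synthetic demo: batches drawn from the same distribution score near 0,
# a shifted batch scores much higher.
rng = np.random.default_rng(0)
baseline = rng.normal(1.0, 1.0, size=(1000, 384))
same = rng.normal(1.0, 1.0, size=(1000, 384))
shifted = rng.normal(-1.0, 1.0, size=(1000, 384))
```

In practice the score would be computed on a schedule (e.g. an Airflow task over each day's ingested chunks) and alert when it crosses a tuned threshold.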
Qualifications:
- Vector Database Mastery: Expert-level configuration of HNSW indexes, scalar quantization, and metadata filtering strategies within Pinecone, Milvus, or Qdrant.
- Advanced Python & Rust: Proficiency in Python for AI logic and Rust (or C++) for high-performance data processing and custom embedding functions.
- Big Data Ecosystem: Hands-on experience with Apache Spark, Flink, and Kafka in a high-throughput environment (Trading/FinTech preferred).
- LLM Data Tooling: Deep experience with Unstructured.io, LlamaIndex, or LangChain for document parsing and chunking strategy optimization.
- MLOps & DataOps: Mastery of DVC (Data Version Control) and Airflow/Prefect for managing complex, non-linear AI data workflows.
- Embedding Models: Understanding of how to fine-tune embedding models (e.g., BGE, Cohere, or OpenAI) to better represent domain-specific (trading) terminology.
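For candidates unfamiliar with the scalar quantization mentioned above: the core idea is mapping float32 embedding dimensions onto int8, cutting index memory roughly 4x at a small recall cost. This toy sketch shows the principle only; it is not the on-disk format of any particular vector database.

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Per-dimension scalar quantization of float32 embeddings to int8.

    Each dimension is linearly rescaled from [min, max] onto [-128, 127];
    `lo` and `scale` are kept so distances can be approximated later.
    """
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # guard against constant dimensions
    q = np.round((vectors - lo) / scale) - 128
    return q.astype(np.int8), lo, scale

def dequantize(q: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Invert the mapping; reconstruction error is at most scale/2 per dim."""
    return (q.astype(np.float32) + 128) * scale + lo
```

Production systems layer refinements on top (per-segment calibration, asymmetric distance computation), but the memory/accuracy trade-off is the same.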
Additional qualifications:
- Chunking Strategy Architect: You don't just "split text." You implement semantic chunking and parent-child retrieval strategies to maximize LLM context relevance.
- Cold/Warm/Hot Storage Strategy: Managing cost and latency by tiering data between vector DBs (hot), SQL/NoSQL (warm), and S3/data lakes (cold).
- Privacy & Redaction Pipelines: Building automated PII (Personally Identifiable Information) redaction into the ingestion layer to ensure agents never "see" or "leak" sensitive user data.
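The redaction requirement above can be sketched as a pass over each document before it is chunked and embedded. The patterns here are illustrative only; a real ingestion layer would combine regexes like these with NER models and locale-aware validators.

```python
import re

# Illustrative patterns only (hypothetical, not exhaustive): production
# redaction layers add NER and checksum validation on top of regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a typed placeholder before ingestion.

    Typed placeholders ([EMAIL], [CARD]) keep the surrounding sentence
    readable for the embedding model while removing the sensitive value.
    """
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running redaction before embedding matters: once raw PII lands in a vector index, it can be surfaced verbatim by any sufficiently similar query.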
Posted in: Data Engineering
Functional Area: ML / DL / AI Research
Job Code: 1616693