# Example 07: Stack Overflow Multi-Model Database
This example demonstrates a complete multi-model workflow using the Stack Overflow data dump. It combines Documents, Graph, and Vectors into a single cohesive system, showcasing ArcadeDB's true multi-model capabilities.
## 🎯 Goals
- Phase 1: Document Import: Import XML data (Posts, Users, Tags, Comments, Votes) into ArcadeDB documents.
- Phase 2: Graph Creation: Build a graph by creating edges between documents.
- Phase 3: Vector Embeddings: Generate embeddings for Posts and Tags to enable semantic search.
- Phase 4: Analytics: Perform complex analytics using SQL, OpenCypher, and Vector Search.
## 📊 Dataset
The example supports multiple dataset sizes from the Stack Exchange Data Dump.
| Dataset | Size (XML) | Records | Recommended Heap |
|---|---|---|---|
| Tiny (cs.stackexchange.com) | ~34 MB | ~100K | 2 GB |
| Small (stats.stackexchange.com) | ~642 MB | ~1.5M | 8 GB |
| Medium (stackoverflow.com subset) | ~2.9 GB | ~5M | 32 GB |
| Large (stackoverflow.com full) | ~323 GB | ~350M | 64+ GB |
## 🚀 Usage
Run the example from the examples/ directory:
```shell
cd bindings/python/examples

# Run all phases with the small dataset
python 07_stackoverflow_multimodel.py --dataset stackoverflow-small --phases 1 2 3 4

# Run only Phase 1 (Import)
python 07_stackoverflow_multimodel.py --dataset stackoverflow-small --phases 1
```
## 🏗️ Architecture

### Phase 1: Document Import (XML → Documents)
We use lxml for streaming XML parsing to handle large files efficiently.
Schema:

- Post: `Id`, `Title`, `Body`, `Score`, `ViewCount`, `CreationDate`, ...
- User: `Id`, `DisplayName`, `Reputation`, `AboutMe`, ...
- Tag: `Id`, `TagName`, `Count`, ...
- Comment: `Id`, `PostId`, `UserId`, `Text`, ...
- Vote: `Id`, `PostId`, `UserId`, `VoteTypeId`, ...
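The types above can also be declared up front. A sketch in ArcadeDB's SQL DDL, with property lists abbreviated (verify exact syntax against the ArcadeDB SQL reference; the bindings may also create the schema programmatically):

```sql
-- Abbreviated sketch: declare document types, a few typed properties,
-- and unique indexes on the natural keys used later for lookups.
CREATE DOCUMENT TYPE Post;
CREATE PROPERTY Post.Id INTEGER;
CREATE PROPERTY Post.Title STRING;
CREATE PROPERTY Post.Score INTEGER;
CREATE INDEX ON Post (Id) UNIQUE;

CREATE DOCUMENT TYPE User;
CREATE PROPERTY User.Id INTEGER;
CREATE INDEX ON User (Id) UNIQUE;
```

Declaring properties with explicit types also makes the Phase 1 type conversions (string attributes to integers, nullable fields) fail fast instead of silently storing strings.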
Key Techniques:

- Streaming Parse: Processes XML elements one by one to keep memory usage low.
- Batch Insert: Uses `BatchContext` for high-performance insertion.
- Type Conversion: Handles nullable fields and type mismatches (e.g., `Integer` vs `String`).
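The streaming parse can be sketched as below: `iterparse` yields each `<row>` element as soon as it closes (the dump files are one root element containing millions of `row` elements), and freeing processed elements keeps memory flat regardless of file size. The batch-insert step is elided since it depends on the bindings' API; `to_int` illustrates the nullable-field conversion.

```python
import importlib.util

# Prefer lxml (as the example does); fall back to the stdlib parser,
# whose iterparse() has the same shape minus lxml's extras.
if importlib.util.find_spec("lxml"):
    from lxml import etree
else:
    import xml.etree.ElementTree as etree

def stream_rows(xml_path):
    """Yield one attribute dict per <row> element without loading the whole file."""
    for _, elem in etree.iterparse(xml_path):
        if elem.tag != "row":
            continue
        yield dict(elem.attrib)   # copy attributes before freeing the element
        elem.clear()              # drop the processed element's children/text
        # lxml only: also unlink processed siblings from the root, or memory grows.
        if hasattr(elem, "getprevious"):
            while elem.getprevious() is not None:
                del elem.getparent()[0]

def to_int(value):
    """Nullable-field handling: dump attributes are strings and may be absent."""
    return int(value) if value is not None else None
```

Each yielded dict maps attribute names (e.g. `Id`, `Score`) to strings, ready for type conversion and batched insertion.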
### Phase 2: Graph Creation (Documents → Graph)
We transform the document store into a graph by creating relationships.
Edges:

- `User -[ASKED]-> Post` (Question)
- `User -[ANSWERED]-> Post` (Answer)
- `Post -[ANSWER_TO]-> Post` (Answer links to Question)
- `Post -[HAS_TAG]-> Tag`
- `User -[COMMENTED]-> Post`
- `User -[VOTED]-> Post`
Key Techniques:

- RID-Based Pagination: Efficiently iterates through millions of records using `@rid > last_rid`.
- Index-Based Lookups: Uses `lookupByKey` (O(1)) instead of SQL `IN` (O(N)) for massive speedups in vertex resolution.
- Nested Queries: Prevents data loss during pagination by decoupling the scan (RID-based) from the filter (SQL-based).
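The RID-pagination loop can be sketched as follows. `db.query` stands in for whatever query call the bindings expose (a hypothetical name here), and `#-1:-1` serves as a cursor value that sorts before any real RID:

```python
def paginate_by_rid(db, type_name, batch_size=10_000):
    """Scan every record of a type in RID order, one batch at a time.
    Unlike OFFSET/LIMIT, each batch starts exactly where the last one
    ended, so per-batch cost stays constant as the scan advances."""
    last_rid = "#-1:-1"  # sorts before any real record identifier
    while True:
        batch = db.query(
            f"SELECT FROM {type_name} WHERE @rid > :last ORDER BY @rid LIMIT :n",
            {"last": last_rid, "n": batch_size},
        )
        if not batch:
            return
        yield from batch
        last_rid = batch[-1]["@rid"]
```

Inside the loop, each record's foreign keys (e.g. `OwnerUserId`) would be resolved to vertices via the unique index (`lookupByKey`-style) before creating edges.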
### Phase 3: Vector Embeddings (Graph → Vectors)
We add a semantic layer by generating embeddings for text content.
Embeddings:

- Post: `Title` + `Body` → 384-dimensional vector.
- Tag: `TagName` → 384-dimensional vector.
Key Techniques:

- JVector: Uses state-of-the-art graph-based indexing (HNSW + DiskANN/Vamana) for fast approximate nearest-neighbor search.
- Sentence Transformers: Uses `all-MiniLM-L6-v2` (or similar) to generate high-quality embeddings.
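Conceptually, the vector layer stores one 384-dimensional array per record (e.g. from `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`) and answers top-k similarity queries. The NumPy-only sketch below shows the exact ranking that a graph-based ANN index approximates without scanning every vector:

```python
import numpy as np

def top_k_cosine(query, vectors, k=3):
    """Exact top-k by cosine similarity: the O(N) baseline that a
    graph-based ANN index (HNSW/Vamana) approximates in sub-linear time."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # cosine similarity of each row vs. the query
    order = np.argsort(-sims)[:k]  # indices of the k most similar rows
    return order, sims[order]
```

On millions of 384-dimensional vectors this brute-force scan becomes the bottleneck, which is why the example delegates the search to JVector's index instead.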
### Phase 4: Analytics (Multi-Model Queries)
We demonstrate the power of combining all three models.
Example Queries:
- Graph: "Find the top 10 users who answered questions about 'python'."
- Vector: "Find questions semantically similar to 'How to parse XML in Python?'"
- Hybrid: "Find similar questions (Vector) that have a score > 10 (Document) and were asked by high-reputation users (Graph)."
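For instance, the graph query above can be written in OpenCypher roughly as follows (labels, edge names, and properties follow the schema from Phases 1–2; exact syntax depends on ArcadeDB's Cypher support):

```cypher
MATCH (u:User)-[:ANSWERED]->(a:Post)-[:ANSWER_TO]->(q:Post),
      (q)-[:HAS_TAG]->(t:Tag {TagName: 'python'})
RETURN u.DisplayName, count(a) AS answers
ORDER BY answers DESC
LIMIT 10
```

The hybrid query layers on top of this: vector search produces candidate posts, and document filters (`Score > 10`) plus graph traversals (asker reputation) prune the result set.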
## 💡 Key Learnings

- Pagination Matters: When iterating over millions of records, standard `OFFSET`/`LIMIT` is too slow. Use RID-based pagination (`WHERE @rid > :last_rid LIMIT :batch_size`).
- Filter Carefully: When combining RID pagination with SQL filters, use nested queries to ensure you don't skip records due to sparse matches.
- Index Lookups: For graph creation, always use index lookups (`lookupByKey`) to find vertices. It is orders of magnitude faster than SQL queries.
- Batching: Always use batching (`BatchContext`) for imports and updates to minimize transaction overhead.
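The first two points combine into the nested-query shape sketched below: the inner query owns the RID cursor over the raw scan, while the outer query applies the content filter (`PostTypeId = 2` selects answers in the Stack Exchange schema; the extra `Score` filter is illustrative). A batch with few or no matches still advances the cursor by a full `:batch_size` records:

```sql
SELECT FROM (
  SELECT FROM Post
  WHERE @rid > :last_rid
  ORDER BY @rid
  LIMIT :batch_size
) WHERE PostTypeId = 2 AND Score > 0
```

Note that the cursor must then be advanced from the inner scan (the highest RID examined), not from the filtered results, or sparse batches would stall or skip ahead incorrectly.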