Example 13: Stack Overflow Hybrid Queries¶
This example builds a standalone Stack Overflow pipeline combining document tables, property graph data, embeddings, vector indexes, hybrid queries, and a derived time-series activity layer.
Overview¶
Example 13 runs five phases:
- XML to document tables
- XML to graph
- Embeddings plus vector indexes on Question, Answer, and Comment
- Hybrid queries combining SQL, OpenCypher, and vector search
- Derived time-series analytics from existing timestamped events
Current Repository Guidance¶
- The script is intentionally standalone and does not use Docker
- Phase 1 uses async SQL insert for document-table preload
- Keep that Phase 1 preload on a single async worker; do not rely on multi-threaded async insert for this workload in the current Python examples
- Phase 2 uses
GraphBatchfor the initial graph node and edge load GraphBatchis the repository's recommended bulk graph ingest path from Python- Graph edge creation uses RID-based directed endpoints
- Traversal semantics are directional
- Phase 5 does not replace the document or graph models; it builds a compact
ActivitySeriesprojection from existingCreationDate/Datefields
Run¶
From bindings/python/examples:
python 13_stackoverflow_hybrid_queries.py \
--dataset stackoverflow-tiny \
--batch-size 10000 \
--encode-batch-size 256 \
--model all-MiniLM-L6-v2 \
--heap-size 4g \
--top-k 10 \
--timeseries-top-tags 5
Key Options¶
--dataset: dataset size fromstackoverflow-tinythroughstackoverflow-full--batch-size: load batch size--encode-batch-size: embedding batch size--model: sentence-transformers model name--heap-size: JVM heap size--top-k: hybrid/vector result count--candidate-limit: candidate pool size for hybrid ranking--timeseries-top-tags: number of top tags projected into the derived time-series activity layer
Time-Series Layer¶
The script now adds a compact derived TimeSeries type after the main hybrid phases:
ActivitySeries- timestamp: daily bucket timestamp
- tags:
event_type,scope,tag - fields:
event_count,total_score,avg_score
The source data comes from the timestamps already present on:
Question.CreationDateAnswer.CreationDateComment.CreationDateBadge.Date
That means the example stays faithful to the Stack Overflow domain model while also demonstrating how ArcadeDB can project graph/document events into a time-series view for trend analysis.