Example 13: Stack Overflow Hybrid Queries¶

View source code

This example builds a standalone Stack Overflow pipeline combining document tables, property graph data, embeddings, vector indexes, hybrid queries, and a derived time-series activity layer.

Overview¶

Example 13 runs five phases:

XML to document tables
XML to graph
Embeddings plus vector indexes on Question, Answer, and Comment
Hybrid queries combining SQL, OpenCypher, and vector search
Derived time-series analytics from existing timestamped events

Current Repository Guidance¶

The script is intentionally standalone and does not use Docker
Phase 1 uses async SQL insert for document-table preload
Keep that Phase 1 preload on a single async worker; do not rely on multi-threaded async insert for this workload in the current Python examples
Phase 2 uses GraphBatch for the initial graph node and edge load
GraphBatch is the repository's recommended bulk graph ingest path from Python
Graph edge creation uses RID-based directed endpoints
Traversal semantics are directional
Phase 5 does not replace the document or graph models; it builds a compact ActivitySeries projection from existing CreationDate / Date fields

Run¶

From bindings/python/examples:

python 13_stackoverflow_hybrid_queries.py \
  --dataset stackoverflow-tiny \
  --batch-size 10000 \
  --encode-batch-size 256 \
  --model all-MiniLM-L6-v2 \
  --heap-size 4g \
  --top-k 10 \
  --timeseries-top-tags 5

Key Options¶

--dataset: dataset size from stackoverflow-tiny through stackoverflow-full
--batch-size: load batch size
--encode-batch-size: embedding batch size
--model: sentence-transformers model name
--heap-size: JVM heap size
--top-k: hybrid/vector result count
--candidate-limit: candidate pool size for hybrid ranking
--timeseries-top-tags: number of top tags projected into the derived time-series activity layer

Time-Series Layer¶

The script now adds a compact derived TimeSeries type after the main hybrid phases:

ActivitySeries
timestamp: daily bucket timestamp
tags: event_type, scope, tag
fields: event_count, total_score, avg_score

The source data comes from the timestamps already present on:

Question.CreationDate
Answer.CreationDate
Comment.CreationDate
Badge.Date

That means the example stays faithful to the Stack Overflow domain model while also demonstrating how ArcadeDB can project graph/document events into a time-series view for trend analysis.