Dataset Downloader
Use download_data.py to fetch and prepare datasets for the examples. Run it from the examples/ directory so paths resolve correctly.
Quick Start
All datasets are stored under bindings/python/examples/data/.
Supported Datasets
- MovieLens: `movielens-small`, `movielens-large`
- Stack Exchange: `stackoverflow-tiny`, `stackoverflow-small`, `stackoverflow-medium`, `stackoverflow-large`, `stackoverflow-xlarge`, `stackoverflow-full`
- MSMARCO v2.1: `msmarco-1m`, `msmarco-5m`, `msmarco-10m`
Usage
python download_data.py movielens-large
python download_data.py stackoverflow-tiny
python download_data.py stackoverflow-small
python download_data.py stackoverflow-small --no-vectors
python download_data.py stackoverflow-small --vector-model all-MiniLM-L6-v2
python download_data.py stackoverflow-small --vector-batch-size 128
python download_data.py stackoverflow-small --vector-shard-size 100000
python download_data.py stackoverflow-small --vector-max-rows 50000
python download_data.py stackoverflow-small --vector-gt-queries 1000 --vector-gt-topk 50
python download_data.py stackoverflow-large
python download_data.py stackoverflow-xlarge
python download_data.py stackoverflow-full
python download_data.py msmarco-1m
Notes
- MovieLens NULL injection is enabled by default (use `--no-nulls` to skip).
- Stack Exchange vectors are generated by default for questions, answers, and comments. The script also builds a combined `all` corpus and computes exact ground truth (`.gt.jsonl`) from sampled queries, MSMARCO-style. Use `--no-vectors` to skip.
- For Stack Exchange datasets, the script emits a copy-friendly entity count block after the run (`User`, `Post`, `Comment`, `Badge`, `Vote`, `PostLink`, `Tag`, `PostHistory`, `Total`) so you can paste it directly into markdown.
- MSMARCO downloads parquet shards and converts them to vector shards with a ground-truth file.
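To illustrate what "exact ground truth" means for L2-normalized vectors, here is a minimal brute-force sketch. It assumes similarity is measured by inner product (equal to cosine similarity on unit vectors); the metric the script actually uses is not stated here, and `exact_topk` is a hypothetical helper, not a function from `download_data.py`.

```python
import numpy as np

def exact_topk(corpus: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """For each query row, return indices of the k highest-scoring corpus rows.

    Assumes rows are L2-normalized, so the inner product equals cosine similarity.
    """
    k = min(k, corpus.shape[0])
    scores = queries @ corpus.T  # shape: (n_queries, n_corpus)
    # Select the k best candidates per row without sorting the whole row...
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # ...then order just those k by descending score.
    top_scores = np.take_along_axis(scores, top, axis=1)
    order = np.argsort(-top_scores, axis=1)
    return np.take_along_axis(top, order, axis=1)

# Synthetic stand-in data (not real dataset vectors): 1,000 unit vectors in 384-D.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries = corpus[:5]  # sample the queries from the corpus itself
neighbors = exact_topk(corpus, queries, k=50)
print(neighbors.shape)  # (5, 50)
```

The full score matrix has one entry per (query, corpus vector) pair, so for the larger datasets a real implementation would likely process the corpus in chunks and merge the per-chunk results.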
Dependencies
Install only what you need for the datasets you plan to download:
- Stack Exchange: `py7zr`
- Stack Exchange vectors: `sentence-transformers`, `torch`, `numpy`
- MSMARCO: `huggingface_hub`, `numpy`, `pyarrow`
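Before starting a long download, it can help to check which of these packages are importable. The sketch below uses only the standard library; the family keys (`stackexchange`, `stackexchange-vectors`, `msmarco`) are hypothetical labels chosen for this example, not options of `download_data.py`.

```python
from importlib.util import find_spec

# Import names per dataset family, taken from the dependency list above.
# The pip package "sentence-transformers" is imported as "sentence_transformers".
REQUIRED_MODULES = {
    "stackexchange": ["py7zr"],
    "stackexchange-vectors": ["sentence_transformers", "torch", "numpy"],
    "msmarco": ["huggingface_hub", "numpy", "pyarrow"],
}

def missing_modules(family: str) -> list[str]:
    """Return the modules for `family` that cannot be found in this environment."""
    return [name for name in REQUIRED_MODULES[family] if find_spec(name) is None]

for family in REQUIRED_MODULES:
    missing = missing_modules(family)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{family}: {status}")
```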
Output Locations
- MovieLens: `examples/data/movielens-<size>/`
- Stack Exchange: `examples/data/stackoverflow-<size>/`
- Stack Exchange vectors: `examples/data/stackoverflow-<size>/vectors/`
  - Includes per-corpus files (`questions`, `answers`, `comments`) and combined `all` files
- MSMARCO: `examples/data/MSMARCO-<size>/`
Formats & Schemas
- MovieLens: CSV files, no schema file generated.
- Stack Exchange: XML files, no schema file generated.
- Stack Exchange vectors: binary vector shards (`.f32`) plus `.meta.json`, `.ids.jsonl`, and combined `.gt.jsonl`.
  - Per-corpus outputs: `stackoverflow-<size>-questions|answers|comments.{meta.json,ids.jsonl,shard*.f32}`
  - Combined outputs: `stackoverflow-<size>-all.meta.json`, `stackoverflow-<size>-all.ids.jsonl`, `stackoverflow-<size>-all.gt.jsonl`
  - Defaults for combined GT: `1000` sampled queries, `topk=50` (configurable via `--vector-gt-queries` and `--vector-gt-topk`)
  - Vectors are 384-D, L2-normalized (all-MiniLM-L6-v2).
- MSMARCO: binary vector shards (`.f32`) plus `.meta.json` and `.gt.jsonl`.
  - Vectors are 1024-D; 1M/5M/10M indicate the number of vectors.
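A shard can probably be read with NumPy along the following lines. This is a sketch under an assumption that is not confirmed here: that each `.f32` file is a headerless, row-major array of little-endian float32 values. Check the accompanying `.meta.json` before relying on it. `load_shard` and the demo file name are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_shard(path, dim: int) -> np.ndarray:
    """Memory-map a raw float32 shard as a read-only (rows, dim) array.

    Assumes a headerless file of little-endian float32 values in row-major order.
    """
    path = Path(path)
    n_bytes = path.stat().st_size
    row_bytes = dim * 4  # 4 bytes per float32 value
    if n_bytes % row_bytes != 0:
        raise ValueError(f"{path} is {n_bytes} bytes, not a multiple of {row_bytes}")
    return np.memmap(path, dtype="<f4", mode="r", shape=(n_bytes // row_bytes, dim))

# Round-trip a small synthetic shard; use dim=384 for Stack Exchange, 1024 for MSMARCO.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.shard0.f32"
    np.arange(3 * 384, dtype="<f4").reshape(3, 384).tofile(demo_path)
    vectors = load_shard(demo_path, dim=384)
    print(vectors.shape)  # (3, 384)
    del vectors  # release the memory map before the temporary directory is removed
```

Memory-mapping avoids reading a whole shard into RAM, which matters for the multi-gigabyte datasets.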
Stack Overflow (sizes & counts)
Dataset sizes:
- stackoverflow-tiny: ~34 MB disk (subset of small)
- stackoverflow-small: ~642 MB disk
- stackoverflow-medium: ~2.9 GB disk
- stackoverflow-large: ~10 GB disk (subset of full)
- stackoverflow-xlarge: ~50 GB disk (subset of full)
- stackoverflow-full: ~323 GB disk
Expected document counts:
stackoverflow-tiny
- User: 10,000
- Post: 10,000
- Comment: 10,000
- Badge: 10,000
- Vote: 10,000
- PostLink: 10,000
- Tag: 668
- PostHistory: 10,000
- Total: 70,668
stackoverflow-small
- User: 138,727
- Post: 105,373
- Comment: 195,781
- Badge: 182,975
- Vote: 411,166
- PostLink: 11,005
- Tag: 668
- PostHistory: 360,340
- Total: 1,406,035
stackoverflow-medium
- User: 345,754
- Post: 425,735
- Comment: 819,648
- Badge: 612,258
- Vote: 1,747,225
- PostLink: 86,919
- Tag: 1,612
- PostHistory: 1,525,713
- Total: 5,564,864
stackoverflow-large
- User: 661,594
- Post: 2,738,307
- Comment: 2,723,828
- Badge: 1,657,162
- Vote: 7,691,408
- PostLink: 204,690
- Tag: 1,925
- PostHistory: 6,970,840
- Total: 22,649,754
stackoverflow-xlarge
- User: 3,620,888
- Post: 12,150,753
- Comment: 13,800,490
- Badge: 8,283,261
- Vote: 38,042,985
- PostLink: 1,023,432
- Tag: 9,959
- PostHistory: 31,269,534
- Total: 108,201,302
stackoverflow-full
- User: 22,484,235
- Post: 59,819,048
- Comment: 90,380,323
- Badge: 51,289,973
- Vote: 238,984,011
- PostLink: 6,552,590
- Tag: 65,675
- PostHistory: 160,790,317
- Total: 630,366,172
Stack Overflow vector counts (combined all corpus)
These counts come from generated vectors/stackoverflow-<size>-all.meta.json files (count field):
| Dataset | Vectors (count) |
|---|---|
| stackoverflow-tiny | 19,591 |
| stackoverflow-small | 300,424 |
| stackoverflow-medium | 1,242,391 |
| stackoverflow-large | 5,461,227 |
| stackoverflow-xlarge | 25,910,526 |
| stackoverflow-full | 150,063,678 |
Approximate Sizes
| Dataset | Dataset size (non-vectors) | Vector size |
|---|---|---|
| MovieLens small | ~3.2 MB | — |
| MovieLens large | ~1.5 GB | — |
| MSMARCO 1M | — | ~3.9 GB |
| MSMARCO 5M | — | ~20 GB |
| MSMARCO 10M | — | ~39 GB |
| StackOverflow tiny | ~34 MB | ~35 MB |
| StackOverflow small | ~642 MB | ~495 MB |
| StackOverflow medium | ~2.9 GB | ~2.0 GB |
| StackOverflow large | ~10 GB | ~8.8 GB |
| StackOverflow xlarge | ~50 GB | ~42 GB |
| StackOverflow full | ~323 GB | — |
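As a rough cross-check of the vector sizes, raw float32 storage takes `count × dim × 4` bytes. The sketch below applies that to the vector counts and dimensions listed in this document. It assumes `msmarco-1m` holds exactly 1,000,000 vectors, and it ignores the `.ids.jsonl`, `.meta.json`, and `.gt.jsonl` files, so the results will not match the table exactly.

```python
def raw_vector_bytes(count: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store `count` vectors of `dim` float32 values, with no overhead."""
    return count * dim * bytes_per_value

# (vector count, dimension) pairs; the msmarco-1m count is an assumption.
datasets = {
    "stackoverflow-tiny": (19_591, 384),
    "stackoverflow-small": (300_424, 384),
    "stackoverflow-large": (5_461_227, 384),
    "msmarco-1m": (1_000_000, 1024),
}

for name, (count, dim) in datasets.items():
    n_bytes = raw_vector_bytes(count, dim)
    print(f"{name}: {n_bytes / 1e9:.2f} GB ({n_bytes / 2**30:.2f} GiB)")
```

The same formula gives a quick estimate of the disk space needed when `--vector-max-rows` caps the number of vectors.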