
Dataset Downloader

Use download_data.py to fetch and prepare datasets for the examples. Run it from the bindings/python/examples/ directory so relative paths resolve correctly.

Quick Start

cd bindings/python/examples
python download_data.py movielens-small

All datasets are stored under bindings/python/examples/data/.

Supported Datasets

  • MovieLens: movielens-small, movielens-large
  • Stack Exchange: stackoverflow-tiny, stackoverflow-small, stackoverflow-medium, stackoverflow-large, stackoverflow-xlarge, stackoverflow-full
  • MSMARCO v2.1: msmarco-1m, msmarco-5m, msmarco-10m

Usage

python download_data.py movielens-large
python download_data.py stackoverflow-tiny
python download_data.py stackoverflow-small
python download_data.py stackoverflow-small --no-vectors                                  # skip vector generation
python download_data.py stackoverflow-small --vector-model all-MiniLM-L6-v2               # choose the embedding model
python download_data.py stackoverflow-small --vector-batch-size 128                       # embedding batch size
python download_data.py stackoverflow-small --vector-shard-size 100000                    # vectors per shard file
python download_data.py stackoverflow-small --vector-max-rows 50000                       # limit how many rows are embedded
python download_data.py stackoverflow-small --vector-gt-queries 1000 --vector-gt-topk 50  # ground-truth sampling parameters
python download_data.py stackoverflow-large
python download_data.py stackoverflow-xlarge
python download_data.py stackoverflow-full
python download_data.py msmarco-1m

Notes

  • MovieLens NULL injection is enabled by default (use --no-nulls to skip).
  • Stack Exchange vectors are generated by default for questions, answers, and comments. The script also builds a combined all corpus and computes exact ground truth (.gt.jsonl) from sampled queries, MSMARCO-style. Use --no-vectors to skip.
  • For Stack Exchange datasets, the script emits a copy-friendly entity count block after each run (User, Post, Comment, Badge, Vote, PostLink, Tag, PostHistory, Total) so you can paste it directly into markdown.
  • MSMARCO downloads parquet shards and converts them to vector shards with a ground-truth file.
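The "exact ground truth" mentioned above is, conceptually, a brute-force top-k nearest-neighbor search over the corpus for each sampled query. A minimal sketch of that idea (not the script's actual code), assuming L2-normalized vectors so that inner product equals cosine similarity:

```python
import numpy as np

def exact_topk(queries: np.ndarray, corpus: np.ndarray, k: int = 50) -> np.ndarray:
    """Brute-force exact top-k neighbor indices per query (descending similarity)."""
    sims = queries @ corpus.T                              # (n_queries, n_corpus)
    idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]     # k best, unordered
    order = np.argsort(-np.take_along_axis(sims, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)          # k best, ordered

# Demo on synthetic L2-normalized vectors:
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 8)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
top = exact_topk(corpus[[7]], corpus, k=5)  # query equal to corpus row 7
```

A query identical to a corpus vector should rank that vector first, which makes brute-force results a reliable baseline for evaluating approximate indexes.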

Dependencies

Install only what you need for the datasets you plan to download:

  • Stack Exchange: py7zr
  • Stack Exchange vectors: sentence-transformers, torch, numpy
  • MSMARCO: huggingface_hub, numpy, pyarrow
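Since the dependencies are optional, a quick way to see which are already available before running a download is to probe for them (module names below are the import names corresponding to the PyPI packages listed above):

```python
import importlib.util

# Optional dependencies per dataset family (import names).
optional = {
    "Stack Exchange": ["py7zr"],
    "Stack Exchange vectors": ["sentence_transformers", "torch", "numpy"],
    "MSMARCO": ["huggingface_hub", "numpy", "pyarrow"],
}

for family, modules in optional.items():
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{family}: {status}")
```

Install only the families you need; `torch` in particular is a large download.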

Output Locations

  • MovieLens: examples/data/movielens-<size>/
  • Stack Exchange: examples/data/stackoverflow-<size>/
  • Stack Exchange vectors: examples/data/stackoverflow-<size>/vectors/
    • Includes per-corpus files (questions, answers, comments) and combined all files
  • MSMARCO: examples/data/MSMARCO-<size>/

Formats & Schemas

  • MovieLens: CSV files, no schema file generated.
  • Stack Exchange: XML files, no schema file generated.
  • Stack Exchange vectors: binary vector shards (.f32) plus .meta.json, .ids.jsonl, and combined .gt.jsonl.
    • Per-corpus outputs: stackoverflow-<size>-questions|answers|comments.{meta.json,ids.jsonl,shard*.f32}
    • Combined outputs: stackoverflow-<size>-all.meta.json, stackoverflow-<size>-all.ids.jsonl, stackoverflow-<size>-all.gt.jsonl
    • Defaults for combined GT: 1000 sampled queries, topk=50 (configurable via --vector-gt-queries and --vector-gt-topk)
    • Vectors are 384-D, L2-normalized (all-MiniLM-L6-v2).
  • MSMARCO: binary vector shards (.f32) plus .meta.json and .gt.jsonl.
    • Vectors are 1024-D; 1M/5M/10M indicate the number of vectors.
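The exact shard layout is not documented here, but assuming the .f32 files are raw float32 rows in native byte order (a common convention, consistent with the flat binary format implied above), they can be loaded with NumPy. The snippet below writes and reads back a tiny synthetic shard to illustrate; in real use the dimension would come from the accompanying .meta.json (the exact key name is an assumption):

```python
import os
import tempfile

import numpy as np

dim = 384  # all-MiniLM-L6-v2 embedding size, per the text above
rows = np.random.default_rng(0).normal(size=(10, dim)).astype(np.float32)
rows /= np.linalg.norm(rows, axis=1, keepdims=True)  # L2-normalize, as the script does

# Write a demo shard the way a flat .f32 file is assumed to be laid out:
shard = os.path.join(tempfile.mkdtemp(), "demo.shard0.f32")
rows.tofile(shard)

# Read it back as a (count, dim) matrix:
vecs = np.fromfile(shard, dtype=np.float32).reshape(-1, dim)
```

Memory-mapping (`np.memmap`) works the same way for shards too large to load at once.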

Stack Overflow (sizes & counts)

Dataset sizes:

  • stackoverflow-tiny: ~34 MB disk (subset of small)
  • stackoverflow-small: ~642 MB disk
  • stackoverflow-medium: ~2.9 GB disk
  • stackoverflow-large: ~10 GB disk (subset of full)
  • stackoverflow-xlarge: ~50 GB disk (subset of full)
  • stackoverflow-full: ~323 GB disk

Expected document counts:

stackoverflow-tiny

  • User: 10,000
  • Post: 10,000
  • Comment: 10,000
  • Badge: 10,000
  • Vote: 10,000
  • PostLink: 10,000
  • Tag: 668
  • PostHistory: 10,000
  • Total: 70,668

stackoverflow-small

  • User: 138,727
  • Post: 105,373
  • Comment: 195,781
  • Badge: 182,975
  • Vote: 411,166
  • PostLink: 11,005
  • Tag: 668
  • PostHistory: 360,340
  • Total: 1,406,035

stackoverflow-medium

  • User: 345,754
  • Post: 425,735
  • Comment: 819,648
  • Badge: 612,258
  • Vote: 1,747,225
  • PostLink: 86,919
  • Tag: 1,612
  • PostHistory: 1,525,713
  • Total: 5,564,864

stackoverflow-large

  • User: 661,594
  • Post: 2,738,307
  • Comment: 2,723,828
  • Badge: 1,657,162
  • Vote: 7,691,408
  • PostLink: 204,690
  • Tag: 1,925
  • PostHistory: 6,970,840
  • Total: 22,649,754

stackoverflow-xlarge

  • User: 3,620,888
  • Post: 12,150,753
  • Comment: 13,800,490
  • Badge: 8,283,261
  • Vote: 38,042,985
  • PostLink: 1,023,432
  • Tag: 9,959
  • PostHistory: 31,269,534
  • Total: 108,201,302

stackoverflow-full

  • User: 22,484,235
  • Post: 59,819,048
  • Comment: 90,380,323
  • Badge: 51,289,973
  • Vote: 238,984,011
  • PostLink: 6,552,590
  • Tag: 65,675
  • PostHistory: 160,790,317
  • Total: 630,366,172

Stack Overflow vector counts (combined all corpus)

These counts come from generated vectors/stackoverflow-<size>-all.meta.json files (count field):

| Dataset | Vectors (count) |
| --- | --- |
| stackoverflow-tiny | 19,591 |
| stackoverflow-small | 300,424 |
| stackoverflow-medium | 1,242,391 |
| stackoverflow-large | 5,461,227 |
| stackoverflow-xlarge | 25,910,526 |
| stackoverflow-full | 150,063,678 |
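These counts can be re-derived from your own downloads by reading the count field out of each generated meta file. A small sketch (the field name is taken from the text above; demonstrated on a synthetic file rather than a real download):

```python
import json
import os
import tempfile

def vector_count(meta_path: str) -> int:
    """Return the "count" field from a *-all.meta.json file."""
    with open(meta_path) as f:
        return json.load(f)["count"]

# Demo with a synthetic meta file mirroring the documented naming scheme:
meta_path = os.path.join(tempfile.mkdtemp(), "stackoverflow-small-all.meta.json")
with open(meta_path, "w") as f:
    json.dump({"count": 300424}, f)
```

Pointing `vector_count` at files under examples/data/stackoverflow-&lt;size&gt;/vectors/ reproduces the table above.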

Approximate Sizes

| Dataset | Dataset size (non-vectors) | Vector size |
| --- | --- | --- |
| MovieLens small | ~3.2 MB | |
| MovieLens large | ~1.5 GB | |
| MSMARCO 1M | | ~3.9 GB |
| MSMARCO 5M | | ~20 GB |
| MSMARCO 10M | | ~39 GB |
| StackOverflow tiny | ~34 MB | ~35 MB |
| StackOverflow small | ~642 MB | ~495 MB |
| StackOverflow medium | ~2.9 GB | ~2.0 GB |
| StackOverflow large | ~10 GB | ~8.8 GB |
| StackOverflow xlarge | ~50 GB | ~42 GB |
| StackOverflow full | ~323 GB | |