Example 11: Vector Index Build¶
This example benchmarks vector ingest and index-build time across several vector backends.
Overview¶
Example 11 is the build-only vector benchmark.
- It resolves either MSMARCO or Stack Overflow vector shards.
- It ingests vectors into the chosen backend.
- It builds an HNSW-style index, or the nearest equivalent supported by that backend.
- It reports ingest time, index-build time, disk usage, and peak RSS.
Supported Backends¶
- arcadedb_sql
- pgvector
- qdrant
- milvus
- faiss
- lancedb
Datasets¶
- MSMARCO-*
- stackoverflow-*
Run¶
From bindings/python/examples:
python 11_vector_index_build.py \
--backend arcadedb_sql \
--dataset stackoverflow-tiny \
--max-connections 16 \
--beam-width 100 \
--batch-size 10000 \
--mem-limit 4g
Shared Build Parameters¶
The example normalizes build settings around two knobs:
- max_connections: HNSW m-style connectivity
- beam_width: HNSW ef_construction-style build breadth
The exact backend call differs, but these two values are threaded through the build in every engine that supports them.
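As a quick summary of that threading, the native parameter names each backend uses for the two knobs (collected from the per-backend build calls shown below) can be sketched as a lookup table:

```python
# Native parameter names each backend uses for the two shared knobs,
# as they appear in this example's per-backend build calls.
# (FAISS takes its M value as a positional constructor argument.)
KNOB_MAP = {
    "arcadedb_sql": {"max_connections": "maxConnections", "beam_width": "beamWidth"},
    "pgvector":     {"max_connections": "m",              "beam_width": "ef_construction"},
    "qdrant":       {"max_connections": "m",              "beam_width": "ef_construct"},
    "milvus":       {"max_connections": "M",              "beam_width": "efConstruction"},
    "faiss":        {"max_connections": "M",              "beam_width": "efConstruction"},
    "lancedb":      {"max_connections": "m",              "beam_width": "ef_construction"},
}
```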
Exact Build Operations By Backend¶
ArcadeDB¶
The ArcadeDB path explicitly creates schema and ingests vectors with SQL.
Schema DDL¶
CREATE VERTEX TYPE VectorData
CREATE PROPERTY VectorData.id INTEGER
CREATE PROPERTY VectorData.vector ARRAY_OF_FLOATS
ArcadeDB Ingest Statement¶
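The benchmark's exact ingest statement is not reproduced on this page. A hedged sketch of a parameterized ArcadeDB SQL insert matching the schema above (the statement text and the `db` handle are illustrative assumptions, not the benchmark's exact call) might look like:

```python
# Hedged sketch only: the parameterized SQL and the `db` handle are
# illustrative assumptions that match the VectorData schema above.
insert_sql = "INSERT INTO VectorData SET id = ?, vector = ?"

batch = [(i, [0.1, 0.2, 0.3]) for i in range(3)]  # (id, vector) rows
for row_id, vec in batch:
    # db.command("sql", insert_sql, row_id, vec)  # needs a live ArcadeDB connection
    pass
```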
ArcadeDB Index Build Call¶
CREATE INDEX ON VectorData (vector)
LSM_VECTOR
METADATA {
"dimensions": <dim>,
"similarity": "COSINE",
"maxConnections": <max_connections>,
"beamWidth": <beam_width>,
"quantization": <quant>,
"storeVectorsInGraph": <store_vectors_in_graph>,
"addHierarchy": <add_hierarchy>
}
This is the exact build path Example 11 uses for ArcadeDB. A Python object-based helper also exists, but both the benchmark and the recommended documentation path use SQL.
FAISS¶
FAISS uses IndexHNSWFlat wrapped in IndexIDMap2.
Index Construction¶
index_hnsw = faiss.IndexHNSWFlat(
int(dim),
int(max_connections),
faiss.METRIC_INNER_PRODUCT,
)
index_hnsw.hnsw.efConstruction = int(beam_width)
index = faiss.IndexIDMap2(index_hnsw)
FAISS Ingest Call¶
LanceDB¶
LanceDB first creates a table of {id, vector} rows, then tries HNSW-like index modes.
Table Creation¶
The first batch creates the table; later batches append rows to it.
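The create and append calls are not reproduced above. A hedged sketch of the usual LanceDB pattern (the connection path, table name, and row shapes are illustrative assumptions) is:

```python
# Hedged sketch: row shape matches the {id, vector} description above;
# the path and table name are illustrative assumptions.
rows = [{"id": i, "vector": [0.0] * 8} for i in range(4)]

# import lancedb
# db = lancedb.connect("/tmp/lancedb-bench")
# table = db.create_table("vectors", data=rows)  # first batch creates the table
# table.add(rows)                                # later batches append
```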
Index Creation Attempts¶
The exact code tries these index types in order:
- HNSW
- IVF_HNSW_SQ with num_partitions=1
The build call is:
table.create_index(
index_type=index_type,
metric="cosine",
vector_column_name="vector",
m=int(max_connections),
ef_construction=int(beam_width),
**extra_kwargs,
)
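The try-in-order behavior can be sketched as a simple fallback loop. The attempt list comes from the text above; the surrounding function is schematic, not the example's exact code:

```python
# Schematic fallback: try HNSW first, then IVF_HNSW_SQ with
# num_partitions=1, keeping the first index type the install accepts.
ATTEMPTS = [
    ("HNSW", {}),
    ("IVF_HNSW_SQ", {"num_partitions": 1}),
]

def build_index(table, max_connections, beam_width):
    last_err = None
    for index_type, extra_kwargs in ATTEMPTS:
        try:
            table.create_index(
                index_type=index_type,
                metric="cosine",
                vector_column_name="vector",
                m=int(max_connections),
                ef_construction=int(beam_width),
                **extra_kwargs,
            )
            return index_type  # report which mode actually built
        except Exception as err:  # index type unsupported on this version
            last_err = err
    raise last_err
```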
pgvector¶
The PostgreSQL vector path uses the vector extension and an HNSW index.
Schema Setup¶
CREATE EXTENSION IF NOT EXISTS vector
DROP TABLE IF EXISTS vectordata
CREATE TABLE vectordata(id INTEGER PRIMARY KEY, vector vector({dim}))
Ingest Statement¶
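The ingest statement is not reproduced on this page. A hedged sketch of a parameterized insert into the table above (the statement text and the text-literal vector serialization are assumptions, not the benchmark's exact code) could look like:

```python
# Hedged sketch: statement text and vector serialization are assumptions.
insert_sql = "INSERT INTO vectordata (id, vector) VALUES (%s, %s)"

def to_pgvector_literal(vec):
    # pgvector accepts a '[x1,x2,...]' text literal for the vector type.
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

params = [(i, to_pgvector_literal([0.1] * 4)) for i in range(3)]
# with conn.cursor() as cur:          # requires a live PostgreSQL connection
#     cur.executemany(insert_sql, params)
```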
Index Build Statement¶
CREATE INDEX vectordata_vector_hnsw ON vectordata USING hnsw (vector vector_cosine_ops) WITH (m = {m_val}, ef_construction = {ef_val})
Qdrant¶
Qdrant recreates the collection with cosine distance and HNSW config.
Collection Definition¶
client.recreate_collection(
collection_name=collection_name,
vectors_config=models.VectorParams(
size=int(dim),
distance=models.Distance.COSINE,
),
hnsw_config=models.HnswConfigDiff(
m=int(max_connections),
ef_construct=int(beam_width),
),
)
Ingest Objects¶
These points are sent in upsert batches.
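The batching can be sketched as follows; the point shape and batch size are illustrative, and the actual example builds qdrant_client `models.PointStruct` objects rather than plain dicts:

```python
# Hedged sketch of upsert batching; plain dicts stand in for
# qdrant_client's models.PointStruct(id=..., vector=...).
def make_batches(ids, vectors, batch_size):
    points = [{"id": i, "vector": v} for i, v in zip(ids, vectors)]
    for start in range(0, len(points), batch_size):
        yield points[start:start + batch_size]

# Each batch would then be sent with (requires a running Qdrant server):
# client.upsert(collection_name=collection_name, points=batch)
```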
Milvus¶
Milvus creates an explicit collection schema before building an HNSW index.
Collection Schema¶
schema = CollectionSchema(
fields=[
FieldSchema(
name="id",
dtype=DataType.INT64,
is_primary=True,
auto_id=False,
),
FieldSchema(
name="vector",
dtype=DataType.FLOAT_VECTOR,
dim=dim,
),
],
description="Vector benchmark collection",
)
Milvus Index Build Call¶
index_params = {
"index_type": "HNSW",
"metric_type": "COSINE",
"params": {
"M": int(max_connections),
"efConstruction": int(beam_width),
},
}
collection.create_index(field_name="vector", index_params=index_params)
Milvus Ingest Call¶
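The ingest call is not reproduced on this page. With the classic pymilvus ORM, inserts are column-ordered to match the schema; a hedged sketch (sizes are illustrative, and the flush call is the usual follow-up rather than confirmed benchmark code) is:

```python
# Hedged sketch: column order must match the CollectionSchema above
# (id first, then vector); sizes are illustrative.
ids = list(range(4))
vectors = [[0.0] * 8 for _ in ids]

entities = [ids, vectors]
# collection.insert(entities)  # requires a running Milvus server
# collection.flush()           # persist rows before building the index
```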
Notes¶
- Client-server backends report combined client and server RSS.
- ArcadeDB exposes extra build knobs such as quantization, store_vectors_in_graph, and add_hierarchy.
- LanceDB may fall back from HNSW to IVF_HNSW_SQ depending on what the installed version supports.