# Vector Search Guide
Bring your own embeddings (OpenAI, HF, local models). This guide focuses on how the Python bindings build and query vector indexes (JVector) in embedded mode.
## Quick Start (Embedded, Minimal)
```python
import arcadedb_embedded as arcadedb
from arcadedb_embedded import to_java_float_array

texts = ["python database", "graph queries", "vector search"]

with arcadedb.create_database("./vector_demo") as db:
    db.schema.create_vertex_type("Doc")
    db.schema.create_property("Doc", "text", "STRING")
    db.schema.create_property("Doc", "embedding", "ARRAY_OF_FLOATS")

    index = db.create_vector_index(
        vertex_type="Doc",
        vector_property="embedding",
        dimensions=3,                # must match your embedding size
        distance_function="cosine",  # default: cosine
        max_connections=16,          # corresponds to M in HNSW (default)
        beam_width=100,              # corresponds to efConstruction in HNSW (default)
    )

    with db.transaction():
        for i, t in enumerate(texts):
            embedding = [float(i == j) for j in range(3)]  # toy one-hot vectors
            v = db.new_vertex("Doc")
            v.set("text", t)
            v.set("embedding", to_java_float_array(embedding))
            v.save()

    results = index.find_nearest([0.9, 0.1, 0.0], k=2)
    for vertex, score in results:
        print(vertex.get("text"), score)
```
## API Essentials
- Vector property type must be `ARRAY_OF_FLOATS`.
- `create_vector_index(vertex_type, vector_property, dimensions, distance_function="cosine", max_connections=16, beam_width=100, quantization="INT8", location_cache_size=None, graph_build_cache_size=None, mutations_before_rebuild=None, store_vectors_in_graph=False, add_hierarchy=True, pq_subspaces=None, pq_clusters=None, pq_center_globally=None, pq_training_limit=None)`
- `find_nearest(query_vector, k=10, overquery_factor=4, allowed_rids=None)`
- `overquery_factor` multiplies `k` during search to improve recall.
- `allowed_rids` filters candidates server-side (useful for metadata prefiltering).
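For a recall-sensitive query, you can oversample internally and still get back only `k` results. A minimal sketch, assuming an `index` and `query_vec` like those in the Quick Start:

```python
# Trade latency for recall: the index considers roughly k * overquery_factor
# candidates internally, then returns the best k.
results = index.find_nearest(query_vec, k=10, overquery_factor=8)
```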
## Distance Functions (scoring behavior)
- `cosine` (default): returns a distance in [0, 1]; lower is better.
- `euclidean`: returns the similarity score \(1 / (1 + d^2)\); higher is better.
- `inner_product`: returns the negative dot product; lower is better.
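Because the three functions disagree on score direction, a small helper can normalize "best first" ordering, for example when merging result sets from multiple queries. A sketch based on the rules above; `best_first` is a hypothetical convenience, not part of the bindings:

```python
def best_first(results, distance_function="cosine"):
    # Only euclidean scores are higher-is-better; cosine and inner_product
    # scores sort ascending (lower is better).
    reverse = distance_function == "euclidean"
    return sorted(results, key=lambda pair: pair[1], reverse=reverse)
```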
## Tuning Knobs
- `dimensions`: must match your embedding length.
- `max_connections` (HNSW M): higher → better recall, more memory, slower build (default: 16).
- `beam_width` (ef/efConstruction): higher → better recall, slower search/build (default: 100).
- `overquery_factor` (runtime only): higher → better recall, slower search.
- Note: JVector doesn't expose HNSW's `efSearch`; `overquery_factor` is the Python-side oversampling knob we added to compensate. Set it higher when you need recall.
Suggested presets from tests/examples (k=10):
- Min: `max_connections=12`, `beam_width=64`, `overquery_factor=2`
- Normal (default): `max_connections=16`, `beam_width=100`, `overquery_factor=4`
- Max: `max_connections=32`, `beam_width=200`, `overquery_factor=8`
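Expressed as keyword arguments, the presets are easy to switch between. A sketch assuming an open `db` as in the Quick Start; the preset names and `dimensions=384` are illustrative:

```python
# Build-time knobs per preset; overquery_factor applies at query time.
PRESETS = {
    "min":    {"max_connections": 12, "beam_width": 64},   # overquery_factor=2
    "normal": {"max_connections": 16, "beam_width": 100},  # overquery_factor=4
    "max":    {"max_connections": 32, "beam_width": 200},  # overquery_factor=8
}

index = db.create_vector_index(
    vertex_type="Doc",
    vector_property="embedding",
    dimensions=384,
    **PRESETS["normal"],
)
results = index.find_nearest(query_vec, k=10, overquery_factor=4)
```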
## Memory & Heap Requirements (1024-dim vectors)
Vector index build is the most memory-hungry step. For 1024-dimensional vectors:
| Vectors | Build heap (minimum) | Search heap |
|---------|----------------------|-------------|
| 1M | 4G | 1G works; at least 1G recommended |
| 2M | 8G | 1G works; at least 2G recommended |
| 4M | 16G | 1G works; at least 2G recommended |
| 8M | 32G | 1G OOMs; 2G works; at least 4G recommended |
If you reduce vector dimensions (e.g., 384-dim), you can substantially lower heap requirements.
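As a rough rule of thumb from the table, build heap scales at about 4 GB per million 1024-dim vectors. A sizing sketch; the linear scaling with dimensionality is an assumption, not a measured result, so leave headroom:

```python
import math

def recommended_build_heap_gb(num_vectors: int, dims: int = 1024) -> int:
    # ~4 GB per 1M vectors at 1024 dims (from the table above), scaled
    # linearly with dimensionality -- an assumption, so round up generously.
    per_million_gb = 4.0 * (dims / 1024)
    return max(1, math.ceil(per_million_gb * num_vectors / 1_000_000))
```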
## Generating Embeddings (example)
Use any model you like; the bindings only need a Python list/NumPy array of floats. A typical text workflow uses a Transformer-based embedding model, e.g., sentence-transformers with normalized outputs for cosine:
```python
import arcadedb_embedded as arcadedb
from arcadedb_embedded import to_java_float_array
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dims
doc_text = "retrieval augmented generation"
vec = model.encode(doc_text, normalize_embeddings=True)

with arcadedb.create_database("./vector_demo") as db:
    db.schema.create_vertex_type("Doc")
    db.schema.create_property("Doc", "text", "STRING")
    db.schema.create_property("Doc", "embedding", "ARRAY_OF_FLOATS")

    index = db.create_vector_index(
        vertex_type="Doc",
        vector_property="embedding",
        dimensions=len(vec),
        distance_function="cosine",
    )

    with db.transaction():
        v = db.new_vertex("Doc")
        v.set("text", doc_text)
        v.set("embedding", to_java_float_array(vec))
        v.save()

    hits = index.find_nearest(vec, k=1)
```
Notes:

- `to_java_float_array` accepts NumPy arrays directly.
- For cosine, pass `normalize_embeddings=True` (as above) to your model.
- For euclidean or inner_product, skip normalization if magnitude should matter.
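If your model can't normalize its own outputs, you can normalize manually before storing. A minimal sketch using NumPy:

```python
import numpy as np

def l2_normalize(vec):
    # L2-normalize so cosine distance behaves as expected; returns the
    # input unchanged if its norm is zero.
    v = np.asarray(vec, dtype=np.float32)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```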
## Filtering by RID (prefilter)
```python
# Prefilter by metadata with SQL, then restrict the vector search to those RIDs.
rids = [row.get_rid() for row in db.query("sql", "SELECT @rid FROM Doc WHERE topic = 'ai'")]
results = index.find_nearest(query_vec, k=5, allowed_rids=rids)
```
## Quantization (Experimental)
- `quantization` accepts `"INT8"`, `"BINARY"`, `"PRODUCT"` (PQ), or `None` (full precision).
- Default is `"INT8"`; set `quantization=None` for full-precision vectors.
- PQ tunables (require `quantization="PRODUCT"`): `pq_subspaces` (M), `pq_clusters` (K), `pq_center_globally`, `pq_training_limit`.
- INT8/BINARY have shown instability on larger inserts; PQ is the recommended quantization path for recall/latency trade-offs.
## SQL Helpers (Optional)
- Prefer the Schema API for embedded: `db.schema.create_index("Doc", ["embedding"], index_type="LSM_VECTOR")`
- SQL is also available (useful for remote servers): `CREATE INDEX ON Doc (embedding) LSM_VECTOR METADATA {"dimensions": 128, "distanceFunction": "COSINE"}`
- Search via SQL: `SELECT vectorNeighbors('Doc[embedding]', [0.1,0.2], 5) AS res`
- Or with a type/property signature: `SELECT vectorNeighbors('Doc', 'embedding', [0.1,0.2], 5)`
- Math/distance helpers: `vectorCosineSimilarity`, `vectorL2Distance`, `vectorDotProduct`, `vectorNormalize`, `vectorAdd`, `vectorSum`, etc.
- Quantization via SQL: `METADATA {"quantization": "INT8"}` works for tiny datasets; larger inserts are still unstable (see tests).