Vector API¶
Vector search capabilities in ArcadeDB use JVector (a graph-based index combining HNSW and DiskANN concepts) for fast approximate nearest neighbor search. Perfect for semantic search, recommendation systems, and similarity-based queries.
Overview¶
ArcadeDB's vector support enables:
- Semantic Search: Find similar documents, images, or any embedded content
- Recommendation Systems: Find similar items based on feature vectors
- Clustering: Group similar vectors together
- Anomaly Detection: Find outliers in vector space
Key Features:
- Graph-based indexing for O(log N) search performance
- Multiple distance metrics (cosine, euclidean, inner product)
- Native NumPy integration (optional)
- Configurable precision/performance trade-offs
Module Functions¶
Utility functions for converting between Python and Java vector representations:
to_java_float_array(vector)¶
Convert a Python array-like object to a Java float array compatible with ArcadeDB's vector indexing.
Parameters:
vector: Array-like object containing float values- Python list:
[0.1, 0.2, 0.3] - NumPy array:
np.array([0.1, 0.2, 0.3]) - Any iterable:
(0.1, 0.2, 0.3)
- Python list:
Returns:
- Java float array (
JArray<JFloat>)
Example:
import arcadedb_embedded as arcadedb
from arcadedb_embedded import to_java_float_array
import numpy as np
# From Python list
vec_list = [0.1, 0.2, 0.3, 0.4]
java_vec = to_java_float_array(vec_list)
# From NumPy array (if NumPy installed)
vec_np = np.array([0.5, 0.6, 0.7, 0.8], dtype=np.float32)
java_vec = to_java_float_array(vec_np)
# Use with SQL parameter binding
with db.transaction():
db.command(
"sql",
"INSERT INTO Document SET embedding = ?",
java_vec,
)
to_python_array(java_vector, use_numpy=True)¶
Convert a Java array or ArrayList to a Python array.
Parameters:
java_vector: Java array or ArrayList of floatsuse_numpy(bool): Return NumPy array if available (default:True)- If
Trueand NumPy is installed: returnsnp.ndarray - If
Falseor NumPy unavailable: returns Pythonlist
- If
Returns:
np.ndarray(ifuse_numpy=Trueand NumPy available)list(otherwise)
Example:
from arcadedb_embedded import to_python_array
# Get vector from vertex
vertex = result_set.next()
java_vec = vertex.get("embedding")
# Convert to NumPy array
np_vec = to_python_array(java_vec, use_numpy=True)
print(type(np_vec)) # <class 'numpy.ndarray'>
# Convert to Python list
py_list = to_python_array(java_vec, use_numpy=False)
print(type(py_list)) # <class 'list'>
VectorIndex Class¶
Wrapper for ArcadeDB's vector index, providing similarity search capabilities.
For creation, prefer SQL CREATE INDEX ... LSM_VECTOR METADATA {...}. For search,
prefer SQL or Cypher when you need filtering, projection, self-exclusion, or custom
score shaping. The Python methods below are convenience helpers for simple
embedded-mode workflows and for advanced/manual control. They are not the primary
application-facing workflow this documentation recommends.
For SQL snippets in this documentation, use vectorNeighbors(...) by default. The
engine also exposes an equivalent dotted canonical function name, but the alias is more
ergonomic in SQL because it does not require backticks.
Preferred Creation: SQL¶
Use SQL as the default creation surface:
db.command(
"sql",
"""
CREATE INDEX ON Document (embedding)
LSM_VECTOR
METADATA {
"dimensions": 384,
"similarity": "COSINE"
}
"""
)
SQL builds the vector graph immediately by default. Add "buildGraphNow": false only
if you intentionally want lazy preparation.
Creation via Database¶
Database.create_vector_index() still exists, but it should be treated as a secondary,
Python-driven helper for manual setup, tests, and API completeness rather than the
primary documented workflow.
Signature:
db.create_vector_index(
vertex_type: str,
vector_property: str,
dimensions: int,
id_property: str | None = None,
distance_function: str = "cosine",
max_connections: int = 16,
beam_width: int = 100,
quantization: str = "INT8",
location_cache_size: int | None = None,
graph_build_cache_size: int | None = None,
mutations_before_rebuild: int | None = None,
store_vectors_in_graph: bool = False,
add_hierarchy: bool | None = True,
pq_subspaces: int | None = None,
pq_clusters: int | None = None,
pq_center_globally: bool | None = None,
pq_training_limit: int | None = None,
build_graph_now: bool = True,
) -> VectorIndex
Parameters:
vertex_type(str): Vertex type containing vectorsvector_property(str): Property name storing vector arraysdimensions(int): Vector dimensionality (must match your embeddings)id_property(str | None): Optional property used for key-based lookup withfind_nearest_by_key(). Defaults to the engine default ("id") when omitted.distance_function(str): Distance metric (default:"cosine")"cosine": Cosine distance (1 - cosine similarity)"euclidean": Euclidean distance (L2 norm)"inner_product": Negative inner product
max_connections(int): Max connections per node (default: 16)- Maps to
maxConnectionsin JVector - Higher = better recall, more memory
- Typical range: 8-64
- Maps to
beam_width(int): Beam width for search/construction (default: 100)- Maps to
beamWidthin JVector - Higher = better recall, slower search
- Typical range: 50-500
- Maps to
quantization(str | None):"INT8"(recommended),"BINARY","PRODUCT", orNone(default:"INT8")- In current ArcadeDB engine builds,
"PRODUCT"also requires enough indexed vectors per bucket for PQ training. For tiny corpora, setpq_clustersexplicitly to a small value or prefer"INT8","BINARY", orNone. - Prefer
"INT8"for current production usage in these bindings. "PRODUCT"/PQ is available but currently not recommended for production workloads.
- In current ArcadeDB engine builds,
build_graph_now(bool): IfTrue(default), eagerly prepares the vector graph during index creation. Set toFalseto defer graph preparation until first query.
Returns:
VectorIndex: Index object for searching
Example:
import arcadedb_embedded as arcadedb
# Create database and schema
db = arcadedb.create_database("./vector_db")
db.command("sql", "CREATE VERTEX TYPE Document")
db.command("sql", "CREATE PROPERTY Document.id STRING")
db.command("sql", "CREATE PROPERTY Document.text STRING")
db.command("sql", "CREATE PROPERTY Document.embedding ARRAY_OF_FLOATS")
db.command("sql", "CREATE INDEX ON Document (id) UNIQUE")
# Secondary option: create vector index from Python
index = db.create_vector_index(
vertex_type="Document",
vector_property="embedding",
dimensions=384, # Match your embedding model
id_property="id",
distance_function="cosine",
max_connections=16,
beam_width=100
)
print(f"Created vector index: {index}")
VectorIndex.find_nearest(query_vector, k=10, ef_search=None, allowed_rids=None)¶
Find k-nearest neighbors to the query vector.
Treat this as a helper/manual API. For normal application queries, prefer SQL
vectorNeighbors so search composes naturally with filtering, projection, and record
exclusion.
Note: With default settings (build_graph_now=True in create_vector_index), graph
preparation runs during index creation. In the preferred SQL path, this eager behavior is
also the default. If you explicitly disable eager preparation, the first call to
find_nearest may perform lazy graph preparation and therefore take longer.
Parameters:
query_vector: Query vector as:- Python list:
[0.1, 0.2, ...] - NumPy array:
np.array([0.1, 0.2, ...]) - Any array-like iterable
- Python list:
k(int): Number of neighbors to return (default: 10)ef_search(int | None): Optional exact-search beam width override.Noneuses ArcadeDB's default/adaptive search behavior.allowed_rids(List[str]): Optional list of RID strings (e.g.["#1:0", "#2:5"]) to restrict search (default:None)
Returns:
List[Tuple[record, float]]: List of(record, distance)tuplesrecord: Matched ArcadeDB record object (Vertex, Document, or Edge)distance: Similarity score (float)- Cosine: lower = more similar
- Euclidean: higher = more similar
- Inner product: lower = more similar (negative dot product)
- Range depends on
distance_function
Example:
# Generate query vector (in practice, from your embedding model)
query_text = "machine learning tutorial"
query_vector = generate_embedding(query_text) # Your embedding function
# Search for 5 most similar documents
neighbors = index.find_nearest(query_vector, k=5)
# Search with RID filtering
allowed_rids = ["#10:5", "#10:8", "#10:12"]
filtered_neighbors = index.find_nearest(query_vector, k=5, allowed_rids=allowed_rids)
for record, distance in neighbors:
doc_id = record.get("id")
text = record.get("text")
print(f"Distance: {distance:.4f} | ID: {doc_id}")
print(f" Text: {text[:100]}...")
Preferred for richer query behavior:
qvec_literal = "[" + ", ".join(str(float(x)) for x in query_vector.tolist()) + "]"
rows = db.query(
"sql",
(
"SELECT id, distance, (1 - distance) AS score "
"FROM (SELECT expand(vectorNeighbors('Document[embedding]', "
f"{qvec_literal}, 10))) WHERE id <> ? ORDER BY distance LIMIT 5"
),
"doc-42",
).to_list()
Distance Interpretation:
| Function | Range | Similarity direction |
|---|---|---|
| cosine | [0, 2] | lower is better (0 = identical) |
| euclidean | (0, 1] | higher is better (1 = identical) |
| inner_product | (-∞, ∞) | lower is better (negative dot product) |
- Vertex must have the vector property populated
- Vector dimensionality must match index dimensions
- Call within a transaction for consistency
VectorIndex.find_nearest_by_key(key, k=10, ef_search=None, allowed_rids=None)¶
Find nearest neighbors by reusing the vector stored on an existing record.
This is the Python wrapper for the common "search from an existing record" workflow,
using the index's configured id_property to look up the source vector first.
Treat this as a convenience helper. If you need the recommended query surface, use SQL or Cypher instead.
Parameters:
key: Value of the configured ID propertyk(int): Number of neighbors to return (default: 10)ef_search(int | None): Optional exact-search beam width overrideallowed_rids(List[str] | None): Optional RID whitelist to restrict search
Returns:
List[Tuple[record, float]]: Same shape asfind_nearest()
Example:
neighbors = index.find_nearest_by_key("doc-42", k=5)
for record, distance in neighbors:
print(record.get("id"), distance)
The helper keeps current nearest-neighbor semantics, so the source record may also be
returned. If you want to exclude it, do that in SQL/Cypher with a WHERE clause.
VectorIndex.get_metadata()¶
Return stable vector index metadata as a Python dictionary.
Returns:
dictwith keys such as:index_namebucket_index_nametype_namevector_propertydimensionssimilarity_functionid_propertyquantizationmax_connectionsbeam_widthstore_vectors_in_graphbuild_state
Example:
meta = index.get_metadata()
print(meta["dimensions"], meta["similarity_function"], meta["id_property"])
VectorIndex.build_graph_now()¶
Force an immediate rebuild/preparation of the vector graph.
This is a maintenance API, not part of the normal SQL-first creation workflow.
Use this when you want to control when rebuild cost is paid, for example:
- after bulk inserts,
- after bulk deletes/removals,
- before opening traffic after large vector mutations.
This is especially useful if you created the index with build_graph_now=False or with
SQL metadata "buildGraphNow": false and want to avoid rebuild work on the first query.
When you create an LSM_VECTOR index through SQL, ArcadeDB now builds the graph
immediately by default. Use build_graph_now() only for explicit maintenance or when
you intentionally deferred graph preparation.
Returns:
None
Example:
Complete Examples¶
Semantic Search with Sentence Transformers¶
import arcadedb_embedded as arcadedb
from arcadedb_embedded import to_java_float_array
from sentence_transformers import SentenceTransformer
import numpy as np
# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2') # 384 dimensions
# Create database and schema
db = arcadedb.create_database("./semantic_search")
db.command("sql", "CREATE VERTEX TYPE Document")
db.command("sql", "CREATE PROPERTY Document.id STRING")
db.command("sql", "CREATE PROPERTY Document.title STRING")
db.command("sql", "CREATE PROPERTY Document.content STRING")
db.command("sql", "CREATE PROPERTY Document.embedding ARRAY_OF_FLOATS")
db.command("sql", "CREATE INDEX ON Document (id) UNIQUE")
# Preferred: create vector index in SQL
db.command(
"sql",
"""
CREATE INDEX ON Document (embedding)
LSM_VECTOR
METADATA {
"dimensions": 384,
"similarity": "COSINE",
"maxConnections": 16,
"beamWidth": 100
}
"""
)
index = db.schema.get_vector_index("Document", "embedding")
# Sample documents
documents = [
{"id": "doc1", "title": "Python Tutorial",
"content": "Learn Python programming basics"},
{"id": "doc2", "title": "Machine Learning Guide",
"content": "Introduction to ML algorithms"},
{"id": "doc3", "title": "Database Systems",
"content": "Understanding relational databases"},
]
# Index documents
print("Indexing documents...")
with db.transaction():
for doc in documents:
# Generate embedding
text = f"{doc['title']} {doc['content']}"
embedding = model.encode(text)
db.command(
"sql",
"INSERT INTO Document SET id = ?, title = ?, content = ?, embedding = ?",
doc["id"],
doc["title"],
doc["content"],
to_java_float_array(embedding),
)
print(f"Indexed {len(documents)} documents")
# Search
query = "How to learn programming"
query_embedding = model.encode(query)
print(f"\nQuery: '{query}'")
results = index.find_nearest(query_embedding, k=3)
for vertex, distance in results:
print(f"\nDistance: {distance:.4f}")
print(f"Title: {vertex.get('title')}")
print(f"Content: {vertex.get('content')}")
db.close()
Hybrid Search (Vector + Filters)¶
Combine vector similarity with property filters using SQL:
import arcadedb_embedded as arcadedb
from arcadedb_embedded import to_java_float_array
import numpy as np
db = arcadedb.open_database("./products_db")
# Create schema
db.command("sql", "CREATE VERTEX TYPE Product")
db.command("sql", "CREATE PROPERTY Product.id STRING")
db.command("sql", "CREATE PROPERTY Product.name STRING")
db.command("sql", "CREATE PROPERTY Product.category STRING")
db.command("sql", "CREATE PROPERTY Product.price DECIMAL")
db.command("sql", "CREATE PROPERTY Product.features ARRAY_OF_FLOATS")
db.command("sql", "CREATE INDEX ON Product (category) NOTUNIQUE")
# Create vector index in SQL
db.command(
"sql",
"""
CREATE INDEX ON Product (features)
LSM_VECTOR
METADATA {
"dimensions": 128,
"similarity": "COSINE"
}
"""
)
index = db.schema.get_vector_index("Product", "features")
# Add products with feature vectors
products = [
{"id": "p1", "name": "Laptop", "category": "Electronics",
"price": 999.99, "features": np.random.rand(128)},
{"id": "p2", "name": "Mouse", "category": "Electronics",
"price": 29.99, "features": np.random.rand(128)},
{"id": "p3", "name": "Desk", "category": "Furniture",
"price": 299.99, "features": np.random.rand(128)},
]
with db.transaction():
for prod in products:
db.command(
"sql",
"INSERT INTO Product SET id = ?, name = ?, category = ?, price = ?, features = ?",
prod["id"],
prod["name"],
prod["category"],
prod["price"],
to_java_float_array(prod["features"]),
)
# Note: LSM vector index automatically indexes new records
# Hybrid search: vector similarity + filters
query_features = np.random.rand(128)
candidates = index.find_nearest(query_features, k=100) # Get many candidates
# Filter by category and price
filtered_results = []
for vertex, distance in candidates:
category = vertex.get("category")
price = float(vertex.get("price"))
if category == "Electronics" and price < 500:
filtered_results.append((vertex, distance))
if len(filtered_results) >= 5: # Want top 5 after filtering
break
print("Filtered Results:")
for vertex, distance in filtered_results:
print(f"{vertex.get('name')} - ${vertex.get('price')} - {distance:.4f}")
db.close()
Image Similarity Search¶
import arcadedb_embedded as arcadedb
from arcadedb_embedded import to_java_float_array
from PIL import Image
import numpy as np
# Assuming you have a function to generate image embeddings
def get_image_embedding(image_path):
"""
Generate embedding for image using your model
(e.g., ResNet, CLIP, etc.)
"""
# Placeholder - use your actual embedding model
return np.random.rand(512) # Example: 512-dim embedding
db = arcadedb.create_database("./image_search")
# Schema
db.command("sql", "CREATE VERTEX TYPE Image")
db.command("sql", "CREATE PROPERTY Image.id STRING")
db.command("sql", "CREATE PROPERTY Image.filename STRING")
db.command("sql", "CREATE PROPERTY Image.path STRING")
db.command("sql", "CREATE PROPERTY Image.embedding ARRAY_OF_FLOATS")
# Create index in SQL
db.command(
"sql",
"""
CREATE INDEX ON Image (embedding)
LSM_VECTOR
METADATA {
"dimensions": 512,
"similarity": "COSINE",
"maxConnections": 24,
"beamWidth": 200
}
"""
)
index = db.schema.get_vector_index("Image", "embedding")
# Index images
image_files = ["img1.jpg", "img2.jpg", "img3.jpg"]
with db.transaction():
for idx, img_file in enumerate(image_files):
embedding = get_image_embedding(img_file)
v = db.new_vertex("Image")
v.set("id", f"img_{idx}")
v.set("filename", img_file)
v.set("path", f"/images/{img_file}")
v.set("embedding", to_java_float_array(embedding))
v.save()
# Note: LSM vector index automatically indexes new records
# Search for similar images
query_image = "query.jpg"
query_embedding = get_image_embedding(query_image)
similar_images = index.find_nearest(query_embedding, k=5)
print(f"Similar images to {query_image}:")
for vertex, distance in similar_images:
print(f" {vertex.get('filename')} - similarity: {1 - distance:.4f}")
db.close()
Performance Tuning¶
Vector Index Parameters¶
max_connections (connections per node):
- Lower (12): Faster build, less memory, lower recall
- Medium (16): Balanced (default)
- Higher (32): Better recall, more memory, slower build
ef_search (exact search beam width):
- Unset (
None): Use ArcadeDB's default/adaptive behavior - Lower (32): Faster search, lower recall
- Medium (100): Balanced explicit override
- Higher (200): Better recall, slower search
beam_width:
- Lower (64): Faster build, lower quality
- Medium (100): Balanced (default)
- Higher (200): Better quality, slower build
Distance Functions¶
Cosine Distance:
- Best for: Text embeddings, normalized vectors
- Range: [0, 2], lower is better
- Use when: Direction matters more than magnitude
Euclidean Distance:
- Best for: Image embeddings, spatial data
- Range: [0, ∞), lower is better
- Use when: Absolute distance matters
Inner Product:
- Best for: Collaborative filtering, when vectors aren't normalized
- Range: (-∞, ∞), higher is better (note: inverted!)
- Use when: Magnitude information is important
Memory Considerations¶
Approximate memory per vertex:
Example for 384-dim vectors with max_connections=16:
For 1 million vectors: ~1.8 GB RAM
Error Handling¶
from arcadedb_embedded import ArcadeDBError, to_java_float_array
import numpy as np
try:
# Dimension mismatch
db.command(
"sql",
'CREATE INDEX ON Doc (emb) LSM_VECTOR METADATA {"dimensions": 384}',
)
v = db.new_vertex("Doc")
v.set("emb", to_java_float_array(np.random.rand(512))) # Wrong size!
v.save()
# Indexing happens automatically and may fail asynchronously or on next access
except ArcadeDBError as e:
print(f"Error: {e}")
# Handle dimension mismatch
try:
# Missing vector property
v = db.new_vertex("Doc")
v.set("id", "doc1")
# Forgot to set embedding!
v.save()
# Indexing happens automatically
except ArcadeDBError as e:
print(f"Error: {e}")
# Handle missing property
See Also¶
- Vector Search Guide - Comprehensive vector search strategies
- Vector Examples - More practical examples
- Database API - Database operations
- Query Guide - Combining vectors with queries