Vector Search - Semantic Similarity

Overview

This example demonstrates semantic similarity search using vector embeddings and HNSW (JVector) indexing. It covers:

  • Storing 384-dimensional vector embeddings
  • Creating HNSW (JVector) indexes for fast nearest-neighbor search
  • Performing semantic similarity searches
  • Understanding vector index parameters

Key Steps

1. Schema Definition

Create a vertex type with an embedding property:

db.command("sql", "CREATE VERTEX TYPE Article")
db.command("sql", "CREATE PROPERTY Article.title STRING")
db.command("sql", "CREATE PROPERTY Article.embedding ARRAY_OF_FLOATS")
db.command("sql", "CREATE PROPERTY Article.id INTEGER")
db.command("sql", "CREATE INDEX ON Article (id) UNIQUE")

Vector properties must use the ARRAY_OF_FLOATS type.

2. Generating Embeddings

The example generates 10,000 mock documents with 384-dimensional embeddings:

# Mock embedding generation (in production, use real models)
import numpy as np

def create_mock_embedding(category_seed, doc_seed, dim=384):
    # Note: Python's str hash is salted per run (PYTHONHASHSEED),
    # so seeds are stable within a run but not across runs.
    rng = np.random.RandomState(hash(category_seed + doc_seed) % 2**32)
    # Shared per-category base vector keeps same-category docs close;
    # small per-document noise makes each embedding distinct
    category_vector = np.random.RandomState(hash(category_seed) % 2**32).randn(dim)
    noise = rng.randn(dim) * 0.1
    embedding = (category_vector + noise) / np.linalg.norm(category_vector + noise)
    return embedding.astype(np.float32)

Documents in the same category have embeddings that are closer together.
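As a quick sanity check (a standalone sketch, not part of the example script; the helper below mirrors the mock scheme described above and is an assumption), cosine similarity between unit vectors is a plain dot product, and same-category embeddings score much higher than cross-category ones:

```python
import numpy as np

def mock_embedding(category_seed, doc_seed, dim=384):
    # Shared per-category base vector plus small per-document noise
    base = np.random.RandomState(hash(category_seed) % 2**32).randn(dim)
    noise = np.random.RandomState(hash(category_seed + doc_seed) % 2**32).randn(dim) * 0.1
    v = base + noise
    return (v / np.linalg.norm(v)).astype(np.float32)

same_a = mock_embedding("category_1", "doc_a")
same_b = mock_embedding("category_1", "doc_b")
other = mock_embedding("category_2", "doc_a")

# For unit vectors, cosine similarity is just a dot product
print(float(same_a @ same_b))  # high: same category
print(float(same_a @ other))   # near zero: unrelated categories
```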

3. Inserting Data

Insert documents with embeddings in transactions:

with db.transaction():
    for doc in documents:
        db.command(
            "sql",
            "INSERT INTO Article SET title = ?, embedding = ?",
            doc["title"],
            arcadedb.to_java_float_array(doc["embedding"]),
        )
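With 10,000 documents, it can help to commit in batches rather than in one huge transaction. A minimal sketch of the chunking (the batched helper and the batch size of 1,000 are assumptions, not part of the bindings); each chunk would get its own db.transaction() block around the INSERT shown above:

```python
def batched(items, size):
    # Yield successive fixed-size slices of a list
    for i in range(0, len(items), size):
        yield items[i:i + size]

docs = [{"title": f"Article {i}"} for i in range(10_000)]
chunks = list(batched(docs, 1_000))
print(len(chunks), len(chunks[0]))  # → 10 1000
```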

4. Creating Vector Index

Create a JVector index for similarity search:

index = db.create_vector_index(
    vertex_type="Article",
    vector_property="embedding",
    dimensions=384,
    distance_function="cosine"
)

Parameters:

  • dimensions: must match the embedding model's output size
  • distance_function: cosine (for normalized vectors), euclidean, or inner_product
  • build_graph_now: defaults to True (eager graph preparation); set to False to defer preparation to the first query
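A quick way to see why cosine is the recommended choice for normalized embeddings (a standalone sketch, not ArcadeDB code): for unit vectors, squared euclidean distance is a monotonic function of cosine similarity, so the two metrics rank neighbors identically.

```python
import numpy as np

rng = np.random.RandomState(0)
u = rng.randn(384)
v = rng.randn(384)
u /= np.linalg.norm(u)  # normalize to unit length
v /= np.linalg.norm(v)

cos_sim = float(u @ v)
eucl_sq = float(np.sum((u - v) ** 2))
# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
print(abs(eucl_sq - (2 - 2 * cos_sim)) < 1e-9)  # → True
```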

5. Similarity Search

Find the k most similar documents to a query embedding:

query_embedding = create_mock_embedding(category, "query")
qvec_literal = "[" + ", ".join(str(float(x)) for x in query_embedding.tolist()) + "]"
rows = db.query(
    "sql",
    f"SELECT vectorNeighbors('Article[embedding]', {qvec_literal}, 5) as res",
).to_list()

for hit in rows[0].get("res", []):
    vertex = hit.get("record")
    distance = hit.get("distance")
    if vertex is not None:
        print(f"{vertex.get('title')}: {distance:.4f}")

Each hit returned by vectorNeighbors() contains the matched vertex under record and its distance; hits are sorted by ascending distance, so the most similar documents come first.
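The inline vector literal built above can be factored into a small helper (to_sql_vector is a hypothetical convenience function, not part of the bindings; it performs the same join/format shown in the query example):

```python
import numpy as np

def to_sql_vector(embedding):
    # Render a float vector as the SQL array literal vectorNeighbors() expects
    return "[" + ", ".join(str(float(x)) for x in np.asarray(embedding).tolist()) + "]"

print(to_sql_vector(np.array([0.25, -1.0, 0.5], dtype=np.float32)))
# → [0.25, -1.0, 0.5]
```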

Example Output

Step 5: Creating vector index...
   💡 JVector Parameters:
      • dimensions: 384 (matches embedding size)
      • distance_function: cosine (best for normalized vectors)
      • max_connections: 16 (connections per node, higher = more accurate but slower)
      • beam_width: 100 (search quality, higher = more accurate)
   ✅ Created JVector vector index

Step 6: Performing semantic similarity searches...
   Running 10 queries on randomly sampled categories...

   🔍 Query 1: Find documents similar to Category 42
      Top 5 MOST similar documents (smallest distance):
      1. Article 67 about category_42
         Category: category_42, Distance: 0.7634
      2. Article 12 about category_42
         Category: category_42, Distance: 0.7698

Running the Example

cd bindings/python/examples
python 03_vector_search.py

Database files will be created in ./my_test_databases/vector_search_db/