Vector Search Tests (JVector / LSM)

The test suite exercises the current Java-native JVector + LSM vector index used by ArcadeDB (no Python hnswlib dependency). All tests run through the Python bindings.

Overview

What the tests cover:

✅ HNSW (JVector)/LSM index creation via create_vector_index
✅ Nearest-neighbor search with find_nearest
✅ RID filtering using allowed_rids
✅ Overquery factor tuning (overquery_factor)
✅ Distance functions (cosine default, euclidean variants)
✅ Persistence & size checks (index files survive reopen)
✅ Chunked inserts via explicit transactions (preferred for embedded)

Test Coverage (high level)

test_create_vector_index – creates an HNSW (JVector)/LSM index and verifies that it appears in the schema listing
test_lsm_vector_search – runs a basic nearest-neighbor search
test_lsm_vector_search_with_filter – restricts results with allowed_rids
test_lsm_vector_delete_and_search_others – deletes vertices and checks that the remaining ones are still found
test_lsm_vector_search_overquery – adjusts overquery_factor
test_get_vector_index_lsm – fetches index metadata
test_lsm_index_size – asserts index file presence and size
test_lsm_persistence – reopens the DB and reuses the index
Distance suites – cosine/euclidean correctness for orthogonal/parallel/opposite/high-dim vectors
test_lsm_vector_search_comprehensive – end-to-end search path
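
The distance suites compare vectors in a few canonical arrangements. As a rough illustration of what such checks look for, here is a minimal pure-Python sketch using the textbook definitions (cosine distance as 1 minus cosine similarity, and plain Euclidean distance). The helper names are hypothetical, and the values the index actually reports may be scaled or normalized differently, so treat the numbers as reference points rather than expected index output.

```python
import math

def cosine_distance(a, b):
    """Textbook cosine distance: 1 - cosine similarity (an assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Plain (non-squared) Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

base = [1.0, 0.0, 0.0]
orthogonal = [0.0, 1.0, 0.0]   # 90 degrees from base
parallel = [2.0, 0.0, 0.0]     # same direction, different length
opposite = [-1.0, 0.0, 0.0]    # 180 degrees from base
high_dim = [1.0] * 384         # compared against itself below

for name, other in [("orthogonal", orthogonal), ("parallel", parallel), ("opposite", opposite)]:
    print(name, cosine_distance(base, other), euclidean_distance(base, other))
```

Under these definitions, cosine distance ignores vector length (the parallel pair is at distance 0), while Euclidean distance does not (the same pair is at distance 1).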

SQL Vector Functions Tests

SQL vector operations are tested separately in test_vector_sql.py, including vector math functions, distance calculations, aggregations, quantization (with known limitations), and SQL-based index creation and search.

Common Patterns

Create JVector (LSM-backed) index

# Assumed import alias for the ArcadeDB Python bindings
import arcadedb_embedded as arcadedb

with arcadedb.create_database("./test_db") as db:
    db.command("sql", "CREATE VERTEX TYPE Doc")
    db.command("sql", "CREATE PROPERTY Doc.embedding ARRAY_OF_FLOATS")

    index = db.create_vector_index(
        "Doc",
        "embedding",
        dimensions=384,
        distance_function="cosine",   # default
        max_connections=16,            # graph degree (default)
        beam_width=100                 # search/construction beam (default)
    )

Search with filters and overquery factor

# Assumed import alias for the ArcadeDB Python bindings
import arcadedb_embedded as arcadedb

with arcadedb.create_database("./test_db") as db:
    db.command("sql", "CREATE VERTEX TYPE Doc")
    db.command("sql", "CREATE PROPERTY Doc.embedding ARRAY_OF_FLOATS")

    index = db.create_vector_index(
        "Doc",
        "embedding",
        dimensions=3,
    )

    # Insert test vertices with embeddings
    with db.transaction():
        doc1 = db.new_vertex("Doc", docId=1, embedding=[1.0, 0.0, 0.0])
        doc1.save()
        doc2 = db.new_vertex("Doc", docId=2, embedding=[0.0, 1.0, 0.0])
        doc2.save()

    # Restrict the search to the listed RIDs and over-fetch candidates
    query = [1.0, 0.0, 0.0]
    results = index.find_nearest(
        query,
        k=2,
        allowed_rids=[doc1.get_rid(), doc2.get_rid()],
        overquery_factor=4,
    )

Chunked insert vectors (preferred)

# Assumed import alias for the ArcadeDB Python bindings
import arcadedb_embedded as arcadedb

with arcadedb.create_database("./test_db") as db:
    db.command("sql", "CREATE VERTEX TYPE Doc")
    db.command("sql", "CREATE PROPERTY Doc.docId INTEGER")
    db.command("sql", "CREATE PROPERTY Doc.embedding ARRAY_OF_FLOATS")

    # Prefer chunked transactions for embedded (avoids batch_context overhead)
    vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    chunk_size = 100
    for start in range(0, len(vectors), chunk_size):
        with db.transaction():
            for idx, vec in enumerate(vectors[start : start + chunk_size]):
                doc = db.new_vertex("Doc")
                doc.set("docId", start + idx)
                doc.set("embedding", vec)
                doc.save()
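
The slicing loop above needs a list. When embeddings arrive from a generator or another stream, the same chunking idea can be factored into a small helper. The sketch below is pure Python and does not touch the database; `chunked` is a hypothetical name, not part of the bindings.

```python
from itertools import islice

def chunked(items, chunk_size):
    """Yield (start_index, chunk) pairs from any iterable, including generators."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(items)
    start = 0
    while True:
        chunk = list(islice(it, chunk_size))  # take up to chunk_size items
        if not chunk:
            return
        yield start, chunk
        start += len(chunk)

vectors = ([float(i), 0.0, 0.0] for i in range(5))  # a generator, not a list
for start, chunk in chunked(vectors, 2):
    # In the insert loop, each chunk would be written inside one db.transaction()
    print(start, len(chunk))
```

Each yielded pair maps onto one transaction: `start` plays the role of the offset used for docId, and `chunk` holds the vectors to save in that transaction.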

Key Takeaways

JVector is fully Java-native and LSM-backed; no legacy hnswlib path remains.
Use allowed_rids for pre-filtered searches and overquery_factor for recall/speed trade-offs.
max_connections and beam_width map to JVector graph degree and search beam; tune per workload.
Prefer chunked db.transaction() inserts for embedded workloads; reserve batch_context for legacy code or tests that explicitly need it.
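
To make the recall/speed trade-off behind overquery_factor concrete, here is a toy, pure-Python sketch. It is not how JVector searches: the "approximate" stage simply ranks by the first coordinate as a stand-in for any cheap, lossy scoring, and every name (`find_nearest_sketch`, `allowed_ids`, `squared_l2`) is hypothetical. It also assumes the allowed set is applied before candidates are collected, in line with the "pre-filtered" wording above.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def find_nearest_sketch(vectors, query, k, allowed_ids=None, overquery_factor=1):
    # 1. Pre-filter: keep only ids in the allowed set (if one is given).
    ids = [i for i in vectors if allowed_ids is None or i in allowed_ids]
    # 2. Approximate stage (toy stand-in): rank by the first coordinate only,
    #    then keep k * overquery_factor candidates.
    coarse = sorted(ids, key=lambda i: squared_l2(vectors[i][:1], query[:1]))
    candidates = coarse[: k * overquery_factor]
    # 3. Exact rerank of the candidates, then truncate to k.
    reranked = sorted(candidates, key=lambda i: squared_l2(vectors[i], query))
    return reranked[:k]

vectors = {"a": [1.0, 0.0], "b": [1.0, 5.0], "c": [0.9, 0.1], "d": [0.0, 0.0]}
query = [1.0, 0.0]

print(find_nearest_sketch(vectors, query, k=2))                      # ['a', 'b']
print(find_nearest_sketch(vectors, query, k=2, overquery_factor=2))  # ['a', 'c']
```

With a factor of 1 the coarse stage lets "b" through and the true second neighbor "c" is lost; doubling the candidate pool recovers it at the cost of more exact distance computations, which is the trade-off a larger overquery_factor buys.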