Vector Search Tests (JVector / LSM)

The test suite exercises the current Java-native JVector + LSM vector index used by ArcadeDB (no Python hnswlib dependency). All tests run through the Python bindings.

Overview

What the tests cover:

  • JVector (HNSW) / LSM index creation via SQL and the Python helper
  • Nearest-neighbor search via SQL and the embedded helper
  • RID filtering using allowed_rids
  • Exact-search beam tuning (ef_search)
  • Distance functions (cosine default, euclidean variants)
  • Persistence & size checks (index files survive reopen)
  • Chunked inserts via explicit transactions (preferred for embedded)

Test Coverage (high level)

  • test_create_vector_index – covers the Python helper surface for vector index creation
  • test_lsm_vector_search – basic nearest-neighbor search through the embedded helper
  • test_lsm_vector_search_with_filter – allowed_rids filtering
  • test_lsm_vector_delete_and_search_others – deletes vertices, ensures others are still found
  • test_lsm_vector_search_ef_search – adjusts ef_search
  • test_get_vector_index_lsm – fetches index metadata
  • test_lsm_index_size – asserts index file presence/size
  • test_lsm_persistence – reopen DB and reuse the index
  • Distance suites – cosine/euclidean correctness for orthogonal/parallel/opposite/high-dim vectors
  • test_lsm_vector_search_comprehensive – end-to-end search path
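The distance suites reduce to a few closed-form expectations. As a minimal pure-Python sketch (illustrative only, not the bindings' API), the values asserted for orthogonal, parallel, and opposite vectors are:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0.0 for parallel, 1.0 for orthogonal, 2.0 for opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # orthogonal -> 1.0
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))    # parallel   -> 0.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))   # opposite   -> 2.0
print(euclidean_distance([1.0, 0.0], [0.0, 1.0])) # sqrt(2)
```

The same identities hold in high dimensions, which is what the high-dim variants of the suites check.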

SQL Vector Functions Tests

SQL vector operations are tested separately in test_vector_sql.py, including vector math functions, distance calculations, aggregations, quantization (with known limitations), and SQL-based index creation and search.
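The "known limitations" of quantization are essentially round-trip precision loss. As a hypothetical sketch (symmetric linear int8 quantization, chosen for illustration; not necessarily the engine's actual scheme), the restored vector approximates but never bit-exactly equals the original:

```python
def quantize_int8(vec):
    # Symmetric linear quantization: scale by max |value| so the range maps to [-127, 127].
    scale = (max(abs(v) for v in vec) / 127.0) or 1.0
    return [round(v / scale) for v in vec], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

vec = [0.1, -0.73, 0.45]
quantized, scale = quantize_int8(vec)
restored = dequantize(quantized, scale)
# Each component of `restored` is within half a quantization step of `vec`,
# but the round-trip is lossy -- the limitation the tests account for.
```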

Common Patterns

Create JVector (LSM-backed) index

with arcadedb.create_database("./test_db") as db:
    db.command("sql", "CREATE VERTEX TYPE Doc")
    db.command("sql", "CREATE PROPERTY Doc.embedding ARRAY_OF_FLOATS")

    db.command(
        "sql",
        '''
        CREATE INDEX ON Doc (embedding)
        LSM_VECTOR
        METADATA {
            "dimensions": 384,
            "similarity": "COSINE",
            "maxConnections": 16,
            "beamWidth": 100
        }
        ''',
    )

Insert and search with allowed_rids filtering
with arcadedb.create_database("./test_db") as db:
    db.command("sql", "CREATE VERTEX TYPE Doc")
    db.command("sql", "CREATE PROPERTY Doc.docId INTEGER")
    db.command("sql", "CREATE PROPERTY Doc.embedding ARRAY_OF_FLOATS")
    db.command(
        "sql",
        'CREATE INDEX ON Doc (embedding) LSM_VECTOR METADATA {"dimensions": 3}',
    )

    # Insert test vertices with embeddings
    with db.transaction():
        doc1 = db.new_vertex("Doc", docId=1, embedding=[1.0, 0.0, 0.0])
        doc1.save()
        doc2 = db.new_vertex("Doc", docId=2, embedding=[0.0, 1.0, 0.0])
        doc2.save()

    # Search with filters
    query = [1.0, 0.0, 0.0]
    allowed_rids_sql = f"['{doc1.get_rid()}', '{doc2.get_rid()}']"
    query_literal = "[" + ", ".join(str(float(v)) for v in query) + "]"
    results = db.query(
        "sql",
        (
            "SELECT expand(vectorNeighbors('Doc[embedding]', "
            f"{query_literal}, 2, 100)) WHERE @rid IN {allowed_rids_sql}"
        ),
    ).to_list()
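Search results of this kind are typically validated against an exact brute-force scan over the same vectors. A self-contained sketch of that reference computation (pure Python, independent of the bindings; names like brute_force_knn are illustrative):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def brute_force_knn(query, rows, k):
    # rows: list of (rid, embedding) pairs; returns the k RIDs closest
    # to `query` by cosine distance, which the index results must match.
    ranked = sorted(rows, key=lambda r: cosine_distance(query, r[1]))
    return [rid for rid, _ in ranked[:k]]

rows = [("#1:0", [1.0, 0.0, 0.0]), ("#1:1", [0.0, 1.0, 0.0])]
print(brute_force_knn([1.0, 0.0, 0.0], rows, 1))  # ['#1:0']
```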

Chunked insert vectors (preferred)

with arcadedb.create_database("./test_db") as db:
    db.command("sql", "CREATE VERTEX TYPE Doc")
    db.command("sql", "CREATE PROPERTY Doc.docId INTEGER")
    db.command("sql", "CREATE PROPERTY Doc.embedding ARRAY_OF_FLOATS")

    # Prefer chunked transactions for embedded (avoids batch_context overhead)
    vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    chunk_size = 100
    for start in range(0, len(vectors), chunk_size):
        with db.transaction():
            for idx, vec in enumerate(vectors[start : start + chunk_size]):
                doc = db.new_vertex("Doc")
                doc.set("docId", start + idx)
                doc.set("embedding", vec)
                doc.save()

Key Takeaways

  1. JVector is fully Java-native and LSM-backed; no legacy hnswlib path remains.
  2. max_connections and beam_width map to JVector graph degree and search beam; tune per workload.
  3. Prefer chunked db.transaction() inserts for embedded workloads rather than a separate batching abstraction.

See Also