Example 05: CSV Import - Graph

Production-ready graph creation from MovieLens dataset with comprehensive performance analysis

Overview

This example demonstrates creating a graph database from the MovieLens dataset, transforming documents into vertices and edges. You'll learn production-ready patterns for:

  • Graph modeling - Users and Movies as vertices, ratings and tags as edges
  • Java API vs SQL - Compare both approaches for bulk graph creation
  • Async vs Sync - Understand when parallelism helps (and when it hurts)
  • Index optimization - Create indexes BEFORE bulk edge creation for a 2-3× speedup
  • Export & roundtrip validation - Verify data integrity through complete cycle
  • Performance benchmarking - Measure and compare 6 different configurations
  • Query validation - 10 graph queries with result verification

What You'll Learn

  • Graph schema design (vertex types, edge types, properties)
  • Foreign key resolution (userId → User vertex, movieId → Movie vertex)
  • Why the Java API outperforms SQL for bulk edge creation (5,071 vs 3,789 edges/sec)
  • Why synchronous beats async in embedded mode (no network I/O to parallelize)
  • How indexes provide a 2-3× speedup for edge creation
  • Export/import performance and compression ratios (50-80×)
  • Graph query patterns (MATCH, collaborative filtering, recommendations)
  • Production import patterns for large-scale graphs

Prerequisites

1. Install ArcadeDB Python bindings:

uv pip install arcadedb-embedded

2. Dataset download (automatic):

The example automatically downloads the dataset if it doesn't already exist; you can also reuse a pre-existing document database from Example 04. A manual-download sketch follows the size list below.

Two dataset sizes available:

  • movielens-large: ~330K users, ~86K movies, ~35M edges (~971 MB CSV) - Realistic performance testing
  • movielens-small: ~610 users, ~9,700 movies, ~100K edges (~3.2 MB CSV) - Quick testing
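
If you want to fetch the files by hand, here is a minimal sketch using the standard GroupLens download URLs (the data/ target directory is an assumption; point it wherever the example expects the CSVs):

import urllib.request
import zipfile

# Small variant; the large one is ml-latest.zip at the same location
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"

archive, _ = urllib.request.urlretrieve(url, "ml-latest-small.zip")
with zipfile.ZipFile(archive) as zf:
    zf.extractall("data/")  # yields ratings.csv, movies.csv, tags.csv, links.csv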

Usage

# Recommended: Fastest configuration (Java API, synchronous)
python 05_csv_import_graph.py --size small --method java --no-async

# Compare with SQL
python 05_csv_import_graph.py --size small --method sql

# Run with export for roundtrip validation
python 05_csv_import_graph.py --size small --method java --no-async --export

# Comprehensive benchmark (all 6 configurations in parallel)
./run_benchmark_05_csv_import_graph.sh small 5000 4 all_6 --export

# See all options
python 05_csv_import_graph.py --help

Key options:

  • --size {small,large} - Dataset size (default: small)
  • --method {java,sql} - Creation method: 'java' (recommended) or 'sql'
  • --no-async - Use synchronous transactions (FASTER for embedded mode)
  • --no-index - Skip creating indexes (slower, for comparison)
  • --batch-size BATCH_SIZE - Batch size for operations (default: 5000)
  • --parallel PARALLEL - Async executor threads (default: 4; only used with async, which is not recommended)
  • --export - Export graph database to JSONL after creation
  • --source-db SOURCE_DB - Custom source database path
  • --import-jsonl IMPORT_JSONL - Import from JSONL export instead of CSV

Recommendations:

  • Method: Use java (faster than SQL for bulk operations)
  • Async: Use --no-async for best performance (2.5× faster in embedded mode)
  • Indexes: Keep enabled (2-3× speedup for edge creation)
  • Batch size: 5000 for small, 50000 for large datasets
  • Export: Use --export to validate data integrity via roundtrip

Graph Schema

Vertex Types

User

  • Properties: userId (LONG, indexed), name (STRING)
  • Count: 610 (small) / 330,000 (large)

Movie

  • Properties: movieId (LONG, indexed), title (STRING), genres (STRING), imdbId (STRING), tmdbId (STRING)
  • Count: 9,742 (small) / 86,537 (large)

Edge Types

RATED (User → Movie)

  • Properties: rating (FLOAT), timestamp (LONG)
  • Count: 98,734 (small) / 33,155,309 (large)

TAGGED (User → Movie)

  • Properties: tag (STRING), timestamp (LONG)
  • Count: 3,494 (small) / 2,212,213 (large)
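
For reference, here is a sketch of the same schema expressed as ArcadeDB SQL DDL (the walkthrough below uses the Schema API instead, which is the recommended path for embedded runs):

CREATE VERTEX TYPE User;
CREATE PROPERTY User.userId LONG;
CREATE PROPERTY User.name STRING;
CREATE INDEX ON User (userId) UNIQUE;

CREATE VERTEX TYPE Movie;
CREATE PROPERTY Movie.movieId LONG;
CREATE PROPERTY Movie.title STRING;
CREATE PROPERTY Movie.genres STRING;
CREATE PROPERTY Movie.imdbId STRING;
CREATE PROPERTY Movie.tmdbId STRING;
CREATE INDEX ON Movie (movieId) UNIQUE;

CREATE EDGE TYPE RATED;
CREATE PROPERTY RATED.rating FLOAT;
CREATE PROPERTY RATED.timestamp LONG;

CREATE EDGE TYPE TAGGED;
CREATE PROPERTY TAGGED.tag STRING;
CREATE PROPERTY TAGGED.timestamp LONG;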

Performance Results

Small Dataset (610 users, 9,742 movies, 102,228 edges)

Method                  Vertices    Edges      Creation Time   Memory (Peak)
java_noasync            11,528/s    6,927/s    15.65s          5.5 GB
java_noindex_noasync    12,514/s    6,766/s    15.94s          5.5 GB
java (async)             4,453/s    6,516/s    18.01s          5.6 GB
java_noindex (async)     4,255/s    6,143/s    19.07s          5.0 GB
sql_noindex              5,383/s    5,225/s    21.49s          5.6 GB
sql                      4,882/s    4,956/s    22.75s          5.5 GB

Large Dataset (330K users, 86K movies, 35.4M edges)

Method                  Vertices    Edges      Creation Time   Memory (Peak)
java_noasync            12,341/s    5,071/s    1h 57m          16.0 GB
sql                      8,734/s    3,789/s    2h 36m          16.0 GB
java (async)             2,469/s    2,034/s    4h 52m          18.4 GB
java_noindex_noasync    13,036/s    1,839/s    5h 21m          15.7 GB
sql_noindex              8,902/s    1,773/s    5h 33m          15.6 GB
java_noindex (async)     2,662/s      820/s    12h 1m          18.3 GB

Key Performance Insights

1. 🚀 Synchronous Java API is FASTEST for Bulk Edge Creation

Winner: java_noasync (5,071 edges/sec)

Large Dataset Edge Creation:
✅ java_noasync:           5,071 edges/sec  ← FASTEST (sync Java API)
✅ sql:                    3,789 edges/sec  (25% slower)
❌ java (async):           2,034 edges/sec  (60% slower)

Why? In embedded mode, there's no network I/O to parallelize. Async machinery (futures, thread coordination) adds overhead without benefits. Direct synchronous Java API calls are the fastest path.

2. 📊 Indexes Provide 2-3× Speedup

Java API Edge Creation (Large Dataset):
✅ WITH indexes (java_noasync):     5,071 edges/sec  ← 2.8× faster
❌ WITHOUT indexes:                 1,839 edges/sec

SQL Edge Creation (Large Dataset):
✅ WITH indexes (sql):              3,789 edges/sec  ← 2.1× faster
❌ WITHOUT indexes:                 1,773 edges/sec

Best Practice: Create indexes BEFORE bulk edge creation (unlike documents where indexes come after).
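
The reason is that edge creation resolves its endpoints with lookup_by_key (see Step 3 below), which is only a fast index lookup once the unique indexes exist. A minimal ordering sketch using the same Schema API calls as the walkthrough:

# Load order for bulk graph creation: schema + unique indexes,
# then vertices, then edges.
db.schema.get_or_create_vertex_type("User")
db.schema.get_or_create_property("User", "userId", "LONG")
db.schema.get_or_create_index("User", ["userId"], unique=True)  # before any edges

# ... bulk-create User and Movie vertices ...

# Edge creation can now resolve endpoints via index lookups:
user_vertex = db.lookup_by_key("User", ["userId"], [1])[0]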

3. ⚡ Async Hurts Performance in Embedded Mode

Java API with Indexes (Large Dataset):
✅ Synchronous (java_noasync):     5,071 edges/sec  ← 2.5× faster
❌ Async (java):                   2,034 edges/sec

Vertex Creation:
✅ Synchronous:                   12,341 vertices/sec  ← 5× faster
❌ Async:                          2,469 vertices/sec

Why? Async is designed for I/O-bound operations (network, disk). In embedded mode, all operations are in-memory JVM calls. Async overhead (thread pools, futures, synchronization) wastes CPU cycles.

4. 🎯 Java API vs SQL: Use Case Matters

Use Java API For:

  • ✅ Bulk edge/vertex creation (java_noasync: 5,071 edges/sec vs sql: 3,789 edges/sec)
  • ✅ Simple CRUD operations (cleaner, type-safe)
  • ✅ Batch operations via transactions
  • ✅ When raw performance matters most

Use SQL For:

  • ✅ Complex graph queries (MATCH patterns, multi-hop traversals)
  • ✅ Aggregations (GROUP BY, COUNT, AVG)
  • ✅ Ad-hoc analysis and prototyping
  • ✅ When readability > raw speed
  • ✅ Cypher compatibility (Neo4j migration path)
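
As a rough illustration of the split (both calls appear in the walkthrough below), writes go through the Java API while read-side analysis stays in SQL:

# Writes: Java API (type-safe, no per-record SQL parsing)
vertex = db.new_vertex("User")
vertex.set("userId", 9999)
vertex.set("name", "User 9999")
vertex.save()

# Reads/analysis: SQL (aggregation in one declarative statement)
result = db.query("sql", "SELECT avg(rating) AS avg_rating FROM RATED")
print(result.first().get("avg_rating"))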

5. 💾 Memory Usage: Heap vs Total Process Memory

Large Dataset Memory (8GB JVM Heap):
Peak RSS (actual memory):  15.6 - 18.4 GB  ← Total process memory
JVM Heap (-Xmx):           8.0 GB           ← Just the heap portion

Total = Heap + Non-Heap (metaspace, page cache, thread stacks, direct buffers)

Insight: JVM heap setting (-Xmx) only limits heap memory. Total process memory includes metaspace, page cache, thread stacks, and direct buffers. Plan for 1.5-2× your heap size in actual RAM.
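
To check this on your own runs, total process memory can be sampled from Python with psutil (not a dependency of the example; shown purely as a measurement sketch):

import os
import psutil  # pip install psutil

# RSS covers heap plus non-heap (metaspace, page cache, thread stacks, ...)
rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
print(f"Total process memory (RSS): {rss_gb:.1f} GB")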

Export & Roundtrip Validation

Export Performance

Large Dataset Export (35.4M edges):
✅ java_noasync:    122.53s  (288,815 records/sec)
✅ sql:             187.54s  (188,682 records/sec)

Compression Ratio: 50-80× (JSONL → gzip)
File Size: 670 MB compressed (from ~33 GB uncompressed)

Roundtrip Import Performance

Large Dataset Import from JSONL:
✅ java_noasync:    1,365.41s  (25,906 records/sec)
✅ sql:             1,783.64s  (19,843 records/sec)

Total Roundtrip (Export + Import):
✅ java_noasync:    1,487.94s  (24.8 min)
✅ sql:             1,971.18s  (32.9 min)

Validation: All roundtrip imports passed 10 query validations, confirming data integrity.
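
The example drives this cycle through its --export and --import-jsonl options. If you want to trigger the same roundtrip by hand, ArcadeDB also exposes EXPORT DATABASE / IMPORT DATABASE SQL commands — a sketch below with a placeholder path; check the ArcadeDB documentation for the exact options:

EXPORT DATABASE file:///tmp/movielens-graph.jsonl.tgz
IMPORT DATABASE file:///tmp/movielens-graph.jsonl.tgz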

Graph Queries

The example includes 10 comprehensive graph queries (8 SQL + 2 OpenCypher):

Query 1: Count Vertices by Type

SELECT labels()[0] as type, count(*) as count
FROM V
GROUP BY type
ORDER BY count DESC

Query 2: Count Edges by Type

SELECT labels()[0] as type, count(*) as count
FROM E
GROUP BY type
ORDER BY count DESC

Query 3: Sample User Ratings

MATCH {type: User, where: (userId = 1)}-RATED->{type: Movie}
RETURN $patternPathEdges
LIMIT 5

Query 4: Sample User Tags

MATCH {type: User, where: (userId = 2)}-TAGGED->{type: Movie}
RETURN $patternPathEdges
LIMIT 5

Query 5: High-Rated Movies

MATCH {type: User}-RATED->{type: Movie, as: m}
RETURN m.title, m.movieId, m.genres
ORDER BY m.movieId
LIMIT 10

Query 6: Collaborative Filtering (Most Complex)

MATCH {type: User, where: (userId = 1)}-RATED->{type: Movie}<-RATED-{type: User, as: otherUser}
RETURN otherUser.userId, count(*) as shared_movies
GROUP BY otherUser.userId
ORDER BY shared_movies DESC
LIMIT 10

Queries 7-8: Additional SQL Patterns

  • Query 7: Users with similar taste (SQL MATCH + aggregation)
  • Query 8: Rating distribution (SQL aggregation, filters NULL ratings)

Queries 9-10: OpenCypher Patterns

  • Query 9: User's top-rated movies (OpenCypher traversal with filtering)
  • Query 10: Collaborative filtering (OpenCypher aggregation)

Note: These queries use OpenCypher patterns as an alternative to SQL MATCH syntax.
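
The exact Cypher text lives in the Python file; as an illustration, Query 9 ("a user's top-rated movies") might take a shape like this in OpenCypher (the rating threshold is an arbitrary choice for the sketch):

MATCH (u:User {userId: 1})-[r:RATED]->(m:Movie)
WHERE r.rating >= 4.0
RETURN m.title AS title, r.rating AS rating
ORDER BY rating DESC
LIMIT 10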

Code Walkthrough

Step 1: Create Graph Schema

# Create vertex types (Schema API preferred for embedded)
db.schema.get_or_create_vertex_type("User")
db.schema.get_or_create_property("User", "userId", "LONG")
db.schema.get_or_create_property("User", "name", "STRING")
db.schema.get_or_create_index("User", ["userId"], unique=True)

db.schema.get_or_create_vertex_type("Movie")
db.schema.get_or_create_property("Movie", "movieId", "LONG")
db.schema.get_or_create_property("Movie", "title", "STRING")
db.schema.get_or_create_index("Movie", ["movieId"], unique=True)

# Create edge types
db.schema.get_or_create_edge_type("RATED")
db.schema.get_or_create_property("RATED", "rating", "FLOAT")
db.schema.get_or_create_property("RATED", "timestamp", "LONG")

db.schema.get_or_create_edge_type("TAGGED")
db.schema.get_or_create_property("TAGGED", "tag", "STRING")
db.schema.get_or_create_property("TAGGED", "timestamp", "LONG")

# SQL DDL remains available (e.g., for remote servers), but the Schema API above is recommended for embedded runs.

Step 2: Create Vertices from Documents

# The example uses a VertexCreator class for efficient batch creation
class VertexCreator:
    def _create_users(self, total_users: int):
        """Create User vertices from Rating data."""
        # Paginate by @rid to keep memory bounded; skip the WHERE clause on
        # the first page, when there is no last RID yet.
        last_rid = None
        while True:
            where = f"WHERE @rid > {last_rid} " if last_rid else ""
            query = f"SELECT DISTINCT userId FROM Rating {where}LIMIT {batch_size}"
            users = list(db.query("sql", query))
            if not users:
                break

            for record in users:
                user_id = record.get("userId")
                vertex = db.new_vertex("User")
                vertex.set("userId", user_id)
                vertex.set("name", f"User {user_id}")
                vertex.save()

            last_rid = users[-1].get_identity()

# See full implementation in Python file for production-ready patterns

Step 3: Create Edges with Foreign Key Resolution

# Create RATED edges with efficient index lookups
class EdgeCreator:
    def _create_rated_edges(self, total_ratings: int):
        """Create RATED edges from Rating documents."""
        last_rid = None
        while True:
            # Paginated query; skip the WHERE clause on the first page
            where = f"WHERE @rid > {last_rid} " if last_rid else ""
            query = f"SELECT * FROM Rating {where}LIMIT {batch_size}"
            ratings = list(db.query("sql", query))
            if not ratings:
                break

            for rating_doc in ratings:
                user_id = rating_doc.get("userId")
                movie_id = rating_doc.get("movieId")

                # Use index lookups (O(1)) instead of SQL queries
                user_vertex = db.lookup_by_key("User", ["userId"], [user_id])[0]
                movie_vertex = db.lookup_by_key("Movie", ["movieId"], [movie_id])[0]

                edge = user_vertex.new_edge("RATED", movie_vertex, True, "User", "Movie")
                edge.set("rating", rating_doc.get("rating"))
                edge.set("timestamp", rating_doc.get("timestamp"))
                edge.save()

            last_rid = ratings[-1].get_identity()

# See full implementation in Python file with batch processing
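
Batch size matters here because every commit flushes to disk. A sketch of the batching pattern, assuming the bindings mirror the Java begin()/commit() transaction calls (verify the exact names against the bindings' documentation):

BATCH_SIZE = 5000  # --batch-size default for the small dataset

db.begin()
for i, rating_doc in enumerate(ratings, start=1):
    # ... resolve endpoint vertices and create the edge, as in Step 3 ...
    if i % BATCH_SIZE == 0:
        db.commit()  # fewer commits = faster imports
        db.begin()
db.commit()  # commit the final partial batch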

Step 4: Run Validation Queries

# Verify vertex counts
user_count = db.query("sql", "SELECT count(*) as c FROM User").first().get("c")
movie_count = db.query("sql", "SELECT count(*) as c FROM Movie").first().get("c")

# Verify edge counts
rated_count = db.query("sql", "SELECT count(*) as c FROM RATED").first().get("c")
tagged_count = db.query("sql", "SELECT count(*) as c FROM TAGGED").first().get("c")

print(f"✅ Users:  {user_count:,}")
print(f"✅ Movies: {movie_count:,}")
print(f"✅ RATED:  {rated_count:,}")
print(f"✅ TAGGED: {tagged_count:,}")

Running the Benchmark

Single Configuration

# Fastest configuration (recommended)
python 05_csv_import_graph.py --size small --method java --no-async

# Compare with SQL
python 05_csv_import_graph.py --size small --method sql

# With export for roundtrip validation
python 05_csv_import_graph.py --size small --method java --no-async --export

Comprehensive Benchmark (All 6 Configurations)

# Run all 6 methods in parallel
./run_benchmark_05_csv_import_graph.sh small 5000 4 all_6

# With export and roundtrip validation
./run_benchmark_05_csv_import_graph.sh small 5000 4 all_6 --export

# Large dataset (takes several hours)
./run_benchmark_05_csv_import_graph.sh large 50000 4 all_6 --export

6 Configurations:

  1. java - Java API with async and indexes
  2. java_noasync - Java API synchronous with indexes ⚡ FASTEST
  3. java_noindex - Java API with async, no indexes
  4. java_noindex_noasync - Java API synchronous, no indexes
  5. sql - SQL with indexes (always synchronous)
  6. sql_noindex - SQL without indexes (always synchronous)
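
To run any configuration on its own rather than through the benchmark script, the names map to these flag combinations (derived from the documented options; add --size and --batch-size as needed):

python 05_csv_import_graph.py --method java                         # 1. java
python 05_csv_import_graph.py --method java --no-async              # 2. java_noasync
python 05_csv_import_graph.py --method java --no-index              # 3. java_noindex
python 05_csv_import_graph.py --method java --no-index --no-async   # 4. java_noindex_noasync
python 05_csv_import_graph.py --method sql                          # 5. sql
python 05_csv_import_graph.py --method sql --no-index               # 6. sql_noindex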

Benchmark Configuration

JVM Settings

export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"   # fixed 8 GB heap (initial = max)

Memory Planning:

  • Heap size: 8 GB (-Xmx8g)
  • Total process memory: 15-19 GB (heap + non-heap)
  • Plan for 2× heap size in actual RAM

Batch Sizes

--batch-size 5000    # Small dataset (default)
--batch-size 50000   # Large dataset (recommended)

Larger batches = fewer commits = faster imports

Parallel Processing

--parallel 4         # CSV import parallelism (Example 04)

Note: Parallel async does NOT help for graph creation in embedded mode. Use synchronous Java API (--no-async) for best performance.

Next Steps

Example 06 - Vector Search: Movie Recommendations

Semantic similarity search with MovieLens data:

  • Generate embeddings from movie titles/genres
  • Build an HNSW (JVector) index for nearest-neighbor search
  • Find similar movies using cosine distance
  • Combine vector similarity with rating data
  • Query: "Movies similar to X that users also liked"