Example 05: CSV Import - Graph¶
Production-ready graph creation from MovieLens dataset with comprehensive performance analysis
Overview¶
This example demonstrates creating a graph database from the MovieLens dataset, transforming documents into vertices and edges. You'll learn production-ready patterns for:
- Graph modeling - Users and Movies as vertices, ratings and tags as edges
- SQL pipeline - End-to-end graph creation with SQL DDL/DML
- Async vs Sync - Compare asynchronous and synchronous transaction modes in embedded mode
- Index optimization - Create indexes BEFORE bulk edge creation for 2-3× speedup
- Export & roundtrip validation - Verify data integrity through complete cycle
- Performance benchmarking - Measure and compare 6 different configurations
- Query validation - 10 graph queries with result verification
What You'll Learn¶
- Graph schema design (vertex types, edge types, properties)
- Foreign key resolution (userId → User vertex, movieId → Movie vertex)
- Synchronous is faster than async in embedded mode (no network I/O to parallelize)
- Indexes provide 2-3× speedup for edge creation
- Export/import performance and compression ratios (50-80×)
- Graph query patterns (MATCH, collaborative filtering, recommendations)
- Production import patterns for large-scale graphs
Prerequisites¶
1. Install ArcadeDB Python bindings:
2. Dataset download (automatic):
The example automatically downloads the dataset if it doesn't exist. You can also use a pre-existing document database from Example 04.
Two dataset sizes available:
- movielens-large: ~330K users, ~86K movies, ~33M edges (~971 MB CSV) - Realistic performance testing
- movielens-small: ~610 users, ~9,700 movies, ~100K edges (~3.2 MB CSV) - Quick testing
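The automatic download can be sketched roughly as follows. This is an illustrative helper, not the example script's actual code; the GroupLens archive names and the local `data/` layout are assumptions:

```python
import urllib.request
import zipfile
from pathlib import Path

# Mapping from the example's dataset names to GroupLens archive names (assumed)
ARCHIVES = {
    "movielens-small": "ml-latest-small",
    "movielens-large": "ml-latest",
}

def dataset_url(name: str) -> str:
    """Build the download URL for a dataset (hypothetical helper)."""
    return f"https://files.grouplens.org/datasets/movielens/{ARCHIVES[name]}.zip"

def ensure_dataset(name: str, data_dir: Path = Path("data")) -> Path:
    """Download and unzip the dataset only if it is not already present."""
    target = data_dir / ARCHIVES[name]
    if target.exists():
        return target  # already downloaded, skip the ~MB/GB transfer
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / f"{ARCHIVES[name]}.zip"
    urllib.request.urlretrieve(dataset_url(name), archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)
    return target
```

The `exists()` check is what makes re-runs cheap: the large CSV (~971 MB) is fetched at most once.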
Usage¶
# Recommended SQL configuration (synchronous)
python 05_csv_import_graph.py --dataset movielens-small --method sql --no-async
# Run with export for roundtrip validation
python 05_csv_import_graph.py --dataset movielens-small --batch-size 5000 --method java --no-async --export
# Comprehensive benchmark (all 6 configurations in parallel)
./run_benchmark_05_csv_import_graph.sh movielens-small 5000 4 all_6 --export
# See all options
python 05_csv_import_graph.py --help
Key options:
- --dataset {movielens-small,movielens-large} - Dataset size (default: movielens-small)
- --method {java,sql} - Creation method (DSL-first recommendation: sql)
- --no-async - Use synchronous transactions (FASTER for embedded mode)
- --no-index - Skip creating indexes (slower, for comparison)
- --batch-size BATCH_SIZE - Batch size for operations (default: 5000)
- --parallel PARALLEL - Async executor threads (default: 4, not recommended)
- --export - Export graph database to JSONL after creation
- --source-db SOURCE_DB - Custom source database path
- --import-jsonl IMPORT_JSONL - Import from JSONL export instead of CSV
Recommendations:
- Method: Use sql for consistency with DSL-first examples and guides
- Async: Use --no-async for best performance (2.5× faster in embedded mode)
- Indexes: Keep enabled (2-3× speedup for edge creation)
- Batch size: 5000 for small, 50000 for large datasets
- Export: Use --export to validate data integrity via roundtrip
Graph Schema¶
Vertex Types¶
User
- Properties: userId (LONG, indexed), name (STRING)
- Count: 610 (small) / 330,000 (large)
Movie
- Properties: movieId (LONG, indexed), title (STRING), genres (STRING), imdbId (STRING), tmdbId (STRING)
- Count: 9,742 (small) / 86,537 (large)
Edge Types¶
RATED (User → Movie)
- Properties: rating (FLOAT), timestamp (LONG)
- Count: 98,734 (small) / 33,155,309 (large)
TAGGED (User → Movie)
- Properties: tag (STRING), timestamp (LONG)
- Count: 3,494 (small) / 2,212,213 (large)
Performance Results¶
Small Dataset (610 users, 9,742 movies, 102,228 edges)¶
| Method | Vertices/sec | Edges/sec | Creation Time | Memory (Peak) |
|---|---|---|---|---|
| java_noasync ⚡ | 11,528/s | 6,927/s | 15.65s | 5.5 GB |
| java_noindex_noasync | 12,514/s | 6,766/s | 15.94s | 5.5 GB |
| java (async) | 4,453/s | 6,516/s | 18.01s | 5.6 GB |
| java_noindex (async) | 4,255/s | 6,143/s | 19.07s | 5.0 GB |
| sql_noindex | 5,383/s | 5,225/s | 21.49s | 5.6 GB |
| sql | 4,882/s | 4,956/s | 22.75s | 5.5 GB |
Large Dataset (330K users, 86K movies, 35.4M edges)¶
| Method | Vertices/sec | Edges/sec | Creation Time | Memory (Peak) |
|---|---|---|---|---|
| java_noasync ⚡ | 12,341/s | 5,071/s | 1h 57m | 16.0 GB |
| sql | 8,734/s | 3,789/s | 2h 36m | 16.0 GB |
| java (async) | 2,469/s | 2,034/s | 4h 52m | 18.4 GB |
| java_noindex_noasync | 13,036/s | 1,839/s | 5h 21m | 15.7 GB |
| sql_noindex | 8,902/s | 1,773/s | 5h 33m | 15.6 GB |
| java_noindex (async) | 2,662/s | 820/s | 12h 1m | 18.3 GB |
Key Performance Insights¶
1. 🚀 Synchronous Java API is FASTEST for Bulk Edge Creation¶
Winner: java_noasync (5,071 edges/sec)
Large Dataset Edge Creation:
✅ java_noasync: 5,071 edges/sec ← FASTEST (sync Java API)
✅ sql: 3,789 edges/sec (25% slower)
❌ java (async): 2,034 edges/sec (60% slower)
Why? In embedded mode, there's no network I/O to parallelize. Async machinery (futures, thread coordination) adds overhead without benefits. Direct synchronous Java API calls are the fastest path.
2. 📊 Indexes Provide 2-3× Speedup¶
Java API Edge Creation (Large Dataset):
✅ WITH indexes (java_noasync): 5,071 edges/sec ← 2.8× faster
❌ WITHOUT indexes: 1,839 edges/sec
SQL Edge Creation (Large Dataset):
✅ WITH indexes (sql): 3,789 edges/sec ← 2.1× faster
❌ WITHOUT indexes: 1,773 edges/sec
Best Practice: Create indexes BEFORE bulk edge creation (unlike documents where indexes come after).
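The speedup comes from the FROM/TO subqueries in CREATE EDGE (SELECT FROM User WHERE userId = ?): with an index each lookup is a direct probe, without one it is a full type scan per edge. A toy Python illustration of the mechanism (not ArcadeDB code — a plain list stands in for an unindexed type, a dict for the index):

```python
# Toy model: resolving a foreign key with and without an "index"
users = list(range(100_000))  # unindexed vertices: must scan

def resolve_scan(uid: int) -> int:
    """Without an index: O(n) linear scan over all vertices."""
    for u in users:
        if u == uid:
            return u
    raise KeyError(uid)

index = {u: u for u in users}  # index built once, before bulk edges

def resolve_indexed(uid: int) -> int:
    """With an index: O(1) direct lookup."""
    return index[uid]

# Both return the same vertex; the scan just pays per-edge cost
assert resolve_scan(99_999) == resolve_indexed(99_999)
```

Multiply the per-lookup gap by 33M edges (two lookups each) and the 2-3× end-to-end difference in the tables above follows.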
3. ⚡ Async Hurts Performance in Embedded Mode¶
Java API with Indexes (Large Dataset):
✅ Synchronous (java_noasync): 5,071 edges/sec ← 2.5× faster
❌ Async (java): 2,034 edges/sec
Vertex Creation:
✅ Synchronous: 12,341 vertices/sec ← 5× faster
❌ Async: 2,469 vertices/sec
Why? Async is designed for I/O-bound operations (network, disk). In embedded mode, all operations are in-memory JVM calls. Async overhead (thread pools, futures, synchronization) wastes CPU cycles.
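The effect is easy to reproduce outside ArcadeDB: dispatching CPU-bound, in-memory work through a thread pool produces the same result as a plain loop but pays for futures and queue hand-offs with nothing to overlap. A minimal sketch (the squaring function stands in for an embedded-mode insert):

```python
from concurrent.futures import ThreadPoolExecutor

def insert(i: int) -> int:
    """CPU-bound, in-memory work: no I/O for async to overlap."""
    return i * i

items = range(10_000)

# Synchronous: a plain loop, the direct path
sync_result = sum(insert(i) for i in items)

# "Async": every call goes through the pool's task queue and a future;
# the coordination is pure overhead when there is no I/O to hide
with ThreadPoolExecutor(max_workers=4) as pool:
    async_result = sum(pool.map(insert, items))

assert sync_result == async_result  # same answer, extra machinery
```

Timing the two loops on any machine shows the pool version slower for this workload, which is the embedded-mode situation in miniature.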
4. 🎯 DSL-First: Use SQL for Ingestion and Queries¶
Use SQL For:
- ✅ Bulk edge/vertex creation with one consistent DSL across examples
- ✅ Complex graph queries (MATCH patterns, multi-hop traversals)
- ✅ Aggregations (GROUP BY, COUNT, AVG)
- ✅ Ad-hoc analysis and prototyping
- ✅ When readability and portability matter
- ✅ Cypher compatibility (Neo4j migration path)
5. 💾 Memory Usage: Heap vs Total Process Memory¶
Large Dataset Memory (8GB JVM Heap):
Peak RSS (actual memory): 15.6 - 18.8 GB ← Total process memory
JVM Heap (-Xmx): 8.0 GB ← Just the heap portion
Total = Heap + Non-Heap (metaspace, page cache, thread stacks, direct buffers)
Insight: JVM heap setting (-Xmx) only limits heap memory. Total process memory
includes metaspace, page cache, thread stacks, and direct buffers. Plan for 1.5-2× your
heap size in actual RAM.
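The planning rule above reduces to a one-line calculation; the helper below is illustrative, using the 1.5-2× factor observed in these runs:

```python
def plan_ram_gb(heap_gb: float, factor: float = 2.0) -> float:
    """Rule of thumb from the runs above: budget 1.5-2x the -Xmx heap
    for total process RSS (heap + metaspace, page cache, thread
    stacks, direct buffers)."""
    return heap_gb * factor

print(plan_ram_gb(8))        # 16.0 -> upper budget for -Xmx8g
print(plan_ram_gb(8, 1.5))   # 12.0 -> lower bound
```

For the -Xmx8g runs in the table, actual peak RSS (15.6-18.4 GB) lands near the 2× end of that range.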
Export & Roundtrip Validation¶
Export Performance¶
Large Dataset Export (35.4M edges):
✅ java_noasync: 122.53s (288,815 records/sec)
✅ sql: 187.54s (188,682 records/sec)
Compression Ratio: 50-80× (JSONL → gzip)
File Size: 670 MB compressed (from ~33 GB uncompressed)
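JSONL compresses this well because the records are highly repetitive (same keys, similar values on every line). A self-contained demonstration with synthetic rating-shaped records (field names assumed to mirror the export; exact ratios depend on the data):

```python
import gzip
import io
import json

# Synthetic records shaped like the RATED export (field names assumed)
records = [
    {"userId": i % 610, "movieId": i % 9742, "rating": 4.0, "timestamp": 964982703}
    for i in range(10_000)
]

# One JSON object per line = JSONL
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# gzip the whole stream in memory
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(raw)
compressed = buf.getvalue()

ratio = len(raw) / len(compressed)
print(f"{len(raw):,} B -> {len(compressed):,} B ({ratio:.0f}x smaller)")
```

Real MovieLens values vary more than this synthetic stream, which is why the observed ratio is 50-80× rather than arbitrarily high.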
Roundtrip Import Performance¶
Large Dataset Import from JSONL:
✅ java_noasync: 1,365.41s (25,906 records/sec)
✅ sql: 1,783.64s (19,843 records/sec)
Total Roundtrip (Export + Import):
✅ java_noasync: 1,487.94s (24.8 min)
✅ sql: 1,971.18s (32.9 min)
Validation: All roundtrip imports passed 10 query validations, confirming data integrity.
Graph Queries¶
The example includes 10 comprehensive graph queries (8 SQL + 2 OpenCypher):
Query 1: Count Vertices by Type¶
Query 2: Count Edges by Type¶
Query 3: Sample User Ratings¶
Query 4: Sample User Tags¶
Query 5: High-Rated Movies¶
MATCH {type: User}-RATED->{type: Movie, as: m}
RETURN m.title, m.movieId, m.genres
ORDER BY m.movieId
LIMIT 10
Query 6: Collaborative Filtering (Most Complex)¶
MATCH {type: User, where: (userId = 1)}-RATED->{type: Movie}<-RATED-{type: User, as: otherUser}
RETURN otherUser.userId, count(*) as shared_movies
GROUP BY otherUser.userId
ORDER BY shared_movies DESC
LIMIT 10
Query 7-8: Additional SQL Patterns¶
- Query 7: Users with similar taste (SQL MATCH + aggregation)
- Query 8: Rating distribution (SQL aggregation, filters NULL ratings)
Query 9-10: OpenCypher Patterns¶
- Query 9: User's top-rated movies (OpenCypher traversal with filtering)
- Query 10: Collaborative filtering (OpenCypher aggregation)
Note: These queries use OpenCypher patterns as an alternative to SQL MATCH syntax.
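As a sketch of what Query 9 looks like in OpenCypher form (the exact query text in the example script may differ, and the language identifier passed to the bindings is assumed here to be "cypher" — check the bindings' docs):

```python
# OpenCypher version of "user's top-rated movies" (Query 9) - illustrative
cypher = """
MATCH (u:User {userId: 1})-[r:RATED]->(m:Movie)
WHERE r.rating >= 4.0
RETURN m.title, r.rating
ORDER BY r.rating DESC
LIMIT 10
"""

# Executed with the "cypher" language identifier instead of "sql":
# results = db.query("cypher", cypher)
```

Structurally this is the same traversal as the SQL MATCH in Query 5, which is what makes OpenCypher a low-friction migration path from Neo4j.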
Code Walkthrough¶
Step 1: Create Graph Schema¶
# Create vertex types
db.command("sql", "CREATE VERTEX TYPE User IF NOT EXISTS")
db.command("sql", "CREATE PROPERTY User.userId LONG")
db.command("sql", "CREATE PROPERTY User.name STRING")
db.command("sql", "CREATE INDEX ON User (userId) UNIQUE")
db.command("sql", "CREATE VERTEX TYPE Movie IF NOT EXISTS")
db.command("sql", "CREATE PROPERTY Movie.movieId LONG")
db.command("sql", "CREATE PROPERTY Movie.title STRING")
db.command("sql", "CREATE INDEX ON Movie (movieId) UNIQUE")
# Create edge types
db.command("sql", "CREATE EDGE TYPE RATED UNIDIRECTIONAL IF NOT EXISTS")
db.command("sql", "CREATE PROPERTY RATED.rating FLOAT")
db.command("sql", "CREATE PROPERTY RATED.timestamp LONG")
db.command("sql", "CREATE EDGE TYPE TAGGED UNIDIRECTIONAL IF NOT EXISTS")
db.command("sql", "CREATE PROPERTY TAGGED.tag STRING")
db.command("sql", "CREATE PROPERTY TAGGED.timestamp LONG")
Step 2: Create Vertices (SQL)¶
# The example uses a VertexCreator class for efficient batch creation
class VertexCreator:
    def _create_users(self, total_users: int):
        """Create User vertices from Rating data."""
        # Paginate by @rid to avoid loading everything into memory;
        # the first page has no lower bound (db and batch_size come
        # from the class setup, elided here)
        last_rid = None
        while True:
            where = f"WHERE @rid > {last_rid} " if last_rid else ""
            query = f"SELECT DISTINCT userId FROM Rating {where}LIMIT {batch_size}"
            users = list(db.query("sql", query))
            if not users:
                break
            for record in users:
                user_id = record.get("userId")
                db.command(
                    "sql",
                    "INSERT INTO User SET userId = ?, name = ?",
                    user_id,
                    f"User {user_id}",
                )
            last_rid = users[-1].get_identity()
# See the full implementation in the Python file for production-ready patterns
Step 3: Create Edges with Foreign Key Resolution¶
# Create RATED edges with efficient index lookups
class EdgeCreator:
    def _create_rated_edges(self, total_ratings: int):
        """Create RATED edges from Rating documents."""
        last_rid = None
        while True:
            # Paginated query; the first page has no lower bound
            where = f"WHERE @rid > {last_rid} " if last_rid else ""
            query = f"SELECT * FROM Rating {where}LIMIT {batch_size}"
            ratings = list(db.query("sql", query))
            if not ratings:
                break
            for rating_doc in ratings:
                user_id = rating_doc.get("userId")
                movie_id = rating_doc.get("movieId")
                db.command(
                    "sql",
                    """
                    CREATE EDGE RATED
                    FROM (SELECT FROM User WHERE userId = ?)
                    TO (SELECT FROM Movie WHERE movieId = ?)
                    SET rating = ?, timestamp = ?
                    """,
                    user_id,
                    movie_id,
                    rating_doc.get("rating"),
                    rating_doc.get("timestamp"),
                )
            last_rid = ratings[-1].get_identity()
# See the full implementation in the Python file with batch processing
Step 4: Run Validation Queries¶
# Verify vertex counts
user_count = db.query("sql", "SELECT count(*) as c FROM User").first().get("c")
movie_count = db.query("sql", "SELECT count(*) as c FROM Movie").first().get("c")
# Verify edge counts
rated_count = db.query("sql", "SELECT count(*) as c FROM RATED").first().get("c")
tagged_count = db.query("sql", "SELECT count(*) as c FROM TAGGED").first().get("c")
print(f"✅ Users: {user_count:,}")
print(f"✅ Movies: {movie_count:,}")
print(f"✅ RATED: {rated_count:,}")
print(f"✅ TAGGED: {tagged_count:,}")
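The 10-query validation pass mentioned under roundtrip results follows a simple harness pattern: pair each query with an expectation and collect failures. A generic sketch (the stub and checks below are illustrative, not the example's actual code):

```python
# Generic validation harness: run (name, query, expectation) triples
# against any query callable and collect what failed
def validate(run_query, checks):
    failures = []
    for name, query, expect in checks:
        value = run_query(query)
        if not expect(value):
            failures.append((name, value))
    return failures

# Stub standing in for db.query(...).first().get("c")
counts = {
    "SELECT count(*) as c FROM User": 610,
    "SELECT count(*) as c FROM Movie": 9742,
}
checks = [
    ("user count", "SELECT count(*) as c FROM User", lambda v: v == 610),
    ("movie count", "SELECT count(*) as c FROM Movie", lambda v: v == 9742),
]
assert validate(counts.get, checks) == []  # all checks pass
```

Passing a real `lambda q: db.query("sql", q).first().get("c")` in place of `counts.get` turns the same harness into the live validation run.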
Running the Benchmark¶
Single Configuration¶
# Fastest configuration (recommended)
python 05_csv_import_graph.py --dataset movielens-small --method java --no-async
# Compare with SQL
python 05_csv_import_graph.py --dataset movielens-small --method sql
# With export for roundtrip validation
python 05_csv_import_graph.py --dataset movielens-small --batch-size 5000 --method java --no-async --export
Comprehensive Benchmark (All 6 Configurations)¶
# Run all 6 methods in parallel
./run_benchmark_05_csv_import_graph.sh movielens-small 5000 4 all_6
# With export and roundtrip validation
./run_benchmark_05_csv_import_graph.sh movielens-small 5000 4 all_6 --export
# Large dataset (takes several hours)
./run_benchmark_05_csv_import_graph.sh movielens-large 50000 4 all_6 --export
6 Configurations:
- java - Java API with async and indexes
- java_noasync - Java API synchronous with indexes ⚡ FASTEST
- java_noindex - Java API with async, no indexes
- java_noindex_noasync - Java API synchronous, no indexes
- sql - SQL with indexes (always synchronous)
- sql_noindex - SQL without indexes (always synchronous)
Benchmark Configuration¶
JVM Settings¶
Prefer configuring heap inside the script before it creates the first database:
Memory Planning:
- Heap size: 8 GB (-Xmx8g)
- Total process memory: 15-19 GB (heap + non-heap)
- Plan for 2× heap size in actual RAM
Batch Sizes¶
Larger batches = fewer commits = faster imports
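The batching idea reduces to grouping the input stream and committing once per group. A self-contained sketch with a commit counter standing in for the transaction API (the real script's commit calls are elided here):

```python
def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

commits = 0
for batch in chunked(range(100_000), 5_000):
    # insert every row in the batch, then commit once per batch
    # (counter stands in for a hypothetical tx commit call)
    commits += 1
print(commits)  # 20 commits instead of 100,000
```

With --batch-size 50000 on the large dataset, the same 35.4M edges need ~708 commits instead of 35.4M, which is where the import-speed difference comes from.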
Parallel Processing¶
Note: Parallel async does NOT help for graph creation in embedded mode. Use synchronous Java API (--no-async) for best performance.
Next Steps¶
Example 06 - Vector Search: Movie Recommendations
Semantic similarity search with MovieLens data:
- Generate embeddings from movie titles/genres
- Build HNSW (JVector) index for nearest-neighbor search
- Find similar movies using cosine distance
- Combine vector similarity with rating data
- Query: "Movies similar to X that users also liked"