Example 04: CSV Import - Tables (Documents)¶
Production-ready CSV import via SQL IMPORT DATABASE, with automatic type inference, NULL handling, and index optimization
Overview¶
This example demonstrates importing real-world CSV data from the MovieLens dataset into ArcadeDB documents. You'll learn production-ready patterns for:
- Automatic type inference - Java analyzes CSV and infers optimal ArcadeDB types
- SQL import workflow - Python drives the import through `IMPORT DATABASE`
- NULL value handling - Import and query missing data across all types
- Batch processing - Optimize import performance with commit batching
- Index optimization - Create indexes AFTER import for maximum throughput
- Performance analysis - Measure query speedup with statistical validation
- Result validation - Verify indexes return identical results with actual data samples
What You'll Learn¶
- Automatic type inference by the Java-side SQL import path (LONG, DOUBLE, STRING)
- SQL-driven import from Python via `db.command("sql", "IMPORT DATABASE ...")`
- NULL value import from empty CSV cells
- Query performance measurement (10 runs with statistics)
- Index creation timing (before vs after import)
- Composite indexes for multi-column queries
- Result validation with actual data samples
- Production import patterns for large datasets
Prerequisites¶
1. Install ArcadeDB Python bindings:
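The exact install command depends on how the bindings are distributed; a plausible sketch, assuming a local checkout of the repository (the path is an assumption):
# Assumption: installing from a local repository checkout; adjust to the
# actual distribution method (e.g. a published package) for your environment.
pip install ./bindings/python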
2. Dataset download (automatic):
The example automatically downloads the dataset if it doesn't exist. You can also download it manually:
cd bindings/python/examples
python download_data.py movielens-large # movielens large dataset
python download_data.py movielens-small # movielens small dataset
Note: You must download the dataset before running this example if the auto-download is disabled or blocked by your environment.
Two dataset sizes available:
- movielens-large: ~86,000 movies, ~33M ratings (~265 MB) - Realistic performance testing
- movielens-small: ~9,700 movies, ~100,000 ratings (~1 MB) - Quick testing
Both datasets include intentional NULL values for testing:
- movies.csv: ~2% NULL genres, ~0.5% NULL titles
- ratings.csv: ~3% NULL timestamps, ~1% NULL ratings
- links.csv: ~5% NULL imdbId, ~8% NULL tmdbId
- tags.csv: ~5% NULL tags, ~2% NULL timestamps
Dataset Structure¶
MovieLens Large - 4 CSV files:
| File | Records | Columns | Description |
|---|---|---|---|
| movies.csv | 86,537 | movieId, title, genres | Movie metadata |
| ratings.csv | 33,832,162 | userId, movieId, rating, timestamp | User ratings |
| links.csv | 86,537 | movieId, imdbId, tmdbId | External IDs (with NULLs) |
| tags.csv | 2,328,315 | userId, movieId, tag, timestamp | User tags (with NULLs) |
Total records: 36,333,551 documents
For quick testing with the smaller dataset (124,003 records), use: python download_data.py movielens-small
Usage¶
# Basic usage (large dataset by default)
python 04_csv_import_documents.py
# Use small dataset for quick testing
python 04_csv_import_documents.py --dataset movielens-small
# Configure parallel threads and batch size
python 04_csv_import_documents.py --parallel 8 --batch-size 10000
# Export database for reproducibility
python 04_csv_import_documents.py --export
# See all options
python 04_csv_import_documents.py --help
Key options:
- `--dataset {movielens-small,movielens-large}` - Dataset size (default: movielens-large)
- `--parallel PARALLEL` - Number of parallel import threads (default: auto-detect)
- `--batch-size BATCH_SIZE` - Records per commit batch (default: 5000)
- `--export` - Export database to JSONL after import
- `--db-name DB_NAME` - Custom database name (default: movielens_{size}_db)
Recommendations:
- Parallel threads: 4-8 for best performance (auto-detected by default)
- Batch size: 5000-50000 (larger = faster imports, more memory)
- Export: Use `--export` to create reproducible benchmark databases
Type Inference by Java¶
The example uses automatic type inference by the Java-side SQL import path, which analyzes the data and selects optimal ArcadeDB types:
Example Inference Results (movielens-large)¶
📋 Movie (movies.csv):
• movieId: LONG (e.g., '1')
• title: STRING (e.g., 'Toy Story (1995)')
• genres: STRING (e.g., 'Adventure|Animation|Children|Comedy|Fantasy')
📋 Rating (ratings.csv):
• userId: LONG (e.g., '1')
• movieId: LONG (e.g., '1')
• rating: DOUBLE (e.g., '4.0')
• timestamp: LONG (e.g., '964982703')
📋 Link (links.csv):
• movieId: LONG (e.g., '1')
• imdbId: LONG (e.g., '') ← NULL value
• tmdbId: LONG (e.g., '862')
📋 Tag (tags.csv):
• userId: LONG (e.g., '2')
• movieId: LONG (e.g., '60756')
• tag: STRING (e.g., 'funny')
• timestamp: LONG (e.g., '1445714994')
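To confirm what was inferred, you can query the schema from Python. A minimal sketch, assuming the open `db` handle used throughout this example; `schema:types` is ArcadeDB's built-in metadata target, and the exact fields returned may vary:
# List the document types created by the import, with their properties.
# Assumes `db` is the open database handle used elsewhere in this example.
for t in db.query("sql", "SELECT FROM schema:types"):
    print(t.get("name"), t.get("properties"))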
Code Walkthrough¶
Step 1: Check Dataset Availability¶
import sys
from pathlib import Path

data_dir = Path(__file__).parent / "data" / "movielens-small"
if not data_dir.exists():
    print("❌ MovieLens dataset not found!")
    print("💡 Please download the dataset first:")
    print("   python download_data.py")
    sys.exit(1)
Step 2: Let the SQL Import Path Infer Types Automatically¶
The Java-side import path behind SQL IMPORT DATABASE automatically analyzes the CSV data
and infers optimal ArcadeDB types (LONG, DOUBLE, STRING). No manual type inference code
is needed.
This example still creates the target schema intentionally so the import target and index plan stay explicit.
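A sketch of what that explicit schema definition might look like; the example script's actual statements may differ, but CREATE DOCUMENT TYPE and CREATE PROPERTY are standard ArcadeDB SQL:
# Hypothetical explicit schema for the Movie type, matching the types
# the importer would infer for movies.csv.
db.command("sql", "CREATE DOCUMENT TYPE Movie")
db.command("sql", "CREATE PROPERTY Movie.movieId LONG")
db.command("sql", "CREATE PROPERTY Movie.title STRING")
db.command("sql", "CREATE PROPERTY Movie.genres STRING")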
Step 3: Import CSV Files Directly¶
# Import with batch commits for performance
import_options = {
    "commitEvery": args.batch_size,  # Batch size for commits
}
stats = import_csv_documents_via_sql(db, movies_csv, "Movie", **import_options)

# Check for NULL values (using .first() for efficiency)
null_genres = (
    db.query("sql", "SELECT count(*) as c FROM Movie WHERE genres IS NULL")
    .first()
    .get("c")
)
if null_genres > 0:
    print(" 🔍 NULL values detected:")
    print(f"   • genres: {null_genres:,} NULL values ({null_genres/stats['documents']*100:.1f}%)")
    print(" 💡 Empty CSV cells correctly imported as SQL NULL")
Performance results (small dataset):
- Movies: 105,891 records/sec
- Ratings: 484,788 records/sec (largest file, highly optimized)
- Links: 374,692 records/sec
- Tags: 167,409 records/sec
- Total: 356,330 records/sec average
Performance results (large dataset):
- Movies: 288,457 records/sec
- Ratings: 908,832 records/sec (largest file, highly optimized)
- Links: 697,879 records/sec
- Tags: 739,148 records/sec
- Total: 890,528 records/sec average
Key insight: Larger datasets show better performance due to more efficient batch processing and reduced per-record overhead.
Step 8: Query Performance WITHOUT Indexes¶
import statistics
import time

test_queries = [
    ("Find movie by ID", "SELECT FROM Movie WHERE movieId = 500"),
    ("Find user's ratings", "SELECT FROM Rating WHERE userId = 414 ORDER BY movieId, rating LIMIT 10"),
    ("Find movie ratings", "SELECT FROM Rating WHERE movieId = 500 ORDER BY userId, rating LIMIT 10"),
    ("Count user's ratings", "SELECT count(*) as count FROM Rating WHERE userId = 414"),
    ("Find movies by genre", "SELECT FROM Movie WHERE genres LIKE '%Action%' ORDER BY movieId LIMIT 10"),
]

# Run each query 10 times for statistical reliability
for query_name, query in test_queries:
    run_times = []
    for _ in range(10):
        query_start = time.time()
        result = list(db.query("sql", query))
        run_times.append(time.time() - query_start)
    avg_time = statistics.mean(run_times)
    std_time = statistics.stdev(run_times)
    print(f" 📊 {query_name}:")
    print(f"    Average: {avg_time*1000:.2f}ms ± {std_time*1000:.2f}ms")
Step 9: Create Indexes (AFTER Import)¶
db.command("sql", "CREATE INDEX ON Movie (movieId) UNIQUE")
db.command("sql", "CREATE INDEX ON Rating (userId, movieId) NOTUNIQUE") # Composite!
db.command("sql", "CREATE INDEX ON Link (movieId) UNIQUE")
db.command("sql", "CREATE INDEX ON Tag (movieId) NOTUNIQUE")
Why create indexes AFTER import?
- 2-3x faster total time
- Indexes built in one pass
- Fully compacted from start
- Production best practice
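To see the cost of index creation directly, each CREATE INDEX can be timed individually; a short sketch reusing the `time` module from Step 8:
import time

# Time each index build; on the small dataset all four complete in ~0.2s.
index_statements = [
    "CREATE INDEX ON Movie (movieId) UNIQUE",
    "CREATE INDEX ON Rating (userId, movieId) NOTUNIQUE",
    "CREATE INDEX ON Link (movieId) UNIQUE",
    "CREATE INDEX ON Tag (movieId) NOTUNIQUE",
]
for stmt in index_statements:
    start = time.time()
    db.command("sql", stmt)
    print(f"  {stmt}: {time.time() - start:.2f}s")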
Step 10: Query Performance WITH Indexes¶
Same queries, now with indexes active. Results show dramatic speedup!
Performance Results¶
Query Speedup Summary¶
🚀 Performance Improvement Summary:
======================================================================
Query                      Before (ms)      After (ms)   Speedup
======================================================================
Find movie by ID           39.1±7.6         0.7±1.9      58.1x (98.3% time saved)
Find user's ratings        16003.9±79.4     1.1±0.4      14,836x (100.0% time saved)
Find movie ratings         16524.8±124.2    153.1±22.2   107.9x (99.1% time saved)
Count user's ratings       16004.3±138.3    0.8±1.4      19,604x (100.0% time saved)
Find movies by genre       1.0±1.0          0.8±0.3      1.3x (23.7% time saved)
Count ALL Action movies    58.7±8.4         65.1±18.0    0.9x (-10.8% time saved)
======================================================================
Key findings:
- ✅ Composite indexes show massive gains (up to 19,604x speedup!)
- ✅ Single column lookups are very fast (58.1x speedup)
- ✅ Standard deviation shows query stability
- ⚠️ Leading-wildcard LIKE filters and full counts see little or no gain (0.9x-1.3x) since they cannot use an index
NULL Value Handling¶
The example demonstrates comprehensive NULL handling:
Import Results¶
Step 2: Importing movies.csv → Movie documents...
✅ Imported 86,537 movies
🔍 NULL values detected:
• genres: 2,584 NULL values (3.0%)
💡 Empty CSV cells correctly imported as SQL NULL
Step 4: Importing ratings.csv → Rating documents...
✅ Imported 33,832,162 ratings
🔍 NULL values detected:
• timestamp: 675,782 NULL values (2.0%)
💡 Empty CSV cells correctly imported as SQL NULL
Step 5: Importing links.csv → Link documents...
✅ Imported 86,537 links
🔍 NULL values detected:
• imdbId: 8,628 NULL values (10.0%)
• tmdbId: 13,026 NULL values (15.1%)
💡 Empty CSV cells correctly imported as SQL NULL
Step 6: Importing tags.csv → Tag documents...
✅ Imported 2,328,315 tags
🔍 NULL values detected:
• tag: 116,644 NULL values (5.0%)
💡 Empty CSV cells correctly imported as SQL NULL
NULL Values in Aggregations¶
💬 Top 10 most common tags:
1. 'None' (116,644 uses) ← NULL tags appear as 'None'
2. 'sci-fi' (13,612 uses)
...
🎭 Top 10 genres by movie count:
1. Drama (11,857 movies)
...
6. None (2,584 movies) ← NULL genres appear as 'None'
Total NULL values: ~799,638 across all fields (LONG and STRING types)
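The NULL counts above can be reproduced with ordinary SQL predicates; a short sketch using the same query pattern as Step 3:
# Count NULLs per column; empty CSV cells were imported as SQL NULL.
for type_name, column in [("Link", "imdbId"), ("Link", "tmdbId"), ("Tag", "tag")]:
    nulls = (
        db.query("sql", f"SELECT count(*) as c FROM {type_name} WHERE {column} IS NULL")
        .first()
        .get("c")
    )
    print(f"{type_name}.{column}: {nulls:,} NULL values")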
Index Architecture¶
ArcadeDB uses LSM-Tree (Log-Structured Merge Tree) for all indexes - a single unified backend that handles all data types efficiently.
How LSM-Tree Works¶
Architecture:
LSMTreeIndex
├── Mutable Buffer (in-memory)
│ └── Recent writes, fast inserts
│
└── Compacted Storage (disk)
└── Sorted, immutable data
Key advantages:
- Write-optimized: Sequential writes to memory buffer (perfect for bulk imports)
- Type-agnostic: Same structure for all types, type-aware comparison during lookups
- Auto-compaction: Background merging keeps data sorted and compact
- Transaction-friendly: Buffers writes until commit
Type Performance & Storage¶
Binary serialization per type:
| Type | Storage Size | Comparison Speed | Best For |
|---|---|---|---|
| BYTE | 1 byte | ⚡ Very fast | Flags, small counts (0-255) |
| SHORT | 2 bytes | ⚡ Very fast | Medium numbers (-32K to 32K) |
| INTEGER | 4 bytes | ⚡ Very fast | IDs, standard numbers (up to 2B) |
| LONG | 8 bytes | ⚡ Very fast | Large IDs, timestamps |
| FLOAT | 4 bytes | ⚡ Fast | Small decimals (7-digit precision) |
| DOUBLE | 8 bytes | ⚡ Fast | Standard decimals (15-digit precision) |
| DATE/DATETIME | 8 bytes (as LONG) | ⚡ Fast | Timestamps, dates |
| STRING | Variable | 🐌 Slower | Text, byte-by-byte comparison |
| DECIMAL | Variable | 🐌 Slowest | Exact precision (e.g., money) |
Index space example (100K records):
- BYTE: 100KB | SHORT: 200KB | INTEGER: 400KB (best balance) | LONG: 800KB | STRING(20): 2MB+
Why this matters:
- Smaller types = more keys per page = better cache performance
- Fixed-size types = faster comparison = better query speed
- Choose INTEGER for most IDs (handles 2 billion values, compact, fast)
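The space figures above follow directly from key width times record count; a quick back-of-envelope check (raw key bytes only, ignoring page headers and RID payloads):
# Rough key-storage estimate: bytes per key * number of records.
# Real index files add page metadata and RID pointers on top of this.
type_widths = {"BYTE": 1, "SHORT": 2, "INTEGER": 4, "LONG": 8, "STRING(20)": 20}
records = 100_000
for name, width in type_widths.items():
    print(f"{name}: ~{records * width / 1024:.0f} KB of raw key data")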
Analysis Queries¶
The example includes comprehensive data analysis; the results shown below are from the small dataset:
Record Counts¶
SELECT count(*) as count FROM Movie # 9,742 movies
SELECT count(*) as count FROM Rating # 100,836 ratings
SELECT count(*) as count FROM Link # 9,742 links
SELECT count(*) as count FROM Tag # 3,683 tags
Rating Statistics¶
SELECT
count(*) as total_ratings,
avg(rating) as avg_rating,
min(rating) as min_rating,
max(rating) as max_rating
FROM Rating
# Results: 100,836 ratings, avg 3.50 ★, range 0.5-5.0
Rating Distribution¶
SELECT rating, count(*) as count
FROM Rating
GROUP BY rating
ORDER BY rating
# Results:
# 0.5 ★ : 1,370
# 1.0 ★ : 2,811
# ...
# 4.0 ★ : 26,818 (most common)
# 5.0 ★ : 13,211
Top Genres¶
SELECT genres, count(*) as count
FROM Movie
WHERE genres <> '(no genres listed)'
GROUP BY genres
ORDER BY count DESC
LIMIT 10
# Results: Drama (1,022), Comedy (922), Comedy|Drama (429), ...
Most Active Users¶
SELECT userId, count(*) as rating_count
FROM Rating
GROUP BY userId
ORDER BY rating_count DESC
LIMIT 10
# Results: User 414 (2,698 ratings), User 599 (2,478), ...
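Any of these queries can be run from Python with the same API used throughout the example; for instance:
# Run the "most active users" analysis and print the results.
rows = db.query("sql", """
    SELECT userId, count(*) as rating_count
    FROM Rating
    GROUP BY userId
    ORDER BY rating_count DESC
    LIMIT 10
""")
for row in rows:
    print(f"User {row.get('userId')}: {row.get('rating_count'):,} ratings")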
Best Practices Demonstrated¶
✅ Type Inference by Java¶
- Java-side SQL import path automatically analyzes data and selects optimal types
- Handles LONG, DOUBLE, STRING intelligently based on actual values
- No manual type inference code needed
- Schema-on-write simplifies development
✅ Schema Definition¶
- Define schema BEFORE import (validation + optimization)
- Use explicit property types (no guessing)
- Choose appropriate types for data ranges
✅ Import Optimization¶
- Use `commit_every` parameter for batching
- Larger batches = faster imports (balance with memory)
- Movies: commit_every=1000 (smaller batches)
- Ratings: commit_every=5000 (larger dataset, bigger batches)
✅ Index Strategy¶
- CREATE INDEXES AFTER IMPORT (2-3x faster total time)
- Use composite indexes for multi-column queries
- Order matters: most selective column first (illustrated in the sketch below)
- Index creation timing: ~0.2 seconds for 124K records
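A hedged illustration of the column-ordering guideline, assuming the composite index behaves like an ordered (LSM-tree) index where only leading-column prefixes are matched:
# The Step 9 composite index on Rating (userId, movieId) serves queries
# that filter on the leading column or on the full prefix:
db.query("sql", "SELECT FROM Rating WHERE userId = 414")                  # index used
db.query("sql", "SELECT FROM Rating WHERE userId = 414 AND movieId = 1")  # index used
# A movieId-only filter skips the leading column, so this composite index
# cannot serve it; it would need its own index on movieId:
db.query("sql", "SELECT FROM Rating WHERE movieId = 500")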
✅ Performance Measurement¶
- Run queries 10 times for statistical reliability
- Calculate average, standard deviation, min, max
- Compare before/after index performance
- Measure speedup percentages
✅ NULL Handling¶
- Empty CSV cells → SQL NULL automatically
- Check NULL counts after import
- NULL values appear in aggregations (as 'None' string)
- Support NULL across all types (STRING, INTEGER, etc.)
Running the Example¶
cd bindings/python/examples
# Use default (large) dataset - downloads automatically if needed
python 04_csv_import_documents.py
# Use small dataset for quick testing - downloads automatically if needed
python 04_csv_import_documents.py --dataset movielens-small
# Use large dataset explicitly
python 04_csv_import_documents.py --dataset movielens-large
# With custom JVM heap for large datasets
python 04_csv_import_documents.py --dataset movielens-large --heap-size 8g
Command-line options:
- `--dataset {movielens-small,movielens-large}` - Dataset size to use (default: movielens-large)
- `--heap-size SIZE` - JVM max heap size (e.g. `8g`, `4096m`)
- The script automatically downloads the dataset if it doesn't exist
Expected output:
- Automatic dataset download if needed
- Step-by-step import progress
- NULL value detection for all 4 files
- Performance statistics (before/after indexes)
- Data analysis queries with results
- Total time: ~2-3 minutes (movielens-large) or ~5 seconds (movielens-small)
Database location:
- Small dataset: `./my_test_databases/movielens_small_db/`
- Large dataset: `./my_test_databases/movielens_large_db/`
The database is preserved for inspection after the example completes.
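Since the database is preserved, you can reopen it later and query immediately. A sketch; the exact open call depends on the Python bindings' API, so `open_database` below is a placeholder:
# Placeholder entry point: substitute the bindings' actual open call.
db = open_database("./my_test_databases/movielens_small_db")
for record in db.query("sql", "SELECT title FROM Movie ORDER BY movieId LIMIT 3"):
    print(record.get("title"))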
Database size:
- movielens-large: ~2.0 GB database from ~971 MB CSV files (~2.1x expansion)
- movielens-small: ~27 MB database from ~3.2 MB CSV files (~8.4x expansion)
⚠️ Note: Database files are larger than source CSVs due to:
- Index structures (LSM-tree, hash, or other index data depending on what you create)
- Transaction logs and metadata
- Internal data structures for document storage
- WAL (Write-Ahead Log) files for durability
Key Takeaways¶
- ✅ Automatic type inference by Java provides intelligent LONG/DOUBLE/STRING selection
- ✅ Schema-on-write simplifies development (no manual schema creation needed)
- ✅ NULL value handling works seamlessly across all data types (STRING, INTEGER, etc.)
- ✅ Batch processing (`commit_every`) dramatically improves import performance
- ✅ Create indexes AFTER import - 2-3x faster than indexing during import
- ✅ Indexes provide massive performance gains (up to 19,604x speedup!). For exact-match lookups, prefer `UNIQUE_HASH`/`NOTUNIQUE_HASH`; for ranges and ordered scans, prefer `UNIQUE`/`NOTUNIQUE` (LSM_TREE).
- ✅ Statistical validation (10 runs) ensures reliable performance measurements
- ✅ Result validation compares actual data values, not just row counts
- ✅ Multi-bucket architecture creates 15 buckets per type, 1 index file per bucket per property
- ✅ Database persistence - reopen and query immediately, no rebuild needed!
Next Steps¶
- Try Example 05: Graph import (MovieLens as vertices and edges)
- Experiment: Modify `commit_every` values to see performance impact
- Add queries: Try your own analysis queries on the dataset
- Index tuning: Create different index combinations and measure speedup
- Type testing: Change type inference rules and observe import behavior
Related Examples¶
- Example 01 - Simple Document Store - Document basics and CRUD
- Example 02 - Social Network Graph - Graph modeling with NULL handling
- Example 03 - Vector Search - Semantic similarity search
Dataset License: MovieLens data is provided by GroupLens Research and is free to use for educational purposes. See https://grouplens.org/datasets/movielens/ for details.