Data Import Guide¶
This guide covers the currently available import workflows in the Python bindings.
The current bindings are SQL-first. The old broad Python Importer, import_csv(),
and import_xml() surface is intentionally not part of the current public API. The
remaining import surface is SQL IMPORT DATABASE plus the narrow
db.import_documents(...) helper for document-file loads, but this repository still
does not encourage leaning on importer-based paths heavily from Python.
Use it when you need its supported file-import behavior or a full
EXPORT DATABASE + IMPORT DATABASE restore flow. For large Python-side table/document
ingest, prefer async SQL with a single async worker.
Overview¶
The currently exposed importer paths are ArcadeDB's native SQL importer plus a narrow Python wrapper for document imports.
- CSV document imports
- CSV graph imports for vertices and edges
- XML imports
- Neo4j imports
- Word2Vec imports for vector workflows
- RDF imports
- Timeseries imports
- Full ArcadeDB JSONL restore
All of these are exercised through db.command("sql", "IMPORT DATABASE ...") in
the current test suite, but test coverage should not be read as a recommendation to make
this your default Python ingest path.
db.import_documents(...) is also covered by dedicated API tests, but it should be read
the same way: supported, not currently encouraged as the default Python ingest path.
Bulk Ingest Recommendation¶
IMPORT DATABASE is available, but it is not the recommended path for very large
Python-side bulk ingest workloads in this repository right now, and more broadly it is
not something we currently encourage as the default Python import story.
- Example 15 plus the larger table examples are the basis for the current repository guidance.
- For bulk table/document ingest, async SQL with
--async-parallel 1is the recommended default. - Do not rely on multi-threaded async SQL insert for this path in the current Python examples. It has not been safe or reliable in testing.
db.import_documents(...)exists for document-shaped file import convenience, but in current Python testing it has also shown reliability problems under heavier loads.- Reserve
IMPORT DATABASEfor supported import formats, restore flows, and cases where you explicitly need that importer behavior from Python. - For bulk graph ingest, use
GraphBatchinstead; see Example 16 and the graph examples.
Example 15 and 16 Benchmark Structure¶
Example 15 and 16 are the repository's focused ingest comparison harnesses.
- Example 15 includes the
db.import_documents(...)wrapper in addition to the main SQL-based ingestion paths. - Example 15 targets multi-table/document ingest.
- Example 16 targets graph ingest with vertices plus edges.
- Both examples enforce parity checks on the final loaded data so timing comparisons are only accepted when the result counts match the expected shape.
- Example 15 is useful for comparison, but the repository recommendation for bulk table/document ingest is still single-worker async SQL.
Quick Start¶
Import CSV as Documents¶
from pathlib import Path
import arcadedb_embedded as arcadedb
def file_url(path: str) -> str:
return Path(path).resolve().as_uri()
with arcadedb.create_database("./mydb") as db:
db.command("sql", "CREATE DOCUMENT TYPE Movie")
db.command(
"sql",
f"IMPORT DATABASE {file_url('./movies.csv')} WITH documentType = 'Movie', commitEvery = 5000",
)
Import CSV as Vertices and Edges¶
from pathlib import Path
import arcadedb_embedded as arcadedb
def file_url(path: str) -> str:
return Path(path).resolve().as_uri()
with arcadedb.create_database("./graphdb") as db:
db.command("sql", "CREATE VERTEX TYPE Person")
db.command("sql", "CREATE EDGE TYPE Follows")
db.command(
"sql",
(
"IMPORT DATABASE WITH "
f"vertices = '{file_url('./people.csv')}', "
"vertexType = 'Person', "
"typeIdProperty = 'id', "
"typeIdType = 'Long', "
"typeIdUnique = true"
),
)
db.command(
"sql",
(
"IMPORT DATABASE WITH "
f"edges = '{file_url('./follows.csv')}', "
"edgeType = 'Follows', "
"typeIdProperty = 'id', "
"typeIdType = 'Long', "
"edgeFromField = 'from', "
"edgeToField = 'to'"
),
)
Restore an ArcadeDB Export¶
with arcadedb.create_database("./restored") as db:
db.command(
"sql",
"IMPORT DATABASE file:///exports/mydb.jsonl.tgz WITH commitEvery = 50000",
)
Choosing the Right Import Mode¶
CSV¶
Use CSV imports for flat tabular data, graph vertex/edge feeds, and time-series style ingestion.
Best fit:
- spreadsheet-style datasets
- relational exports
- graph nodes and edges split across files
- moderate-size file imports and format-driven imports
ArcadeDB JSONL Export¶
Use JSONL exports when you want a full database restore with schema and data intact.
Best fit:
- environment-to-environment migration
- backups and restore drills
- reproducible benchmark datasets
XML, RDF, Neo4j, Word2Vec¶
These formats are also driven through SQL IMPORT DATABASE. See the import tests for
working examples of the exact option sets used in the bindings.
Schema Strategy¶
Recommended: Pre-create the Schema¶
Create types, properties, and indexes before importing when you care about data types, constraints, and predictable query plans.
with arcadedb.create_database("./mydb") as db:
db.command("sql", "CREATE DOCUMENT TYPE User")
db.command("sql", "CREATE PROPERTY User.id LONG")
db.command("sql", "CREATE PROPERTY User.email STRING")
db.command("sql", "CREATE INDEX ON User (id) UNIQUE")
db.command("sql", "CREATE INDEX ON User (email) UNIQUE")
db.command(
"sql",
"IMPORT DATABASE file:///data/users.csv WITH documentType = 'User', commitEvery = 10000",
)
Fast Start: Let the Import Create the Target Type¶
For exploratory work, you can import into a new type with minimal setup. This is quick, but you give up explicit control over schema shape and validation.
Performance Guidance¶
Tune commitEvery¶
Use larger commit batches for throughput and smaller ones for lower memory pressure when you do choose the SQL import path.
| Dataset Size | Recommended commitEvery |
|---|---|
| < 100K rows | 1,000 to 10,000 |
| 100K to 1M rows | 10,000 to 50,000 |
| > 1M rows | 50,000 to 100,000 |
Drop Heavy Indexes Before Bulk Loads¶
For large one-shot imports, remove expensive indexes first and recreate them afterward.
For the largest Python benchmark ingest paths in this repo, prefer async SQL with a
single async worker instead of leaning on IMPORT DATABASE.
db.command("sql", "DROP INDEX `User[email]`")
db.command(
"sql",
"IMPORT DATABASE file:///data/users.csv WITH documentType = 'User', commitEvery = 50000",
)
db.command("sql", "CREATE INDEX ON User (email) UNIQUE")
Validate the Input Up Front¶
Check required columns and data cleanliness before starting a long import run. The SQL importer will fail fast on malformed configurations, but it is still cheaper to catch obvious CSV issues before JVM work starts.
Relationship Mapping¶
For graph imports, load vertices first, then edges, and use matching ID fields.
db.command(
"sql",
(
"IMPORT DATABASE WITH "
"vertices = 'file:///data/users.csv', "
"vertexType = 'User', "
"typeIdProperty = 'id', "
"typeIdType = 'Long', "
"typeIdUnique = true"
),
)
db.command(
"sql",
(
"IMPORT DATABASE WITH "
"edges = 'file:///data/follows.csv', "
"edgeType = 'Follows', "
"typeIdProperty = 'id', "
"typeIdType = 'Long', "
"edgeFromField = 'from', "
"edgeToField = 'to'"
),
)
See Also¶
- Import Workflow Reference - Supported SQL import surface
- Import Examples - Practical examples
- Database API - Database operations
- Transactions - Transaction management