Example 15: Import Database vs Transactional Table Ingest
This example compares four table-ingest strategies against the same generated dataset shape.
Overview
Example 15 is the table-ingest comparison harness for embedded Python.
- Generates synthetic multi-table CSV data
- Runs four ingest modes:
  - transactional SQL inserts in batches
  - async SQL inserts via the async executor
  - SQL `IMPORT DATABASE`
  - Python `db.import_documents(...)`
- Checks final row-count parity before treating the timing result as valid
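The batched transactional path can be sketched against a stand-in database. The real example targets the embedded database's Python bindings; SQLite from the standard library is used here only to show the same commit-per-batch shape:

```python
import sqlite3

def batched_insert(conn, table, rows, batch_size=10_000):
    """Insert rows in fixed-size batches, committing once per batch."""
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cur.executemany(f"INSERT INTO {table} (id, payload) VALUES (?, ?)", batch)
        conn.commit()  # one transaction per batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t0 (id INTEGER PRIMARY KEY, payload TEXT)")
rows = [(i, "x" * 16) for i in range(25_000)]
batched_insert(conn, "t0", rows, batch_size=10_000)
print(conn.execute("SELECT COUNT(*) FROM t0").fetchone()[0])  # 25000
```

Committing per batch rather than per row is what distinguishes this mode from naive row-at-a-time ingest; the batch size is the same knob exposed as `--batch-size` below.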
Current Repository Guidance
- This example exists because ingest winners are workload-dependent
- Both `IMPORT DATABASE` and `db.import_documents(...)` are possible ingestion paths
- Both importer-based paths have shown practical issues on larger real workloads, including memory pressure and possible OOM failure modes
- Async SQL insert is the recommended default for Python-managed bulk table/document ingest in this repository
- Keep async SQL on a single worker for this path; multi-threaded async insert has not been safe or reliable in the current Python examples
- The broader example set therefore standardizes on async SQL insert for document preloading
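The recommended single-worker shape can be illustrated with a plain `asyncio` producer/consumer queue. The insert call is stubbed out here, so this is only a sketch of the control flow, not the repository's actual async executor:

```python
import asyncio

async def async_ingest(rows, batch_size, workers=1):
    """Feed batches through a queue consumed by `workers` tasks.
    Repository guidance is workers=1; higher counts have not been
    reliable for this path in the current Python examples."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    inserted = []

    async def worker():
        while True:
            batch = await queue.get()
            if batch is None:  # sentinel: no more work
                break
            await asyncio.sleep(0)  # stand-in for the real async SQL INSERT
            inserted.extend(batch)

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    for start in range(0, len(rows), batch_size):
        await queue.put(rows[start:start + batch_size])
    for _ in range(workers):
        await queue.put(None)
    await asyncio.gather(*tasks)
    return len(inserted)

print(asyncio.run(async_ingest(list(range(30_000)), 10_000)))  # 30000
```

The bounded queue keeps batch generation from running arbitrarily ahead of the inserts, which matters for memory on large workloads.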
Recent Benchmark Snapshot
For this shape:

- `tables=10`
- `rows-per-table=1,000,000`
- `columns=20` plus `id`
- `string-size=128`
- `batch-size=10,000`
- `heap-size=8g`

Measured times:

- Transactional INSERT: 189.921 s
- Async SQL INSERT: 146.670 s
- `IMPORT DATABASE` with `--parallel 1`: 58.281 s
- `IMPORT DATABASE` with `--parallel 4`: 59.868 s
For this synthetic workload, `IMPORT DATABASE` remained the fastest option, but increasing import parallelism from 1 to 4 did not provide a meaningful speedup.
That benchmark result should not be treated as the repository-wide recommendation. For the real document-preload examples, single-worker async SQL insert is the preferred path because it has been more reliable in practice than the importer-based paths.
Run
From bindings/python/examples:
python 15_import_database_vs_transactional_table_ingest.py \
--tables 4 \
--rows-per-table 100000 \
--columns 20 \
--string-size 64 \
--batch-size 10000 \
--async-parallel 1 \
--parallel 8 \
--heap-size 4g
Key Options

- `--tables`: number of generated tables
- `--rows-per-table`: rows per table
- `--columns`: extra columns per table in addition to `id`
- `--string-size`: generated string payload size
- `--batch-size`: ingest batch size
- `--async-parallel`: async SQL worker count; keep this at `1` for the recommended path
- `--parallel`: SQL import worker count
- `--import-chunk-rows`: chunk size for the `import_documents` benchmark path
- `--heap-size`: JVM heap size
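These flags could be wired up with `argparse` roughly as follows. The flag names come from the example above; the defaults shown here are illustrative (taken from the Run invocation), not necessarily the script's actual defaults:

```python
import argparse

def build_parser():
    # Flag names match the example script; defaults here are illustrative.
    p = argparse.ArgumentParser(description="Table-ingest comparison harness")
    p.add_argument("--tables", type=int, default=4)
    p.add_argument("--rows-per-table", type=int, default=100_000)
    p.add_argument("--columns", type=int, default=20)
    p.add_argument("--string-size", type=int, default=64)
    p.add_argument("--batch-size", type=int, default=10_000)
    p.add_argument("--async-parallel", type=int, default=1)
    p.add_argument("--parallel", type=int, default=8)
    p.add_argument("--import-chunk-rows", type=int, default=100_000)
    p.add_argument("--heap-size", default="4g")
    return p

args = build_parser().parse_args(["--tables", "2", "--heap-size", "8g"])
print(args.tables, args.async_parallel, args.heap_size)  # 2 1 8g
```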
Parity Semantics
Timing comparisons are only meaningful if all four modes load the expected final row counts across the generated tables.
The example also performs a lightweight post-run parity check over schema, aggregates, and sampled rows so that major output differences are reported alongside the timing results.
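The row-count side of that check can be sketched as follows, again against SQLite as a stand-in; the real example runs over the embedded database and additionally compares schema, aggregates, and sampled rows:

```python
import sqlite3

def row_count_parity(conn, tables, expected_rows):
    """Return (ok, counts): ok is True only if every table holds
    exactly expected_rows rows."""
    counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
              for t in tables}
    return all(c == expected_rows for c in counts.values()), counts

conn = sqlite3.connect(":memory:")
for t in ("t0", "t1"):
    conn.execute(f"CREATE TABLE {t} (id INTEGER)")
    conn.executemany(f"INSERT INTO {t} VALUES (?)", [(i,) for i in range(100)])
ok, counts = row_count_parity(conn, ["t0", "t1"], 100)
print(ok, counts)  # True {'t0': 100, 't1': 100}
```

Returning the per-table counts alongside the boolean lets a harness report which table diverged, rather than just failing the whole timing run.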