Architecture¶
Technical documentation for the ArcadeDB Python bindings architecture, JPype integration, and implementation details.
Overview¶
The ArcadeDB Python bindings are a thin wrapper around the ArcadeDB Java library using JPype for JVM integration. This design provides:
- Full API Coverage: Access to all ArcadeDB features
- Performance: Minimal Python overhead
- Maintenance: Automatic feature parity with Java releases
- Type Safety: Python type hints with Java type conversion
Module Structure¶
arcadedb_embedded/
├── __init__.py # Package exports and version
├── jvm.py # JVM startup (bundled JRE, JAR discovery)
├── core.py # Database, DatabaseFactory, convenience helpers
├── graph.py # Document, Vertex, Edge wrappers
├── schema.py # Schema/Index/Property helpers
├── type_conversion.py # Java ↔ Python value conversion
├── async_executor.py # Async bulk executor wrapper
├── batch.py # BatchContext (async-powered bulk helper)
├── importer.py # Importer (CSV/XML)
├── exporter.py # Export (JSONL/GraphML/GraphSON/CSV)
├── vector.py # VectorIndex + array helpers
├── results.py # ResultSet, Result (query results)
├── transactions.py # TransactionContext (ACID guard)
├── server.py # ArcadeDBServer (HTTP/Studio)
└── exceptions.py # ArcadeDBError (unified exceptions)
Module Responsibilities¶
__init__.py
- Central export surface (Database, AsyncExecutor, BatchContext, Schema, Importer, Exporter, VectorIndex, converters)
- Version metadata
jvm.py
- Starts JVM using bundled JRE and packaged JARs
- Honors
ARCADEDB_JVM_ARGS/ARCADEDB_JVM_ERROR_FILE - Refuses to start twice in a process
core.py
DatabaseFactory: create/open databasesDatabase: queries/commands, transactions, lookups, vector index builder- Convenience:
async_executor(),batch_context(),schema, export helpers
graph.py
- Record wrappers:
Document,Vertex,Edge - Property helpers,
new_edge(), type-aware wrapping from Java records
schema.py
Schema: type/property/index managementIndexType,PropertyTypeenums
type_conversion.py
convert_java_to_python/convert_python_to_java- Datetime/Decimal/collection handling
async_executor.py
AsyncExecutor: parallel record creation, commitEvery, WAL tuning
batch.py
BatchContext: high-level bulk helper on top ofAsyncExecutor
importer.py
Importer: CSV/XML imports (documents/vertices/edges), FK resolution, commitEveryimport_csv/import_xmlconvenience wrappers
exporter.py
export_database: JSONL/GraphML/GraphSONexport_to_csv: serialize ResultSet/list to CSV
vector.py
VectorIndex: JVector-based ANN searchto_java_float_array/to_python_array
results.py
ResultSet: iterator, chunking, DataFrame exportResult: property access with conversion
transactions.py
TransactionContext: context-managed begin/commit/rollback
server.py
ArcadeDBServer: HTTP/Studio server lifecycle, db management
exceptions.py
ArcadeDBError: unified exception wrapper
JPype Integration¶
JVM Lifecycle¶
def start_jvm():
if jpype.isJVMStarted():
return
# Locate bundled JRE + packaged JARs
jvm_path = get_bundled_jre_lib_path()
jar_files = glob.glob(os.path.join(get_jar_path(), "*.jar"))
# Memory/flags can be overridden via ARCADEDB_JVM_ARGS
jvm_args = os.environ.get("ARCADEDB_JVM_ARGS")
if jvm_args:
args = jvm_args.split()
else:
args = ["-Xmx4g", "-Djava.awt.headless=true"]
# Crash log location (default ./log/hs_err_pid%p.log)
error_file = os.environ.get("ARCADEDB_JVM_ERROR_FILE", "./log/hs_err_pid%p.log")
args.append(f"-XX:ErrorFile={error_file}")
# Single-shot startup per process
jpype.startJVM(jvm_path, *args, classpath=os.pathsep.join(jar_files))
JVM Startup:
- Uses the bundled JRE inside the wheel (no system JVM required)
- Loads packaged ArcadeDB JARs from
arcadedb_embedded/jars - Configurable via env vars before first import (
ARCADEDB_JVM_ARGS,ARCADEDB_JVM_ERROR_FILE) - JVM stays live for the process lifetime and cannot be restarted
Implications:
- Set memory/GC flags before importing
arcadedb_embedded - Tests that need different JVM args must run in separate processes
- Server and embedded modes share the same in-process JVM
Type Conversion¶
Python → Java:
# String
python_str = "hello"
java_str = jpype.JString(python_str)
# Array
python_array = [1.0, 2.0, 3.0]
java_array = jpype.JArray(jpype.JFloat)(python_array)
# NumPy → Java (vectors)
import numpy as np
from arcadedb_embedded import to_java_float_array
numpy_array = np.array([1.0, 2.0, 3.0], dtype=np.float32)
java_array = to_java_float_array(numpy_array)
Java → Python:
# Automatic for primitives
java_int = some_java_method() # Returns Java int
python_int = int(java_int) # Automatic conversion
# Manual for complex types
java_list = some_java_method()
python_list = [item for item in java_list]
# Java array → NumPy
from arcadedb_embedded import to_python_array
java_array = vertex.get("embedding")
numpy_array = to_python_array(java_array)
Type Mapping:
| Python Type | Java Type | Notes |
|---|---|---|
str |
String |
Manual with JString() |
int |
Integer/Long |
Automatic |
float |
Float/Double |
Automatic |
bool |
Boolean |
Automatic |
None |
null |
Automatic |
list |
ArrayList |
Manual conversion |
dict |
HashMap |
Manual conversion |
np.ndarray |
float[] |
via to_java_float_array() |
Memory Management¶
Garbage Collection:
- Python GC: Manages Python objects
- Java GC: Manages Java objects
- JPype: Bridges both, uses Java GC for wrapped objects
Best Practices:
# Good: Explicit cleanup
db = arcadedb.open_database("./mydb")
try:
# Use database
pass
finally:
db.close()
# Better: Context manager
db = arcadedb.open_database("./mydb")
with db.transaction():
# Work with database
db.close()
# Long-running processes: Periodic GC
import gc
for batch in large_dataset:
process_batch(batch)
gc.collect() # Trigger Python GC
Memory Leaks:
- Holding references to Java objects prevents GC
- Large ResultSets should be consumed and released
- Server mode: Monitor JVM heap usage
Class Hierarchy¶
DatabaseFactory (core.py)
├─ create_database() / open_database() / database_exists()
└─ returns Database
Database (core.py)
├─ query()/command() → ResultSet | None
├─ begin()/commit()/rollback()/transaction() → TransactionContext
├─ new_vertex()/new_document() → Vertex | Document
├─ lookup_by_key()/lookup_by_rid()
├─ create_vector_index() → VectorIndex
├─ async_executor() → AsyncExecutor (async_executor.py)
├─ batch_context() → BatchContext (batch.py)
├─ schema → Schema (schema.py)
├─ export_database()/export_to_csv()
└─ close()/is_open()
Schema (schema.py)
├─ create_document_type()/create_vertex_type()/create_edge_type()
├─ get_or_create_* helpers
└─ create_property()/create_index()
AsyncExecutor (async_executor.py)
├─ set_parallel_level()/set_commit_every()/set_back_pressure()
├─ create_record()/update_record()/delete_record()/create_edge_between()
└─ wait_completion()/close()
BatchContext (batch.py)
├─ create_vertex()/create_document()/create_edge()
├─ set_total()/get_errors()
└─ context-managed wait/cleanup
Record wrappers (graph.py)
├─ Vertex → new_edge(), modify(), get_out_edges() (todo), property helpers
├─ Edge → get_out(), get_in(), modify()
└─ Document → get()/set()/save()/delete()/modify(), to_dict(), get_rid()
ResultSet (results.py)
├─ iterator protocol
├─ to_list()/to_dataframe()/iter_chunks()/count()/first()/one()
└─ wraps Result objects
Result (results.py)
├─ has_property()/get()
└─ to_dict()/to_json()
Threading Model¶
Thread Safety¶
Database:
Databaseinstances are NOT thread-safe- Each thread needs its own
Databaseinstance - Transactions are thread-local
Async executor & batch:
AsyncExecutoris thread-safe; it runs callbacks on Java worker threadsBatchContextwraps a singleDatabase+AsyncExecutor, so do not share it across threads
Example:
import threading
import arcadedb_embedded as arcadedb
def worker(db_path, worker_id):
"""Worker thread with own database instance."""
db = arcadedb.open_database(db_path)
try:
with db.transaction():
vertex = db.new_vertex("Worker")
vertex.set("id", worker_id)
vertex.save()
finally:
db.close()
# Spawn workers
threads = []
for i in range(5):
t = threading.Thread(target=worker, args=("./mydb", i))
t.start()
threads.append(t)
for t in threads:
t.join()
Server Mode:
ArcadeDBServeris thread-safe- HTTP requests handled by internal thread pool
- Each request gets isolated transaction
Multiprocessing¶
Safe:
- Separate processes with separate JVMs
- No shared state
- Ideal for parallel imports
Example:
import multiprocessing as mp
import arcadedb_embedded as arcadedb
def process_chunk(db_path, chunk):
"""Process chunk in separate process."""
# Each process has own JVM
db = arcadedb.open_database(db_path)
with db.transaction():
for record in chunk:
vertex = db.new_vertex("Data")
vertex.set("data", record)
vertex.save()
db.close()
# Split work across processes
if __name__ == "__main__":
chunks = split_data_into_chunks()
with mp.Pool(processes=4) as pool:
pool.starmap(process_chunk,
[("./mydb", chunk) for chunk in chunks])
Performance Considerations¶
Bottlenecks¶
- JVM Boundary Crossing
- Cost: ~1-10 μs per Java method call
- Impact: High-frequency calls (loops)
- Solution: Batch operations, use Java bulk APIs
- Type Conversion
- Cost: Varies by type (arrays expensive)
- Impact: Large data transfers
- Solution: Minimize conversions, use efficient formats
- Transaction Overhead
- Cost: ~100 μs per transaction
- Impact: Many small transactions
- Solution: Batch into larger transactions
Optimization Strategies¶
Batch Operations:
# Bad: Many small transactions
for record in records:
with db.transaction():
vertex = db.new_vertex("Data")
vertex.set("data", record)
vertex.save()
# 1000 records = 1000 transactions
# Good: One large transaction
with db.transaction():
for record in records:
vertex = db.new_vertex("Data")
vertex.set("data", record)
vertex.save()
# 1000 records = 1 transaction
Query Optimization:
# Bad: N+1 queries
users = db.query("sql", "SELECT FROM User")
for user in users:
# Separate query per user!
orders = db.query("sql", f"SELECT FROM Order WHERE user_id = '{user.get('id')}'")
# Good: Single query with traversal
result = db.query("sql", """
SELECT
name,
out('Placed').name as orders
FROM User
""")
ResultSet Streaming:
# Bad: Load all results
result = db.query("sql", "SELECT FROM LargeTable")
all_results = list(result) # Loads everything into memory
# Good: Stream results
result = db.query("sql", "SELECT FROM LargeTable")
for row in result:
process(row)
# Only one row in memory at a time
Profiling¶
Python Side:
import cProfile
import pstats
def benchmark():
db = arcadedb.create_database("./bench")
with db.transaction():
for i in range(10000):
vertex = db.new_vertex("Data")
vertex.set("id", i)
vertex.save()
db.close()
# Profile
cProfile.run('benchmark()', 'stats.prof')
stats = pstats.Stats('stats.prof')
stats.sort_stats('cumulative')
stats.print_stats(20)
Java Side:
# Enable JVM profiling
import jpype
# Before starting JVM (in package __init__.py):
jpype.startJVM(
classpath=[jar_path],
convertStrings=False,
# JVM profiling options:
"-XX:+PrintGCDetails",
"-Xloggc:gc.log"
)
Single Package Distribution¶
The Python binding is distributed as a single, self-contained package (arcadedb-embedded).
Features:
- Bundled JRE: Includes a minimal Java 25 Runtime Environment (JRE) via the
arcadedb-embedded-jrepackage. - Full Feature Set: Includes all ArcadeDB engines (SQL, OpenCypher, GraphQL, MongoDB).
- Zero Configuration: No external Java installation required.
# pip install arcadedb-embedded
import arcadedb_embedded as arcadedb
db = arcadedb.create_database("./mydb")
db.query("sql", "SELECT FROM User") # ✓
db.query("opencypher", "MATCH (n) RETURN n") # ✓
db.query("graphql", "{users{name}}") # ✓
db.query("mongodb", "db.User.find()") # ✓
Extension Points¶
Custom Vertex/Edge Classes¶
from arcadedb_embedded import Database
class CustomVertex:
"""Custom vertex wrapper with helper methods."""
def __init__(self, java_vertex):
self._java_vertex = java_vertex
def get_friends(self):
"""Get all friends (out edges of type 'Knows')."""
edges = self._java_vertex.getEdges(
jpype.JClass('com.arcadedata.engine.api.graph.Vertex$DIRECTION').OUT,
"Knows"
)
return [edge.getVertex() for edge in edges]
# Usage with wrapped database
Custom Importers¶
class CustomImporter:
"""Custom import format handler."""
def __init__(self, db):
self.db = db
def import_xml(self, file_path, vertex_type):
"""Import XML format."""
import xml.etree.ElementTree as ET
tree = ET.parse(file_path)
root = tree.getroot()
with self.db.transaction():
for elem in root.findall('.//record'):
vertex = self.db.new_vertex(vertex_type)
for child in elem:
vertex.set(child.tag, child.text)
vertex.save()
# Usage
importer = CustomImporter(db)
importer.import_xml("data.xml", "Data")
Testing¶
Unit Tests¶
import unittest
import arcadedb_embedded as arcadedb
import tempfile
import shutil
class TestDatabase(unittest.TestCase):
def setUp(self):
"""Create temp database for each test."""
self.db_path = tempfile.mkdtemp()
self.db = arcadedb.create_database(self.db_path)
def tearDown(self):
"""Clean up."""
self.db.close()
shutil.rmtree(self.db_path)
def test_create_vertex(self):
"""Test vertex creation."""
with self.db.transaction():
vertex = self.db.new_vertex("User")
vertex.set("name", "Alice")
vertex.save()
result = self.db.query("sql", "SELECT count(*) as count FROM User")
count = result.next().get("count")
self.assertEqual(count, 1)
Integration Tests¶
import pytest
import arcadedb_embedded as arcadedb
@pytest.fixture(scope="module")
def database():
"""Shared database for integration tests."""
db = arcadedb.create_database("./test_db")
yield db
db.close()
def test_graph_traversal(database):
"""Test complex graph operations."""
# Setup
with database.transaction():
alice = database.new_vertex("User")
alice.set("name", "Alice")
alice.save()
bob = database.new_vertex("User")
bob.set("name", "Bob")
bob.save()
edge = alice.new_edge("Knows", bob)
edge.save()
# Test
result = database.query("sql", """
SELECT expand(out('Knows'))
FROM User
WHERE name = 'Alice'
""")
friends = list(result)
assert len(friends) == 1
assert friends[0].get("name") == "Bob"
Build System¶
Package Build¶
# pyproject.toml configuration
[build-system]
requires = ["setuptools>=45", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "arcadedb-embedded"
version = "0.1.0"
dependencies = ["JPype1>=1.4.0", "numpy>=1.20.0"]
JAR Management¶
# setup_jars.py - Download and package JARs
import requests
import os
def download_arcadedb_jars(version):
"""Download ArcadeDB distribution JARs."""
base_url = f"https://repo1.maven.org/maven2/com/arcadedata/"
# Core JAR
jar_url = f"{base_url}arcadedb/{version}/arcadedb-{version}.jar"
download_jar(jar_url, f"arcadedb-{version}.jar")
# Download all dependencies (GraphQL, etc.)
pass
# Called during package build
See Also¶
- Database API Reference - Core database operations
- Troubleshooting - Common issues and solutions
- JPype Documentation - JPype library docs
- ArcadeDB Java API - Underlying Java API