Example 07: Stack Overflow Tables (OLTP)¶

View source code

This example loads the Stack Overflow XML tables into a relational-style layout and runs a point-oriented CRUD workload.

Overview¶

Example 07 is the table OLTP benchmark for the Stack Overflow dataset.

It loads eight XML-derived tables.
It uses a fixed mixed workload: 60% read, 20% update, 10% insert, 10% delete.
Every operation picks one table and one target row at random.
It measures preload time, index time, OLTP latency, throughput, disk usage, and RSS.

Source Tables¶

The example defines these tables directly in TABLE_DEFS.

Table	Source file	Columns
`User`	`Users.xml`	`Id`, `Reputation`, `CreationDate`, `DisplayName`, `LastAccessDate`, `WebsiteUrl`, `Location`, `AboutMe`, `Views`, `UpVotes`, `DownVotes`, `AccountId`
`Post`	`Posts.xml`	`Id`, `PostTypeId`, `ParentId`, `AcceptedAnswerId`, `CreationDate`, `Score`, `ViewCount`, `Body`, `OwnerUserId`, `LastEditorUserId`, `LastEditorDisplayName`, `LastEditDate`, `LastActivityDate`, `Title`, `Tags`, `AnswerCount`, `CommentCount`, `FavoriteCount`, `ClosedDate`, `CommunityOwnedDate`
`Comment`	`Comments.xml`	`Id`, `PostId`, `Score`, `Text`, `CreationDate`, `UserId`
`Badge`	`Badges.xml`	`Id`, `UserId`, `Name`, `Date`, `Class`, `TagBased`
`Vote`	`Votes.xml`	`Id`, `PostId`, `VoteTypeId`, `CreationDate`, `UserId`, `BountyAmount`
`PostLink`	`PostLinks.xml`	`Id`, `CreationDate`, `PostId`, `RelatedPostId`, `LinkTypeId`
`Tag`	`Tags.xml`	`Id`, `TagName`, `Count`, `ExcerptPostId`, `WikiPostId`
`PostHistory`	`PostHistory.xml`	`Id`, `PostHistoryTypeId`, `PostId`, `RevisionGUID`, `CreationDate`, `UserId`, `UserDisplayName`, `Comment`, `Text`, `CloseReasonId`

Each backend also creates a unique Id index for every table.

Supported Backends¶

arcadedb_sql
sqlite
duckdb
postgresql

Run¶

From bindings/python/examples:

python 07_stackoverflow_tables_oltp.py \
  --dataset stackoverflow-tiny \
  --db arcadedb_sql \
  --threads 1 \
  --transactions 10000 \
  --batch-size 10000 \
  --mem-limit 4g \
  --verify-single-thread-series

Workload Model¶

The operation planner is fixed in source as:

DEFAULT_OLTP_MIX = {"read": 0.60, "update": 0.20, "insert": 0.10, "delete": 0.10}

The selected table for each operation is random across all eight tables.

Read Projections¶

Reads always filter on Id, but the projected columns vary by table. The source code uses get_read_projection(), which chooses one of two projections per table.

Table	Projection A	Projection B
`User`	`Id, Reputation, CreationDate`	`Id, Reputation, DisplayName`
`Post`	`Id, PostTypeId, ParentId`	`Id, PostTypeId, AcceptedAnswerId`
`Comment`	`Id, PostId, Score`	`Id, PostId, Text`
`Badge`	`Id, UserId, Name`	`Id, UserId, Date`
`Vote`	`Id, PostId, VoteTypeId`	`Id, PostId, CreationDate`
`PostLink`	`Id, CreationDate, PostId`	`Id, CreationDate, RelatedPostId`
`Tag`	`Id, TagName, Count`	`Id, TagName, ExcerptPostId`
`PostHistory`	`Id, PostHistoryTypeId, PostId`	`Id, PostHistoryTypeId, RevisionGUID`

Update Targets¶

The update column is the first non-Id field whose declared type is INTEGER or BOOLEAN.

Table	Updated column
`User`	`Reputation`
`Post`	`PostTypeId`
`Comment`	`PostId`
`Badge`	`UserId`
`Vote`	`PostId`
`PostLink`	`PostId`
`Tag`	`Count`
`PostHistory`	`PostHistoryTypeId`

Insert Payloads¶

Insert rows are synthetic. The source uses build_synthetic_row():

Id is the next generated integer for that table.
INTEGER fields get a random value in [1, 1000].
BOOLEAN fields get a random boolean.
DATETIME fields get the current UTC timestamp.
string fields get synthetic_<Table>_<Field>_<Id>.

Query Suite¶

ArcadeDB SQL¶

The ArcadeDB path issues SQL directly.

ArcadeDB Read¶

SELECT {projection} FROM {table_name} WHERE Id = {target_id}

ArcadeDB Update¶

UPDATE {table_name} SET {update_col} = coalesce({update_col}, 0) + 1 WHERE Id = {target_id}

ArcadeDB Insert¶

INSERT INTO {table_name} SET {col1} = ?, {col2} = ?, ...

The same statement shape is used both for preload batches and OLTP inserts.

ArcadeDB Delete¶

DELETE FROM {table_name} WHERE Id = {target_id}

SQLite¶

The SQLite path uses parameterized SQL with quoted identifiers.

SQLite Read¶

SELECT {projection} FROM "{table_name}" WHERE "Id" = ?

SQLite Update¶

UPDATE "{table_name}" SET "{update_col}" = coalesce("{update_col}", 0) + 1 WHERE "Id" = ?

SQLite Insert¶

INSERT INTO "{table_name}" ("col1", "col2", ...) VALUES (?, ?, ...)

SQLite Delete¶

DELETE FROM "{table_name}" WHERE "Id" = ?

DuckDB¶

The DuckDB OLTP path uses the same SQL shapes as SQLite, with explicit transactions.

DuckDB Read¶

SELECT {projection} FROM "{table_name}" WHERE "Id" = ?

DuckDB Update¶

UPDATE "{table_name}" SET "{update_col}" = coalesce("{update_col}", 0) + 1 WHERE "Id" = ?

DuckDB Insert¶

INSERT INTO "{table_name}" ("col1", "col2", ...) VALUES (?, ?, ...)

DuckDB Delete¶

DELETE FROM "{table_name}" WHERE "Id" = ?

PostgreSQL¶

The PostgreSQL path uses quoted identifiers and %s parameters.

Read¶

SELECT {projection_sql} FROM "{table_name}" WHERE "Id" = %s

The projection_sql string is built by quoting each projected column returned by get_read_projection().

Update¶

UPDATE "{table_name}" SET "{update_col}" = coalesce("{update_col}", 0) + 1 WHERE "Id" = %s

Insert¶

INSERT INTO "{table_name}" ("col1", "col2", ...) VALUES (%s, %s, ...)

Delete¶

DELETE FROM "{table_name}" WHERE "Id" = %s

Preload Paths¶

The benchmark does not use the same preload mechanism for every backend.

ArcadeDB preload uses async INSERT INTO ... SET ... SQL.
SQLite preload uses batched INSERT INTO ... VALUES ... statements.
DuckDB preload uses per-table CSV materialization followed by:

COPY "{table_name}" FROM '{csv_path}' (AUTO_DETECT TRUE, HEADER TRUE)

PostgreSQL preload writes CSV and then streams it through:

COPY "{table_name}" ("col1", "col2", ...) FROM STDIN WITH (FORMAT CSV, HEADER TRUE)

Those load-path differences matter for ingest timing, but they do not change the OLTP CRUD statements listed above.

Result Notes¶

du_mib is real post-run filesystem usage.
disk_after_* fields are benchmark-reported logical size counters.
Per-operation latency is summarized from the recorded read, update, insert, and delete latency buckets.
--verify-single-thread-series checks deterministic repeatability for one backend configuration, not strict equality across different engines.