un-fsync'd writes
Hi,
SpacetimeDB is not fsync'ing all of its writes.
Using jepsen.lazyfs, it's possible to run tests that lose un-fsync'd writes, which:
- cause STDB to become unresponsive/hang
- prevent STDB from being restarted due to file corruption
unsynced-data-report
Let's start by just periodically asking LazyFS to report on any un-fsync'd data.
Here's a sample entry showing an un-fsync'd file:
[2026-04-02 03:04:29.797] [global] [info] [lazyfs.fifo]: running LazyFS...
[2026-04-02 03:04:29.873] [global] [info] [lazyfs.faults.worker]: waiting for fault commands...
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.faults.worker]: received 'lazyfs::unsynced-data-report'
[2026-04-02 03:06:51.899] [global] [warning] [lazyfs.cmds]: report request submitted...
[2026-04-02 03:06:51.899] [global] [warning] [lazyfs.cmds]: report generated.
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] is not fully synced, info:
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] info: (block 0) to (block 0) [byte index 0 to index 620]
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] files mapped to this inode:
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] => file: '/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/00000000000000000000.snapshot_bsatn'
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: total number of bytes un-fsynced: 621 bytes.
...
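For reference, the report above is requested by writing the command string into LazyFS's fault FIFO. A minimal sketch, assuming the FIFO lives at /tmp/faults.fifo (the actual path is whatever your LazyFS configuration specifies):

```shell
# Ask LazyFS for an unsynced-data report by writing the command string
# (seen verbatim in the log above) into its fault-command FIFO.
# NOTE: the FIFO path below is an assumption -- use whatever path your
# LazyFS configuration specifies.
FIFO=/tmp/faults.fifo
echo "lazyfs::unsynced-data-report" > "$FIFO"
```

LazyFS then logs the report (as above) rather than replying on the FIFO, so you watch its log file for the output.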
Now let's get a list of files from a test run that were not fsync'd:
grep "file:" unsynced-data-report.log | awk -F "file: " '{print $2}' | sort -u
'/root/.local/share/spacetime/data.lazyfs/data/config.toml'
'/root/.local/share/spacetime/data.lazyfs/data/control-db/conf'
'/root/.local/share/spacetime/data.lazyfs/data/logs/spacetime-standalone.log'
'/root/.local/share/spacetime/data.lazyfs/data/metadata.toml'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/clog/00000000000000000000.stdb.log'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/clog/00000000000000000000.stdb.ofs'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/module_logs/2026-04-19.log'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/00000000000000000000.snapshot_bsatn'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/16/8ac0675ea0741a07c3b2d6bde92f1464f8017929896035a62be0fab889fa13'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/35/363d07adb65d25ca55e3f951a5b70b75daa588520a9df38b5b11dfc40d70f3'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/3d/88776149b8100f9e751d9b361982173799b47a95eae398d5db47f189492905'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/47/94421cb4cd34574596f00507771ddb2075febc66a5283ba16e89875a4c3414'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/78/f2e8140cf5d713c997f3fffbf62932be883760b4182871165b98a974e3b977'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/7f/daf20309f427660a96330b203af22cf04cf5cd7c89977828441cfa04d92300'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/aa/88b3f62336775a03d48aff5085cd0e127d6ccf860fa93b50e21f4a0e7afc10'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/ac/b9d59241f0d3602da7427b2461b812b1e3e0d79c50fe6fbd1c17471597c21d'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/bc/342a9d872ed09549a5610d55cef4cea9dd3232b2453d7318a96115f5bac511'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/ef/4fd96b1b9dfed141369b7cf0336a68f14416bbc104cf26290777dc3a3207ae'
'/root/.local/share/spacetime/data.lazyfs/data/spacetime.pid'
Sample unsynced-data-report.log.
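As a sanity check on the extraction, here's an equivalent pipeline run over one synthetic line in the report format shown above (the inode and path are made up for illustration):

```shell
# Feed one synthetic report-format line through the same extraction
# pipeline to confirm what the awk field split produces.
printf "%s\n" "[info] [lazyfs.cmds]: report: [inode 1] => file: '/data/metadata.toml'" \
  | grep "file:" | awk -F "file: " '{print $2}' | sort -u
# -> '/data/metadata.toml'
```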
lose-unfsynced-writes
Now let's run a test that periodically loses un-fsync'd writes.
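Mechanically, this amounts to sending LazyFS its clear-cache command on a schedule. A minimal sketch, assuming the same FIFO path as before; the 1-second interval and three iterations are purely illustrative (a real test schedule would be longer and randomized):

```shell
# Periodically drop all un-fsync'd writes by sending LazyFS's clear-cache
# command (the command string is taken from the log below).
# NOTE: the FIFO path and the 1s interval are assumptions for illustration.
FIFO=/tmp/faults.fifo
for _ in 1 2 3; do
  sleep 1
  echo "lazyfs::clear-cache" > "$FIFO"   # lose everything not yet fsync'd
done
```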
When losing un-fsync'd writes:
[2026-03-30 01:35:43.463] [global] [info] [lazyfs.fifo]: running LazyFS...
[2026-03-30 01:35:43.512] [global] [info] [lazyfs.faults.worker]: waiting for fault commands...
[2026-03-30 01:39:30.958] [global] [info] [lazyfs.faults.worker]: received 'lazyfs::clear-cache'
[2026-03-30 01:39:30.958] [global] [warning] [lazyfs.cmds]: clear cache request submitted...
[2026-03-30 01:39:30.963] [global] [warning] [lazyfs.cmds]: cache is cleared.
the server ultimately fails to respond to a watchdog liveness check:
spacetime sql --confirmed true --anonymous --server local --yes spacetimedb-db-name select * from table
so an attempt is made to restart the server:
spacetime start --pg-port pg-port --non-interactive
and attempts to restart the server fail due to lost writes to metadata.toml:
spacetimedb-standalone version: 2.1.0
spacetimedb-standalone path: /root/.local/share/spacetime/bin/2.1.0/spacetimedb-standalone
database running in data directory /root/.local/share/spacetime/data
...
Error: failed reading metadata.toml
Caused by:
TOML parse error at line 1, column 1
|
1 |
| ^
missing field `version`
Note that there are multiple files with lost data, but STDB startup always fails on the initial read/parse of metadata.toml.
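For context, and not a claim about SpacetimeDB's actual (Rust) implementation: the usual way to make a small file like metadata.toml crash-safe is write-temp, fsync, rename, fsync-the-directory, so readers only ever see the old or the new complete contents. A shell sketch of that pattern, with illustrative paths and contents (coreutils `sync FILE` issues an fsync on that file):

```shell
# Crash-safe replacement of a small metadata file (illustrative paths):
# 1. write the new contents to a temp file in the same directory
# 2. fsync the temp file so its contents are durable
# 3. atomically rename it over the target
# 4. fsync the directory so the rename itself is durable
dir=/tmp/stdb-atomic-demo
mkdir -p "$dir"
tmp="$dir/metadata.toml.tmp"
printf 'version = 1\n' > "$tmp"   # illustrative contents
sync "$tmp"                        # fsync the file data
mv -f "$tmp" "$dir/metadata.toml"  # atomic replace on the same filesystem
sync "$dir"                        # fsync the directory entry
```

With this pattern, losing un-fsync'd writes can cost you the newest version of the file, but never leaves an empty or half-written metadata.toml behind.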
A test run where STDB becomes unresponsive/hung due to losing un-fsync'd writes and is not restartable:
Simulated Power Glitch
To show that other essential files besides metadata.toml can lose data due to un-fsync'd writes, let's do a test that simulates a power glitch after we manually checkpoint lazyfs to persist the writes to metadata.toml:
- let db do some work, i.e. writes
- checkpoint the lazyfs filesystem
  - maybe triggered by application, os, etc.
  - writes the un-fsync'd data to the underlying filesystem
- let db do some work, i.e. writes
- 🌩️ power glitch
  - kill the db's process
  - lose-unfsynced-writes
- ☀️ power normal
- attempt to restart the db
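The glitch step can be sketched as a shell helper. Only the lazyfs::clear-cache command string comes from the logs above; the FIFO path, the pid-file handling, and the helper itself are assumptions for illustration:

```shell
# Hypothetical helper simulating the power glitch: kill the db abruptly,
# then tell LazyFS to throw away everything not yet fsync'd.
# Only the lazyfs::clear-cache command string is taken from the logs above;
# the paths and the function itself are illustrative.
power_glitch() {
  fifo=$1
  pidfile=$2
  kill -9 "$(cat "$pidfile")" 2>/dev/null || true  # SIGKILL the db process
  echo "lazyfs::clear-cache" > "$fifo"             # lose un-fsync'd writes
}
```

In the test, `☀️ power normal` is then followed by the same `spacetime start` invocation shown earlier.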
An example where the server cannot be restarted due to data corruption:
spacetimedb-standalone version: 2.1.0
spacetimedb-standalone path: /root/.local/share/spacetime/bin/2.1.0/spacetimedb-standalone
database running in data directory /root/.local/share/spacetime/data
...
2026-04-19T01:53:44.937989Z ERROR /home/runner/work/SpacetimeDB/SpacetimeDB/crates/client-api/src/lib.rs:623: internal error: error starting database: failed to init replica 1 for c2007b1126eb275777e8d4bf0c9bf379d291e1d8a658bc0554faf997ad2c0ee2: /root/.local/share/spacetime/data/replicas/1/clog/00000000000000000000.stdb.log [extracting segment metadata]: segment header does not start with magic: expected [28, 64, 73, 29, 5e, 32], got [01, 00, 00, 00, 00, 00]
...
... repeats until end of test
...
A test run simulating a power glitch where STDB is not restartable due to data corruption:
SIGTERM vs SIGKILL
Using SIGTERM (requesting a graceful shutdown) instead of SIGKILL to stop the SpacetimeDB process does not change the test results: some writes are still not fsync'd, and if those writes are lost, the server cannot be restarted.
As is apparent, I've been investing the space/time, 😸, to see what a Jepsen test for SpacetimeDB would look like.
It's a lot of work to do correctly and comprehensively, but a Jepsen test is always so valuable.
Thanks for SpacetimeDB!