un-fsync'd writes #4886

@nurturenature

Description

un-fsync'd writes

Hi,

SpacetimeDB is not fsync'ing all of its writes.

Using LazyFS, it's possible to run tests that lose un-fsync'd writes, which:

  • cause STDB to become unresponsive/hang
  • prevent STDB from being restarted due to file corruption

unsynced-data-report

Let's start by just periodically asking LazyFS to report on any un-fsync'd data.
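LazyFS takes its fault commands over a control FIFO (the `lazyfs.fifo` seen in the logs below). A minimal sketch of requesting the report, assuming the FIFO path from the LazyFS config is `/tmp/faults.fifo` (that path is an assumption, not taken from this test setup):

```shell
# Ask LazyFS to log a report of all data it has cached but not yet
# fsync'd to the underlying filesystem. The FIFO path is whatever the
# LazyFS config sets; the default used here is an assumption.
LAZYFS_FIFO="${LAZYFS_FIFO:-/tmp/faults.fifo}"
echo "lazyfs::unsynced-data-report" > "$LAZYFS_FIFO"
```

The report then shows up in the LazyFS log, as in the sample below.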

Here's a sample entry showing an un-fsync'd file:

[2026-04-02 03:04:29.797] [global] [info] [lazyfs.fifo]: running LazyFS...
[2026-04-02 03:04:29.873] [global] [info] [lazyfs.faults.worker]: waiting for fault commands...
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.faults.worker]: received 'lazyfs::unsynced-data-report'
[2026-04-02 03:06:51.899] [global] [warning] [lazyfs.cmds]: report request submitted...
[2026-04-02 03:06:51.899] [global] [warning] [lazyfs.cmds]: report generated.
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] is not fully synced, info:
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] info: (block 0) to (block 0) [byte index 0 to index 620]
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] files mapped to this inode:
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] => file: '/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/00000000000000000000.snapshot_bsatn'
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: total number of bytes un-fsynced: 621 bytes.
...

Now let's get a list of files from a test run that were not fsync'd:

grep "=> file:" unsynced-data-report.log | awk -F "file: " '{print $2}' | sort -u

'/root/.local/share/spacetime/data.lazyfs/data/config.toml'
'/root/.local/share/spacetime/data.lazyfs/data/control-db/conf'
'/root/.local/share/spacetime/data.lazyfs/data/logs/spacetime-standalone.log'
'/root/.local/share/spacetime/data.lazyfs/data/metadata.toml'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/clog/00000000000000000000.stdb.log'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/clog/00000000000000000000.stdb.ofs'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/module_logs/2026-04-19.log'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/00000000000000000000.snapshot_bsatn'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/16/8ac0675ea0741a07c3b2d6bde92f1464f8017929896035a62be0fab889fa13'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/35/363d07adb65d25ca55e3f951a5b70b75daa588520a9df38b5b11dfc40d70f3'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/3d/88776149b8100f9e751d9b361982173799b47a95eae398d5db47f189492905'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/47/94421cb4cd34574596f00507771ddb2075febc66a5283ba16e89875a4c3414'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/78/f2e8140cf5d713c997f3fffbf62932be883760b4182871165b98a974e3b977'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/7f/daf20309f427660a96330b203af22cf04cf5cd7c89977828441cfa04d92300'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/aa/88b3f62336775a03d48aff5085cd0e127d6ccf860fa93b50e21f4a0e7afc10'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/ac/b9d59241f0d3602da7427b2461b812b1e3e0d79c50fe6fbd1c17471597c21d'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/bc/342a9d872ed09549a5610d55cef4cea9dd3232b2453d7318a96115f5bac511'
'/root/.local/share/spacetime/data.lazyfs/data/replicas/1/snapshots/00000000000000000000.snapshot_dir/objects/ef/4fd96b1b9dfed141369b7cf0336a68f14416bbc104cf26290777dc3a3207ae'
'/root/.local/share/spacetime/data.lazyfs/data/spacetime.pid'

Sample unsynced-data-report.log.
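A harness can also scrape the totals line from the same report to assert that nothing is left unsynced. A sketch over an inlined excerpt (real runs would read the actual unsynced-data-report.log):

```shell
# Sum the "bytes un-fsynced" totals from a LazyFS report log.
# The inlined excerpt stands in for a real unsynced-data-report.log.
cat > report.log <<'EOF'
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: [inode 8660154] is not fully synced, info:
[2026-04-02 03:06:51.899] [global] [info] [lazyfs.cmds]: report: total number of bytes un-fsynced: 621 bytes.
EOF
total=$(grep -o 'un-fsynced: [0-9]* bytes' report.log | awk '{sum += $2} END {print sum + 0}')
echo "total un-fsynced bytes: $total"
```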


lose-unfsynced-writes

Now let's run a test that periodically loses un-fsync'd writes.

When losing un-fsync'd writes:

[2026-03-30 01:35:43.463] [global] [info] [lazyfs.fifo]: running LazyFS...
[2026-03-30 01:35:43.512] [global] [info] [lazyfs.faults.worker]: waiting for fault commands...
[2026-03-30 01:39:30.958] [global] [info] [lazyfs.faults.worker]: received 'lazyfs::clear-cache'
[2026-03-30 01:39:30.958] [global] [warning] [lazyfs.cmds]: clear cache request submitted...
[2026-03-30 01:39:30.963] [global] [warning] [lazyfs.cmds]: cache is cleared.

the server ultimately fails to respond to a watchdog liveness check:

spacetime sql --confirmed true --anonymous --server local --yes spacetimedb-db-name "select * from table"

so an attempt is made to restart the server:

spacetime start --pg-port pg-port --non-interactive

and attempts to restart the server fail due to lost writes to metadata.toml:

spacetimedb-standalone version: 2.1.0
spacetimedb-standalone path: /root/.local/share/spacetime/bin/2.1.0/spacetimedb-standalone
database running in data directory /root/.local/share/spacetime/data
...
Error: failed reading metadata.toml

Caused by:
    TOML parse error at line 1, column 1
      |
    1 | 
      | ^
    missing field `version`

Note that there are multiple files with lost data, but STDB startup always fails on the initial read/parse of metadata.toml.
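For context, the conventional way to keep a small file like metadata.toml from ever being read back empty is the write-temp, fsync, rename recipe: a crash at any point leaves either the old or the new file, never a truncated one. A shell sketch of the pattern (file contents are hypothetical; a real fix would do the equivalent in SpacetimeDB's Rust code):

```shell
# Crash-safe small-file update sketch: write a temp file, fsync it,
# atomically rename it into place, then fsync the directory entry.
dir=$(mktemp -d)
cd "$dir"
printf 'version = 2\n' > metadata.toml.tmp   # hypothetical contents
sync metadata.toml.tmp                # fsync the file data (GNU coreutils >= 8.24)
mv metadata.toml.tmp metadata.toml    # atomic rename on the same filesystem
sync .                                # fsync the containing directory
cat metadata.toml
```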

A test run where STDB becomes unresponsive/hung, due to losing un-fsync'd writes, and is not restartable:

(screenshot of the test run)

Simulated Power Glitch

To show that other essential files besides metadata.toml can lose data due to un-fsync'd writes, let's run a test that simulates a power glitch after manually checkpointing LazyFS to persist the writes to metadata.toml:

  • let db do some work, i.e. writes
  • checkpoint the lazyfs filesystem
    • maybe triggered by application, os, etc.
    • write un-fsync'd writes to the underlying filesystem
  • let db do some work, i.e. writes
  • 🌩️ power glitch
    • kill the db's process
    • lose-unfsynced-writes
  • ☀️ power normal
    • attempt to restart the db
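The sequence above can be modeled with a toy buffer standing in for the page cache (purely illustrative; `db_write`/`checkpoint`/`glitch` are hypothetical stand-ins for the real db, LazyFS checkpoint, and lose-unfsynced-writes fault):

```shell
# Toy model: "db_write" buffers data (un-fsync'd), "checkpoint" makes
# the buffer durable, "glitch" discards whatever was still buffered.
buffer="" disk=""
db_write()   { buffer="$buffer$1"; }              # un-fsync'd write
checkpoint() { disk="$disk$buffer"; buffer=""; }  # persist cached writes
glitch()     { buffer=""; }                       # power loss drops the cache

db_write "A,"; checkpoint    # A is durable after the checkpoint
db_write "B,"; glitch        # B was only in cache and is lost
echo "disk after restart: $disk"
```

Only the checkpointed write survives the glitch, which is why metadata.toml can be made durable while later writes to the clog and snapshot files are still lost.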

An example where the server cannot be restarted due to data corruption:

spacetimedb-standalone version: 2.1.0
spacetimedb-standalone path: /root/.local/share/spacetime/bin/2.1.0/spacetimedb-standalone
database running in data directory /root/.local/share/spacetime/data
...
2026-04-19T01:53:44.937989Z ERROR /home/runner/work/SpacetimeDB/SpacetimeDB/crates/client-api/src/lib.rs:623: internal error: error starting database: failed to init replica 1 for c2007b1126eb275777e8d4bf0c9bf379d291e1d8a658bc0554faf997ad2c0ee2: /root/.local/share/spacetime/data/replicas/1/clog/00000000000000000000.stdb.log [extracting segment metadata]: segment header does not start with magic: expected [28, 64, 73, 29, 5e, 32], got [01, 00, 00, 00, 00, 00]
...
... repeats until end of test
...

A test run simulating a power glitch where STDB is not restartable due to data corruption:

(screenshot of the test run)

SIGTERM vs SIGKILL

Using SIGTERM instead of SIGKILL to request a graceful shutdown of the SpacetimeDB process does not change the test results: some writes are still not fsync'd, and if they are lost, the server cannot be restarted.
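For reference, the shutdown tried here is the usual TERM-then-KILL escalation. A sketch with `sleep 300` standing in for the spacetimedb-standalone process (the PID handling is hypothetical harness code, not from the actual test):

```shell
# TERM-then-KILL escalation sketch. `sleep 300` stands in for the
# spacetimedb-standalone process under test.
sleep 300 &
pid=$!
kill -TERM "$pid"                        # request graceful shutdown
for _ in 1 2 3 4 5; do                   # give it up to ~5s to exit
    kill -0 "$pid" 2>/dev/null || break
    sleep 1
done
kill -KILL "$pid" 2>/dev/null || true    # force-kill if still alive
wait "$pid" 2>/dev/null || true          # reap; ignore the kill exit status
kill -0 "$pid" 2>/dev/null && echo "still running" || echo "stopped"
```

The point of the observation above is that even this graceful path leaves un-fsync'd data behind, so durability cannot depend on a clean shutdown.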


As is apparent, I've been investing the space/time 😸 to see what a Jepsen test for SpacetimeDB would look like.
It's a lot of work to do correctly and comprehensively, but a Jepsen test is always valuable.

Thanks for SpacetimeDB!
