Filesystems which use buffer-heads and cannot guarantee that there are no other references to the folio, for example by holding the folio lock, must use buffer_migrate_folio_norefs() as the address space mapping's migrate_folio() callback. Only 3 filesystems use this callback:
- the block device cache
- ext4 for its ext4_journalled_aops, i.e., jbd2
- nilfs2
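As a rough way to double check the list above against a given tree, the following sketch searches the block/ and fs/ directories of a kernel checkout for references to buffer_migrate_folio_norefs. The function name list_norefs_users and its directory argument are made up for illustration; only grep and sort are assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: list C files under <kernel tree>/block and
# <kernel tree>/fs that reference buffer_migrate_folio_norefs.
list_norefs_users() {
    kdir="${1:-.}"    # path to a Linux source checkout (assumption)
    # -r: recurse, -l: print only names of matching files,
    # --include: restrict the search to C source files
    grep -rl --include='*.c' 'buffer_migrate_folio_norefs' \
        "$kdir/block" "$kdir/fs" | sort
}
```

Calling list_norefs_users with the path to a checkout prints one file per line; the output depends on the kernel version being inspected.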
jbd2's use of this callback, however, is very race prone. Consider folio migration while reviewing jbd2_journal_write_metadata_buffer() and the fact that jbd2:
- does not hold the folio lock
- does not have the page writeback bit set
- does not lock the buffer
And so it can race with folio_set_bh() during folio migration. Commit ebdf4de5642fb6 ("mm: migrate: fix reference check race between __find_get_block() and migration") added a spin lock to prevent races with page migration, which ext4 users were reporting through SUSE bugzilla bnc#1137609.
Although we don't have exact traces of the original filesystem corruption, we can reproduce filesystem corruption on ext4 on Linus' tree today on v6.15-rc2, that is, with commit ebdf4de5642fb6 merged, by running generic/750 for about 20 hours on the ext4 2k block size filesystem profile.
This is easily reproducible with kdevops using:
make defconfig-ext4_2k SOAK_DURATION=432000
make -j128
make bringup
make linux
make fstests
make fstests-baseline TESTS="generic/750"
See the traces/ directory.
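To put the SOAK_DURATION value used above in perspective, here is a small arithmetic sketch. It assumes SOAK_DURATION is interpreted as a number of seconds; the helper name soak_hours is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a soak duration given in seconds
# (assumed unit) into whole hours using integer division.
soak_hours() {
    echo $(( $1 / 3600 ))
}
```

Under that assumption, soak_hours 432000 prints 120, that is 5 days, which leaves ample room beyond the roughly 20 hours it takes to hit the corruption.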
We now have a slew of traces collected for the possible ext4 corruptions; we've used ChatGPT to provide a summary of them:
do_writepages() # write back -->
ext4_map_blocks() # performs logical to physical block mapping -->
ext4_ext_insert_extent() # updates extent tree -->
jbd2_journal_dirty_metadata() # marks metadata as dirty for
# journaling. This can lead
# to any of the following hints
# as to what happened from
# ext4 / jbd2
- Directory and extent metadata corruption splats.
- Failure to handle out-of-space conditions gracefully, with
cascading metadata errors and eventual filesystem shutdown
to prevent further damage.
- Failure to journal new extent metadata during extent tree
growth, triggered under memory pressure or heavy writeback.
Commonly results in ENOSPC, journal abort, and read-only
fallback. **
- Journal metadata failure during extent tree growth causes
read-only fallback. Seen repeatedly on small-block (2k)
filesystems under stress (e.g. fsstress). Triggers errors in
bitmap and inode updates, and persists in journal replay logs.
"Error count since last fsck" shows long-term corruption
footprint.
Call trace (ENOSPC journal failure):
do_writepages()
→ ext4_do_writepages()
→ ext4_map_blocks()
→ ext4_ext_map_blocks()
→ ext4_ext_insert_extent()
→ __ext4_handle_dirty_metadata()
→ jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)
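To triage a run for the symptoms summarized above, a minimal sketch like the following can scan a saved kernel log. The function name scan_ext4_errors, the log file argument, and the exact message patterns are assumptions; kernel message wording can differ between versions, so adjust the patterns to what your kernel actually prints.

```shell
#!/bin/sh
# Hypothetical helper: print lines (with line numbers) from a saved
# kernel log that look like ext4/jbd2 failure signatures.
scan_ext4_errors() {
    log="$1"    # path to a saved dmesg or journal dump (assumption)
    # -E: extended regex, -i: ignore case, -n: prefix line numbers.
    # The patterns below are assumed wordings, not an exhaustive list.
    grep -Ein \
        'EXT4-fs error|Aborting journal|Remounting filesystem read-only|error count since last fsck' \
        "$log"
}
```

A saved log can be produced beforehand with something like dmesg > dmesg.log on the guest; an empty result only means none of the assumed patterns matched, not that the filesystem is healthy.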
And so jbd2 still needs more work to avoid races with folio migration.