╔═╗┬─┐┌─┐┌┐┌┬┌─┌─┐┌┐┌╔═╗╔═╗
╠╣ ├┬┘├─┤│││├┴┐├┤ │││╠╣ ╚═╗
╚ ┴└─┴ ┴┘└┘┴ ┴└─┘┘└┘╚ ╚═╝
Memory-safe ext4 + btrfs in Rust, from userspace
Block-level MVCC · RaptorQ self-healing · Adaptive conflict arbitration · Zero unsafe code
The problem: Linux filesystems are trapped in kernel space. ext4 descends from a design lineage that is roughly 30 years old and has a global journal lock (JBD2) that serializes all writes. btrfs has better internals but remains kernel-only, hard to test, and impossible to extend from userspace. Both lack automatic corruption recovery; you run fsck after the fact and hope.
The solution: FrankenFS extracts the behavior of ext4 and btrfs from ~205K lines of Linux kernel C (v6.19) and re-implements it idiomatically in Rust as a FUSE filesystem. The tracked V1 parity matrix is complete, and the current runtime can read real ext4/btrfs disk images and mount both in experimental mode (default read-only, optional --rw) while operational hardening continues.
| What | How | Why it matters |
|---|---|---|
| Block-level MVCC | Version chains per block, snapshot isolation, adaptive conflict policy (Strict/SafeMerge/Adaptive with expected-loss decision model) | Concurrent readers + writers without the JBD2 global lock. Safe-merge proofs allow non-conflicting concurrent writes to the same block. |
| RaptorQ self-healing | Fountain-coded repair symbols (RFC 6330), Bayesian durability autopilot, adaptive refresh (age + block-count hybrid trigger), scrub-and-recover pipeline | Corruption can be detected and repaired via the ffs repair / ffs fsck CLI path today; ffs mount now owns a detection-only ScrubDaemon lifecycle for read-only mounts by default, with explicit --background-scrub / --no-background-scrub controls. Stale-window SLO monitoring. |
| Writeback-cache readiness | Epoch-based commit barriers with per-inode deferred visibility, 12-scenario crash consistency proof | Future FUSE writeback-cache enablement without violating MVCC snapshot isolation or durability guarantees. |
| Memory safety | #![forbid(unsafe_code)] at every crate root, Rust 2024 edition | Eliminates the buffer overflows and use-after-free bugs that plague kernel C filesystem code. |
| Userspace FUSE | Runs as a normal process via FUSE | Debug with standard tools. No kernel module loading. No reboot-on-crash. |
# Clone and build
git clone https://github.com/Dicklesworthstone/frankenfs.git
cd frankenfs
cargo build --workspace
# Inspect an ext4 image
cargo run -p ffs-cli -- inspect /path/to/ext4.img --json
# Show filesystem superblock + optional detailed sections
cargo run -p ffs-cli -- info /path/to/ext4.img --groups --mvcc --journal --json
# Inspect a btrfs image
cargo run -p ffs-cli -- inspect /path/to/btrfs.img --json
# Run conformance checks against real filesystem images
cargo run -p ffs-harness -- check-fixtures
cargo run -p ffs-harness -- parity
# One-command self-healing adoption wedge (no FUSE, temp raw image)
cargo run --bin ffs-demo -- self-healing
# Full CI gate
cargo fmt --check
cargo check --all-targets
cargo clippy --all-targets -- -D warnings
cargo test --workspace
FrankenFS does not translate C line-by-line. The porting doctrine is:
- Extract behavior from legacy kernel code into structured spec documents
- Design idiomatic Rust architecture from the spec
- Implement from the spec (not by copying C control flow)
- Validate via conformance harness against real filesystem images
This produces code that is Rust-native rather than "C with Rust syntax."
Every I/O operation takes an &asupersync::Cx capability context. This enables cooperative cancellation, deadline propagation, and deterministic testing under a lab runtime. No global state, no hidden singletons.
For high-risk subsystems, FrankenFS uses principled decision models rather than tuned constants:
- MVCC conflict resolution: Expected-loss decision rule selects between Strict FCW and SafeMerge based on observed contention (EMA-tracked conflict rate, merge success rate, abort rate)
- Repair symbol overhead: Bayesian Beta posterior over per-block corruption probability, minimizing P(unrecoverable) * data_loss_cost + overhead * storage_cost
- Repair refresh triggers: Expected-loss comparison of age-only vs block-count vs hybrid policies across workload profiles, with decision boundary identification
- Writeback-cache policy: Expected-loss decision matrix scoring semantic violation probability vs operational cost
If a heuristic must be used, the spec documents why formal alternatives were not viable.
Parser crates are pure (no I/O). MVCC knows nothing about files. FUSE knows nothing about on-disk formats. Repair operates on blocks, not inodes. Each concern lives in exactly one crate.
#![forbid(unsafe_code)] is set at every crate root and enforced as a workspace lint. There are no exceptions and no plans for exceptions.
| | FrankenFS | Linux ext4 (kernel) | Linux btrfs (kernel) | ext4fuse | fuse-ext2 |
|---|---|---|---|---|---|
| Language | Rust | C | C | C | C |
| Runs in | Userspace (FUSE) | Kernel | Kernel | Userspace (FUSE) | Userspace (FUSE) |
| Memory safety | forbid(unsafe_code) | Manual | Manual | Manual | Manual |
| ext4 support | Read + experimental write | Full | N/A | Read-only | Read-write |
| btrfs support | Read + experimental write | N/A | Full | N/A | N/A |
| Both formats | Yes | No | No | No | No |
| Concurrent writes | MVCC with adaptive policy | JBD2 (global lock) | COW B-tree | N/A | Single-writer |
| Self-healing | RaptorQ + Bayesian autopilot | None (run fsck) | Scrub + mirrors | None | None |
| Conflict resolution | Safe-merge proofs + expected-loss | N/A | N/A | N/A | N/A |
| Debuggable | Standard userspace tools | printk + crash dump | printk + crash dump | gdb | gdb |
FrankenFS is a 21-crate Cargo workspace with a strict DAG dependency graph:
Layer 1 (Foundation): [ffs-types] [ffs-error]
\ /
Layer 2 (On-disk): [ffs-ondisk] [ffs-mvcc]
/ | \ |
Layer 3 (Storage): [ffs-block] [ffs-btree] [ffs-xattr]--+
(+ ARC) |
Layer 4 (Alloc): [ffs-alloc]
|
Layer 5 (Mid): [ffs-journal] [ffs-repair] [ffs-extent] [ffs-inode]
|
Layer 6 (Dir): [ffs-dir]
Layer 7 (Core): [ffs-core] <-- orchestrates everything
/ \
Layer 8 (Interface): [ffs-fuse] [ffs] (public facade)
/ | \
Layer 9 (Tooling): [ffs-cli] [ffs-tui] [ffs-harness]
| Layer | Crates | What it does |
|---|---|---|
| Foundation | ffs-types, ffs-error | Newtypes (BlockNumber, InodeNumber, TxnId), 14-variant error enum, errno mappings |
| On-disk | ffs-ondisk | Pure parsing of ext4 + btrfs superblocks, group descriptors, inodes, extents, B-tree headers. No I/O. |
| Storage | ffs-block, ffs-journal, ffs-mvcc | Block I/O with ARC cache, JBD2-compatible journal replay, MVCC version chains with snapshot isolation, adaptive conflict policy (Strict/SafeMerge/Adaptive), merge-proof resolution, sharded concurrent store, WAL persistence |
| Tree / Alloc | ffs-btree, ffs-alloc, ffs-extent | B+tree search/insert/split/merge, mballoc-style multi-block allocator (buddy system), extent mapping (logical-to-physical) |
| Namespace | ffs-inode, ffs-dir, ffs-xattr | Inode lifecycle, directory ops (linear scan + htree), extended attributes (user/system/security/trusted) |
| Interface | ffs-fuse, ffs-core, ffs | FUSE protocol adapter, engine integration (format detection, mount orchestration, writeback epoch barrier, Bayesian durability autopilot), public API facade |
| Repair | ffs-repair | RaptorQ symbol generation/recovery, background scrub, adaptive refresh (age + block-count hybrid), stale-window SLO monitoring, expected-loss policy comparison, multi-host ownership coordination |
| Tooling | ffs-cli, ffs-tui, ffs-harness | CLI (inspect, info, dump, fsck, repair, mount, scrub, parity, evidence), live TUI monitoring, conformance test harness + benchmarks + metrics framework |
- Parser crates are pure. ffs-ondisk performs no I/O. It parses byte slices into typed structures.
- MVCC is transport-agnostic. ffs-mvcc knows nothing about FUSE, files, or directories.
- FUSE delegates to FsOps. ffs-fuse maps FUSE protocol to an ffs-core::FsOps implementation (currently OpenFs) and contains no filesystem logic.
- Repair is orthogonal. ffs-repair operates on blocks, not files. It doesn't know about inodes or directories.
- Repair wiring is lifecycle-based. ffs-core reaches repair functionality via ffs-mvcc/block flush integration rather than a direct ffs-core -> ffs-repair dependency edge.
- No dependency cycles. The crate graph is a strict DAG.
- Cx everywhere. Any operation that performs I/O or may block takes &asupersync::Cx as its first parameter.
userspace read(fd, buf, count)
-> kernel FUSE -> fuser -> ffs-fuse::read()
-> ffs-core FsOps (OpenFs): flavor dispatch (ext4/btrfs)
-> extent/chunk mapping + block reads (ffs-extent, ffs-btree, ffs-block)
-> flavor-specific inode/file assembly in ffs-core
-> fuser -> kernel -> userspace
userspace write(fd, buf, count)
-> kernel FUSE -> fuser -> ffs-fuse::write()
-> ffs-core FsOps (OpenFs): flavor dispatch (ext4/btrfs), requires mount --rw
-> allocation + extent/tree updates (ffs-alloc, ffs-extent, ffs-btree)
-> block writes (ffs-block) and filesystem-level metadata updates
-> MVCC commit with adaptive conflict policy (merge-proof resolution)
-> journal/repair integration paths where enabled by operation
-> ffs-core: return bytes written
-> fuser -> kernel -> userspace
ffs-repair::scrub() [background]
-> ffs-block: read all blocks in group
-> checksum verification (crc32c or BLAKE3)
-> MISMATCH on block N
-> ffs-repair: load repair symbols
-> asupersync RaptorQ decode
-> recovered block data
-> ffs-block: write corrected block
-> ffs-repair: refresh symbols (hybrid age + block-count trigger)
-> report: { block: N, status: recovered }
Traditional FUSE filesystems serialize all writes through a single lock. FrankenFS eliminates this bottleneck with block-level Multi-Version Concurrency Control (MVCC) and a novel safe-merge system that allows non-conflicting concurrent writes to the same block.
Every logical block maintains a version chain: an ordered sequence of BlockVersion entries, each tagged with a CommitSeq and a writer TxnId. Readers acquire a snapshot (Snapshot { high: CommitSeq }) and see only versions with commit_seq <= snapshot.high. Writers accumulate staged writes in a Transaction and attempt to commit atomically.
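The visibility rule above can be sketched in a few lines. This is a minimal illustration, not the ffs-mvcc implementation: the type names mirror the ones mentioned in the text, but the field layout, the `VersionChain` container, and its method names are assumptions made for the example.

```rust
/// Monotonic commit sequence number (stand-in for the document's `CommitSeq`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct CommitSeq(pub u64);

/// Transaction identifier (stand-in for the document's `TxnId`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TxnId(pub u64);

/// One committed version of a block. The field set is assumed.
#[derive(Debug, Clone)]
pub struct BlockVersion {
    pub commit_seq: CommitSeq,
    pub writer: TxnId,
    pub data: Vec<u8>,
}

/// A reader's view: everything committed at or before `high` is visible.
#[derive(Debug, Clone, Copy)]
pub struct Snapshot {
    pub high: CommitSeq,
}

/// Hypothetical version chain, kept sorted by ascending `commit_seq`.
#[derive(Debug, Default)]
pub struct VersionChain {
    versions: Vec<BlockVersion>,
}

impl VersionChain {
    /// Append a newly committed version; the chain must stay ordered.
    pub fn push(&mut self, v: BlockVersion) {
        debug_assert!(self
            .versions
            .last()
            .map_or(true, |last| last.commit_seq < v.commit_seq));
        self.versions.push(v);
    }

    /// Newest version with `commit_seq <= snapshot.high`, if any.
    pub fn visible_at(&self, snapshot: Snapshot) -> Option<&BlockVersion> {
        self.versions
            .iter()
            .rev()
            .find(|v| v.commit_seq <= snapshot.high)
    }

    /// Conflict check at commit time: did anyone commit after the snapshot?
    pub fn modified_since(&self, snapshot: Snapshot) -> bool {
        self.versions
            .last()
            .map_or(false, |v| v.commit_seq > snapshot.high)
    }
}
```

A writer whose snapshot is `high = 3` still reads the version committed at sequence 1 even after another writer commits at sequence 5, and `modified_since` is what tells it a conflict check is needed at commit time.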
When a writer commits and discovers that a block it wrote has been modified since its snapshot, the default response is to abort (First-Committer-Wins). But many concurrent writes don't actually conflict at the byte level. Two writers might be appending to different regions of the same block, or updating disjoint metadata fields in the same inode block.
FrankenFS introduces merge proofs, structured evidence that two writes can be combined without data loss:
| Merge Proof | Use case | Merge strategy |
|---|---|---|
| AppendOnly { base_len } | Log-structured appends, directory entry additions | Concatenate: keep the committed writer's prefix, append the new writer's tail |
| IndependentKeys { touched_ranges } | Disjoint metadata field updates | Overlay: copy each writer's byte ranges onto the committed base |
| NonOverlappingExtents { touched_ranges } | Extent tree updates to different file regions | Same overlay strategy, scoped to extent blocks |
| TimestampOnlyInode { touched_ranges } | Concurrent setattr on different inode timestamp fields | Same overlay, validated for inode-specific byte layouts |
| DisjointBlocks | Transactions touching completely different blocks | Trivially non-conflicting (no same-block overlap) |
| Unsafe | No proof available | Always aborts on conflict (FCW fallback) |
The merge algorithm in MergeProof::merge_bytes() takes three inputs: base (the version at the writer's snapshot), latest (the currently committed version), and staged (the writer's proposed bytes). It validates that the proof's byte ranges are pairwise disjoint and that the committed writer didn't modify any of the same ranges, then produces the merged result.
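The overlay-style merge can be illustrated with a small three-way function. This is a sketch under stated assumptions rather than the body of MergeProof::merge_bytes(): the function name `merge_overlay`, the `MergeError` enum, and the use of `Range<usize>` for touched ranges are all hypothetical.

```rust
use std::ops::Range;

/// Reasons the sketch refuses to merge (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub enum MergeError {
    LengthMismatch,
    RangeOutOfBounds,
    OverlappingRanges,
    CommittedWriterTouchedRange,
}

/// Overlay-style three-way merge.
/// `base` is the block at the writer's snapshot, `latest` is the currently
/// committed block, `staged` is the writer's proposed block, and
/// `touched_ranges` are the byte ranges the writer claims to have modified.
pub fn merge_overlay(
    base: &[u8],
    latest: &[u8],
    staged: &[u8],
    touched_ranges: &[Range<usize>],
) -> Result<Vec<u8>, MergeError> {
    if base.len() != latest.len() || base.len() != staged.len() {
        return Err(MergeError::LengthMismatch);
    }

    // Validate bounds before any slicing so malformed ranges cannot panic.
    let mut ranges: Vec<Range<usize>> = touched_ranges.to_vec();
    ranges.sort_by_key(|r| r.start);
    for r in &ranges {
        if r.start > r.end || r.end > base.len() {
            return Err(MergeError::RangeOutOfBounds);
        }
    }

    // The proof's ranges must be pairwise disjoint.
    for pair in ranges.windows(2) {
        if pair[0].end > pair[1].start {
            return Err(MergeError::OverlappingRanges);
        }
    }

    // Start from the committed bytes and overlay the writer's ranges,
    // but only where the committed writer left the base untouched.
    let mut merged = latest.to_vec();
    for r in &ranges {
        if base[r.clone()] != latest[r.clone()] {
            return Err(MergeError::CommittedWriterTouchedRange);
        }
        merged[r.clone()].copy_from_slice(&staged[r.clone()]);
    }
    Ok(merged)
}
```

If one writer changed byte 0 and another changed byte 4 of the same block, the merged result carries both changes; if both touched byte 0, the function returns an error and the caller falls back to abort.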
Rather than hardcoding a single strategy, FrankenFS uses an expected-loss decision model to choose between Strict (pure FCW) and SafeMerge at runtime:
E[loss_strict] = conflict_rate * abort_cost
E[loss_safe_merge] = P(corruption) * severity + conflict_rate * (1 - merge_success_rate) * abort_cost
Three EMA-smoothed metrics drive the decision:
- Conflict rate: fraction of commits that encounter a newer version (0.0 = no conflicts, 1.0 = every commit conflicts)
- Merge success rate: fraction of conflicts resolved by merge proof (vs. abort)
- Abort rate: fraction of commits that are aborted
During a configurable warmup period (default: 20 commits), the system defaults to SafeMerge. After warmup, the Adaptive policy selects whichever strategy has the lower expected loss. Under a 120-writer stress test, SafeMerge achieved 9.5x lower expected loss than Strict with zero data corruption.
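The two expected-loss formulas and the warmup rule translate directly into code. The sketch below is illustrative only: `Ema`, `LossParams`, `Strategy`, and `choose_strategy` are hypothetical names, and the cost parameters are inputs chosen by the caller rather than values taken from FrankenFS.

```rust
/// Which strategy the adaptive policy would apply (names follow the text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Strict,
    SafeMerge,
}

/// Exponential moving average; `alpha` weights the newest sample.
#[derive(Debug, Clone, Copy)]
pub struct Ema {
    value: f64,
    alpha: f64,
}

impl Ema {
    pub fn new(initial: f64, alpha: f64) -> Self {
        Self { value: initial, alpha }
    }
    pub fn observe(&mut self, sample: f64) {
        self.value = self.alpha * sample + (1.0 - self.alpha) * self.value;
    }
    pub fn get(&self) -> f64 {
        self.value
    }
}

/// Cost model inputs (assumed tunables, not values from the project).
#[derive(Debug, Clone, Copy)]
pub struct LossParams {
    pub abort_cost: f64,
    pub corruption_prob: f64,
    pub corruption_severity: f64,
}

/// E[loss_strict] = conflict_rate * abort_cost
pub fn expected_loss_strict(conflict_rate: f64, p: &LossParams) -> f64 {
    conflict_rate * p.abort_cost
}

/// E[loss_safe_merge] = P(corruption) * severity
///                    + conflict_rate * (1 - merge_success_rate) * abort_cost
pub fn expected_loss_safe_merge(conflict_rate: f64, merge_success_rate: f64, p: &LossParams) -> f64 {
    p.corruption_prob * p.corruption_severity
        + conflict_rate * (1.0 - merge_success_rate) * p.abort_cost
}

/// SafeMerge during warmup, then whichever strategy has lower expected loss.
pub fn choose_strategy(
    commits_seen: u64,
    warmup_commits: u64,
    conflict_rate: f64,
    merge_success_rate: f64,
    p: &LossParams,
) -> Strategy {
    if commits_seen < warmup_commits {
        return Strategy::SafeMerge;
    }
    let strict = expected_loss_strict(conflict_rate, p);
    let merge = expected_loss_safe_merge(conflict_rate, merge_success_rate, p);
    if merge <= strict {
        Strategy::SafeMerge
    } else {
        Strategy::Strict
    }
}
```

With a conflict rate of 0.5, a merge success rate of 0.9, and an abort cost of 10, Strict costs 5.0 while SafeMerge costs about 0.5 plus the corruption term, so SafeMerge wins unless the assumed corruption risk is large.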
For multi-threaded workloads, ShardedMvccStore partitions version chains across N shards (one RwLock<MvccShard> each). Writers to different block ranges proceed without contention. Multi-shard transactions acquire locks in sorted order to prevent deadlocks, and the commit sequence is a lock-free AtomicU64.
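The sorted-lock-order idea can be shown with standard-library primitives. This is a simplified sketch, not ShardedMvccStore itself: the `ShardedStore` type, the modulo shard mapping, and the method names are assumptions, and conflict detection is left out entirely.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

/// One shard: block number -> list of (commit_seq, data), oldest first.
type Shard = HashMap<u64, Vec<(u64, Vec<u8>)>>;

/// Hypothetical sharded store with one RwLock per shard.
pub struct ShardedStore {
    shards: Vec<RwLock<Shard>>,
    commit_seq: AtomicU64,
}

impl ShardedStore {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        Self {
            shards: (0..shard_count).map(|_| RwLock::new(Shard::new())).collect(),
            commit_seq: AtomicU64::new(0),
        }
    }

    /// Assumed mapping: block number modulo shard count.
    fn shard_of(&self, block: u64) -> usize {
        (block % self.shards.len() as u64) as usize
    }

    /// Commit a set of block writes atomically and return the commit sequence.
    pub fn commit(&self, writes: &[(u64, Vec<u8>)]) -> u64 {
        // BTreeSet iterates in ascending order, so every transaction takes
        // shard locks in the same global order and cannot deadlock.
        let wanted: BTreeSet<usize> = writes.iter().map(|(b, _)| self.shard_of(*b)).collect();
        let mut guards = BTreeMap::new();
        for &idx in &wanted {
            guards.insert(idx, self.shards[idx].write().expect("shard lock poisoned"));
        }

        // The sequence itself is allocated without taking any lock.
        let seq = self.commit_seq.fetch_add(1, Ordering::SeqCst) + 1;
        for (block, data) in writes {
            let shard = guards
                .get_mut(&self.shard_of(*block))
                .expect("lock for this shard is held");
            shard.entry(*block).or_default().push((seq, data.clone()));
        }
        seq
    }

    /// Latest committed version of a block, if any.
    pub fn latest(&self, block: u64) -> Option<(u64, Vec<u8>)> {
        let shard = self.shards[self.shard_of(block)]
            .read()
            .expect("shard lock poisoned");
        shard.get(&block).and_then(|chain| chain.last().cloned())
    }
}
```

Two transactions that touch shards {1, 2} and {2, 1} both lock shard 1 first, which is what rules out the classic lock-ordering deadlock.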
FrankenFS can detect corruption during scrub cycles and recover corrupted data from fountain-coded repair symbols through the explicit ffs repair / ffs fsck --repair paths. The ffs mount path also owns a detection-only background scrub lifecycle for read-only mounts, so mount-time monitoring can surface corruption without mutating image data or repair-symbol state.
Each block group stores a configurable overhead of repair symbols alongside its source data blocks. RaptorQ is a rateless erasure code: given K source blocks, it generates as many repair symbols as needed. Any K of the combined source + repair symbols are sufficient to recover all K source blocks with high probability, and one or two extra symbols make decode failure vanishingly unlikely. This means FrankenFS can recover from arbitrary corruption patterns as long as the number of lost blocks in a group doesn't exceed the number of repair symbols stored for it.
The repair symbol overhead isn't a fixed constant. The DurabilityAutopilot maintains a Beta posterior distribution over the per-block corruption probability, updated from every scrub cycle observation:
posterior ~ Beta(alpha + corrupted, beta + clean)
The optimal overhead minimizes expected loss:
E[loss] = P(unrecoverable | overhead) * data_loss_cost + overhead * storage_cost
P(unrecoverable | overhead) is the Beta-Binomial tail probability that more than overhead * source_blocks blocks are simultaneously corrupted. The autopilot grid-searches the [min_overhead, max_overhead] range (default 3%--10%) for the minimum, with a 2x multiplier for metadata-critical groups.
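A simplified version of the overhead search is sketched below. It is not the DurabilityAutopilot: to keep the example short it plugs the posterior mean into a plain Binomial tail instead of evaluating the full Beta-Binomial tail, and it omits the metadata multiplier. The names `BetaPosterior`, `binomial_tail`, and `choose_overhead` are hypothetical.

```rust
/// Beta posterior over the per-block corruption probability.
#[derive(Debug, Clone, Copy)]
pub struct BetaPosterior {
    pub alpha: f64,
    pub beta: f64,
}

impl BetaPosterior {
    pub fn new(alpha: f64, beta: f64) -> Self {
        Self { alpha, beta }
    }

    /// posterior ~ Beta(alpha + corrupted, beta + clean)
    pub fn observe_scrub(&mut self, corrupted: u64, clean: u64) {
        self.alpha += corrupted as f64;
        self.beta += clean as f64;
    }

    pub fn mean(&self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }
}

/// P(X > threshold) for X ~ Binomial(n, p), computed by summing the pmf.
pub fn binomial_tail(n: u64, p: f64, threshold: u64) -> f64 {
    if p <= 0.0 {
        return 0.0;
    }
    if p >= 1.0 {
        return if threshold < n { 1.0 } else { 0.0 };
    }
    let ratio = p / (1.0 - p);
    let mut pmf = (1.0 - p).powi(n as i32); // P(X = 0)
    let mut cdf = 0.0;
    for k in 0..=threshold.min(n) {
        cdf += pmf;
        // Recurrence: P(X = k + 1) = P(X = k) * (n - k) / (k + 1) * p / (1 - p)
        pmf *= (n - k) as f64 / (k + 1) as f64 * ratio;
    }
    (1.0 - cdf).max(0.0)
}

/// Grid-search [min_overhead, max_overhead] for the lowest expected loss.
pub fn choose_overhead(
    posterior: &BetaPosterior,
    source_blocks: u64,
    min_overhead: f64,
    max_overhead: f64,
    steps: u32,
    data_loss_cost: f64,
    storage_cost: f64,
) -> f64 {
    assert!(steps > 0, "need at least one grid step");
    let p = posterior.mean(); // simplification: point estimate, not the full posterior
    let mut best = (min_overhead, f64::INFINITY);
    for i in 0..=steps {
        let overhead = min_overhead + (max_overhead - min_overhead) * i as f64 / steps as f64;
        let tolerated = (overhead * source_blocks as f64).floor() as u64;
        let loss = binomial_tail(source_blocks, p, tolerated) * data_loss_cost
            + overhead * storage_cost;
        if loss < best.1 {
            best = (overhead, loss);
        }
    }
    best.0
}
```

When data loss is cheap relative to storage the search settles on the minimum overhead, and when the posterior mean approaches the overhead range the search moves toward the maximum.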
Repair symbols become stale when source blocks are modified. FrankenFS supports four refresh policies:
| Policy | Trigger | Best for |
|---|---|---|
| Eager | Every write | Metadata groups (can't afford stale symbols) |
| Lazy | Age timeout (default 30s) or scrub cycle | Data groups under light writes |
| Adaptive | Switches Eager/Lazy based on corruption posterior | Groups with variable risk |
| Hybrid | First of: age timeout OR block-count threshold | Write-heavy groups needing tight staleness bounds |
The RefreshLossModel formally compares these policies using expected-loss calculations:
E[loss_age_only] = crash_rate * avg_stale_fraction * corruption_prob * data_loss_cost + refresh_io_cost / staleness_timeout
E[loss_block_count] = crash_rate * avg_stale_fraction * corruption_prob * data_loss_cost + refresh_io_cost * write_rate / threshold
E[loss_hybrid] = crash_rate * avg_stale_fraction * corruption_prob * data_loss_cost + refresh_io_cost / effective_window
Under heavy writes, the Hybrid policy achieves 83.3% reduction in p95 stale-window age compared to age-only, because the block-count trigger caps staleness at ~500 writes regardless of how fast they arrive.
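The hybrid trigger and its effect on the staleness window can be sketched as follows. The types and function names here are hypothetical, the Adaptive policy and the scrub-cycle trigger for Lazy are omitted, and the formula for the effective window (the smaller of the age timeout and the time needed to accumulate the block-count threshold) is an assumption about what `effective_window` means in the formulas above.

```rust
use std::time::Duration;

/// Subset of the refresh policies described in the table.
#[derive(Debug, Clone, Copy)]
pub enum RefreshPolicy {
    Eager,
    Lazy { max_age: Duration },
    Hybrid { max_age: Duration, max_writes: u64 },
}

/// Per-group staleness telemetry (assumed shape).
#[derive(Debug, Clone, Copy)]
pub struct Staleness {
    pub age: Duration,
    pub writes_since_refresh: u64,
}

/// Decide whether a group's repair symbols should be regenerated now.
pub fn should_refresh(policy: RefreshPolicy, s: Staleness) -> bool {
    if s.writes_since_refresh == 0 {
        return false; // symbols are still fresh
    }
    match policy {
        RefreshPolicy::Eager => true,
        RefreshPolicy::Lazy { max_age } => s.age >= max_age,
        RefreshPolicy::Hybrid { max_age, max_writes } => {
            s.age >= max_age || s.writes_since_refresh >= max_writes
        }
    }
}

/// Assumed definition: whichever trigger fires first bounds the window.
pub fn effective_window_secs(max_age_secs: f64, max_writes: u64, write_rate_per_sec: f64) -> f64 {
    if write_rate_per_sec <= 0.0 {
        return max_age_secs;
    }
    max_age_secs.min(max_writes as f64 / write_rate_per_sec)
}
```

As an illustration with made-up inputs (not the project's benchmark configuration), a 30s timeout, a 500-write threshold, and 100 writes per second give an effective window of 5s, which is a reduction of about 83% relative to the age-only window.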
The StaleWindowSlo provides percentile-based breach detection: a configurable SLO (default: p95 groups must have staleness < 60s AND < 5000 writes) is continuously evaluated against per-group telemetry. When breached, a structured repair_stale_window_slo_breach event is emitted with the offending percentile values, group counts, and threshold details.
FUSE kernel writeback-cache mode improves throughput by batching and reordering daemon write requests. This creates a tension with MVCC snapshot isolation: if writes arrive out of order, a reader might see a newer write before an older one that the application issued first.
| Scenario | Risk |
|---|---|
| Disjoint write batching | Request order becomes de facto MVCC order; swapped delivery breaks commit sequencing |
| Adjacent write merge | MVCC sees fewer mutation boundaries than the application issued |
| Delayed page writeback | Metadata ops commit against stale snapshots that exclude acknowledged data |
| Metadata overtakes data | Namespace durability overtakes data durability |
| Flush before writeback | V1 contract says flush is non-durable; must not advance visible state |
| Fsync with pending writeback | Fsync acknowledgment would overstate what is actually committed |
FrankenFS tracks three monotonically advancing epoch counters per inode:
staged_epoch >= visible_epoch >= durable_epoch
- Staged: dirty pages have arrived from the kernel
- Visible: committed to MVCC, admissible for snapshot readers
- Durable: synced to stable storage
Writes are staged into the current global epoch. Only fsync / fsyncdir advances visibility and durability. flush remains a non-durable lifecycle hook. Cross-epoch reordering is forbidden by construction.
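The three-counter state machine is small enough to sketch directly. This is an illustrative model rather than the ffs-core InodeEpochState: the field and method names are assumptions, and fsync is modeled as a commit step (advance visible) followed by a sync step (advance durable) so that a crash between the two can be represented.

```rust
/// Per-inode epoch counters with staged >= visible >= durable.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct InodeEpochState {
    pub staged: u64,
    pub visible: u64,
    pub durable: u64,
}

impl InodeEpochState {
    /// Dirty data arrived from the kernel during `current_epoch`.
    pub fn stage(&mut self, current_epoch: u64) {
        self.staged = self.staged.max(current_epoch);
    }

    /// flush is a non-durable lifecycle hook: it changes nothing.
    pub fn flush(&mut self) {}

    /// Commit to MVCC: visibility advances to what was staged,
    /// never past it, even if the global epoch is higher.
    pub fn commit(&mut self) {
        self.visible = self.staged;
    }

    /// Device sync completed: everything visible is now durable.
    pub fn sync(&mut self) {
        self.durable = self.visible;
    }

    /// fsync / fsyncdir: commit, then sync.
    pub fn fsync(&mut self) {
        self.commit();
        self.sync();
    }

    /// Crash recovery: fall back to the last durable epoch.
    pub fn recover(&mut self) {
        self.staged = self.durable;
        self.visible = self.durable;
    }

    pub fn invariant_holds(&self) -> bool {
        self.staged >= self.visible && self.visible >= self.durable
    }
}
```

Staging at epoch 3 followed by flush leaves visible and durable at 0; fsync then moves all three counters to 3, and a crash after further staging at epoch 5 recovers the inode to 3.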
The design specifies six invariants (I1--I6) that any future writeback-cache enablement must preserve, each backed by an executable checker:
- Snapshot Visibility Boundary: readers see only epochs that crossed the daemon visibility barrier
- Alias Order Preservation: writes to the same logical block preserve source order within an epoch
- Metadata-After-Data Dependency: metadata ops that depend on earlier data must not become visible first
- Sync Boundary Completeness: fsync/fsyncdir acknowledges only fully delivered + committed + synced epochs
- Flush Non-Durability: flush never advances visible or durable epoch
- Cross-Epoch Order: reordering may occur only within a single barrier epoch
The crash matrix exercises every combination of crash timing against the epoch state machine:
| # | Crash point | What survives |
|---|---|---|
| 1 | During buffered write (before commit) | Nothing; staged data lost |
| 2 | After commit, before device sync | Nothing; visible but not durable |
| 3 | After fsync completes | Everything; fully durable |
| 4 | During epoch advance | Previous epoch survives, new epoch lost |
| 5 | Concurrent inodes, partial sync | Only fsynced inodes survive |
| 6 | Multiple writes in single epoch | All lost if not committed |
| 7 | Fsync, more writes, crash | Fsynced data survives, post-fsync writes lost |
| 8 | Interleaved 3-inode epochs | Each inode recovers to its own durable epoch |
| 9 | Rapid epoch advances without writes | Only the last fsynced epoch matters |
| 10 | Commit at higher epoch than staged | Visibility advances to staged (not current) |
| 11 | Disabled barrier | Trivially consistent (no state tracked) |
| 12 | Complex multi-round sequence | Only fsynced inodes have durable data |
Recovery resets each inode to staged = visible = durable = last_durable_epoch. The invariant visible == durable is verified after every recovery, proving no partial epochs leak.
FrankenFS maintains a machine-readable audit trail for every significant decision across all subsystems.
The evidence ledger is an append-only JSONL file where each line is a self-contained EvidenceRecord with a nanosecond timestamp, event type, block group, and event-specific detail payload. The 23 event types span the full lifecycle:
| Category | Events |
|---|---|
| Corruption & Repair | CorruptionDetected, RepairAttempted, RepairSucceeded, RepairFailed, ScrubCycleComplete |
| MVCC Transactions | TransactionCommit, TxnAborted, SerializationConflict, VersionGc, SnapshotAdvanced |
| Merge Resolution | MergeProofChecked, MergeApplied, MergeRejected, PolicySwitched, ContentionSample |
| Durability Policy | PolicyDecision, SymbolRefresh, DurabilityPolicyChanged, RefreshPolicyChanged |
| Write-back & Flush | FlushBatch, BackpressureActivated, DirtyBlockDiscarded, WalRecovery |
The CLI provides four presets for common operator queries:
ffs evidence <ledger> --preset replay-anomalies # WAL recovery + aborts + SSI conflicts
ffs evidence <ledger> --preset repair-failures # Corruption + repair outcomes + scrub cycles
ffs evidence <ledger> --preset pressure-transitions # Backpressure + flush + policy changes
ffs evidence <ledger> --preset contention # Merge proofs + policy switches + contention samples
The adaptive conflict policy tracks three EMA-smoothed rates that are periodically sampled to the evidence ledger (every 100 commits):
- conflict_rate: how often commits hit a newer version
- merge_success_rate: how often conflicts are resolved by merge (vs. abort)
- abort_rate: how often commits are aborted overall
These metrics also drive the PolicySwitched event when the adaptive policy changes its effective strategy.
FrankenFS uses a multi-layered testing strategy with 3,591+ tests across 21 crates.
| Category | Count | What it validates |
|---|---|---|
| Unit tests | ~1,800 | Per-function correctness, edge cases, error conditions |
| Property-based (proptest) | ~50 | Invariants that must hold for all inputs (posterior bounds, monotonicity, commutativity) |
| Stress tests | ~15 | Concurrent correctness under high contention (120-writer merge, hotspot retry fairness) |
| Crash matrices | ~20 | Recovery correctness at every crash point (MVCC WAL, writeback epochs) |
| Conformance fixtures | ~30 | Golden-file validation against real ext4/btrfs images |
| Verification gates | 5 | Epic-level acceptance: safe-merge, adaptive refresh, writeback-cache, all-profile comparison, decision boundary |
Each major subsystem has a verification gate test that proves the system meets its quantitative acceptance criteria:
| Gate | Key assertion |
|---|---|
| Safe-Merge (bd-m5wf.3.5) | 120 writers, zero corruption, SafeMerge 9.5x lower expected loss |
| Adaptive Refresh (bd-m5wf.4.5) | >10% expected-loss improvement under high-risk params, p95 reduction 83.3% |
| Writeback-Cache (bd-m5wf.2.5) | 12-scenario crash matrix, epoch monotonicity, benchmark framework operational |
The asupersync runtime provides a LabRuntime with virtual time and Dynamic Partial Order Reduction (DPOR) for schedule exploration. MVCC stress tests use this to reproduce concurrency bugs deterministically across seeds, rather than relying on thread scheduling luck.
| Type | Crate | Purpose |
|---|---|---|
| Cx | asupersync | Capability context passed to all async/IO operations |
| BlockNumber, InodeNumber, TxnId, CommitSeq | ffs-types | Strongly-typed newtypes preventing mix-ups |
| FfsError | ffs-error | 14-variant error enum with errno mappings |
| Superblock | ffs-ondisk | On-disk superblock (ext4/btrfs format-aware) |
| BlockDevice | ffs-block | Block I/O abstraction trait |
| Transaction | ffs-mvcc | MVCC transaction with staged writes and merge proofs |
| MergeProof | ffs-mvcc | Structured evidence for safe concurrent writes |
| ConflictPolicy | ffs-mvcc | Strict / SafeMerge / Adaptive conflict resolution |
| ContentionMetrics | ffs-mvcc | EMA-tracked conflict/merge/abort rates |
| MvccStore | ffs-mvcc | Single-threaded version store with snapshot isolation |
| ShardedMvccStore | ffs-mvcc | Multi-threaded partitioned version store |
| DurabilityAutopilot | ffs-repair | Bayesian overhead optimizer (Beta posterior) |
| RefreshLossModel | ffs-repair | Expected-loss comparison of refresh trigger policies |
| RefreshPolicy | ffs-repair | Eager / Lazy / Adaptive / Hybrid refresh triggers |
| StaleWindowSlo | ffs-repair | Percentile-based freshness SLO with breach detection |
| EvidenceRecord | ffs-repair | Self-contained JSONL audit entry (23 event types) |
| WritebackEpochBarrier | ffs-core | Per-inode staged/visible/durable epoch tracking |
| InodeEpochState | ffs-core | Three-counter monotonic epoch state per inode |
| OpenFs | ffs-core | FsOps implementation orchestrating all subsystems |
The block layer (ffs-block) provides a pluggable I/O abstraction with an adaptive cache and coordinated write-back, sitting between the raw disk image and every higher-level subsystem.
All I/O flows through a five-method trait:
pub trait BlockDevice: Send + Sync {
fn read_block(&self, cx: &Cx, block: BlockNumber) -> Result<BlockBuf>;
fn write_block(&self, cx: &Cx, block: BlockNumber, data: &[u8]) -> Result<()>;
fn block_size(&self) -> u32;
fn block_count(&self) -> u64;
fn sync(&self, cx: &Cx) -> Result<()>;
}
Every call takes &Cx, enabling cooperative cancellation and budget tracking even at the lowest I/O layer. A companion VectoredBlockDevice trait adds multi-block scatter/gather with a default implementation that delegates to scalar ops.
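To show how a backend plugs into this trait, here is a minimal in-memory device of the kind a unit test might use. Everything outside the five method names is a stand-in: `Cx`, `BlockNumber`, and `BlockBuf` are placeholder types defined locally, the error type is a plain `String` rather than the crate's own result alias, and `MemDevice` is a hypothetical name.

```rust
use std::sync::Mutex;

// Stand-ins for types that live in asupersync / ffs-types / ffs-block.
pub struct Cx;
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockNumber(pub u64);
pub type BlockBuf = Vec<u8>;

/// Same five methods as the trait above, with an explicit String error.
pub trait BlockDevice: Send + Sync {
    fn read_block(&self, cx: &Cx, block: BlockNumber) -> Result<BlockBuf, String>;
    fn write_block(&self, cx: &Cx, block: BlockNumber, data: &[u8]) -> Result<(), String>;
    fn block_size(&self) -> u32;
    fn block_count(&self) -> u64;
    fn sync(&self, cx: &Cx) -> Result<(), String>;
}

/// Hypothetical RAM-backed device for tests.
pub struct MemDevice {
    block_size: u32,
    block_count: u64,
    data: Mutex<Vec<u8>>,
}

impl MemDevice {
    pub fn new(block_size: u32, block_count: u64) -> Self {
        let len = block_size as usize * block_count as usize;
        Self { block_size, block_count, data: Mutex::new(vec![0; len]) }
    }

    /// Byte offset of a block, or an error if it is out of range.
    fn offset(&self, block: BlockNumber) -> Result<usize, String> {
        if block.0 >= self.block_count {
            return Err(format!("block {} out of range", block.0));
        }
        Ok(block.0 as usize * self.block_size as usize)
    }
}

impl BlockDevice for MemDevice {
    fn read_block(&self, _cx: &Cx, block: BlockNumber) -> Result<BlockBuf, String> {
        let start = self.offset(block)?;
        let data = self.data.lock().map_err(|e| e.to_string())?;
        Ok(data[start..start + self.block_size as usize].to_vec())
    }

    fn write_block(&self, _cx: &Cx, block: BlockNumber, buf: &[u8]) -> Result<(), String> {
        if buf.len() != self.block_size as usize {
            return Err("write must cover exactly one block".to_string());
        }
        let start = self.offset(block)?;
        let mut data = self.data.lock().map_err(|e| e.to_string())?;
        data[start..start + buf.len()].copy_from_slice(buf);
        Ok(())
    }

    fn block_size(&self) -> u32 {
        self.block_size
    }

    fn block_count(&self) -> u64 {
        self.block_count
    }

    fn sync(&self, _cx: &Cx) -> Result<(), String> {
        Ok(()) // nothing to persist for RAM
    }
}
```

Because the trait only needs `&self`, the device can sit behind a shared reference and be wrapped by a cache layer without changing call sites.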
AlignedVec provides heap-allocated byte vectors with configurable alignment (default 4096 bytes), enabling O_DIRECT and avoiding memcpy penalties on Linux. BlockBuf wraps an AlignedVec and tracks logical-to-physical block mapping metadata.
ArcCache<D> wraps any BlockDevice with an Adaptive Replacement Cache (ARC), a self-tuning algorithm that balances recency and frequency by maintaining four LRU lists (T1, T2, B1, B2). ARC adapts its recency-vs-frequency split based on observed miss patterns, improving hit rates for mixed workloads without manual tuning.
The cache supports two write policies:
| Policy | Behavior | Use case |
|---|---|---|
| WriteThrough | Every write_block immediately hits the device | Read-only mounts, simple correctness |
| WriteBack | Writes stay in cache until sync; dirty blocks cannot be evicted | Write-heavy workloads requiring batched I/O |
Write-back mode uses a two-watermark backpressure model:
- High watermark (80% dirty ratio): triggers aggressive flush of all dirty blocks
- Critical watermark (95% dirty ratio): blocks new writes until dirty ratio drops below high watermark
A background flush daemon periodically writes dirty blocks in configurable batches, with budget-aware throttling that reduces batch size when the Cx poll quota is low (avoiding starvation of other cooperative tasks).
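The two-watermark rule, including the hysteresis between the critical and high thresholds, can be modeled with a small state machine. This is a sketch only: `Backpressure`, `Action`, and `update` are hypothetical names, and the thresholds are taken from the percentages quoted above.

```rust
/// What the cache should do for the next incoming write.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Below the high watermark: accept writes normally.
    Normal,
    /// At or above the high watermark: flush dirty blocks aggressively.
    FlushAll,
    /// Writers are held until the dirty ratio falls below the high watermark.
    BlockWriters,
}

/// Two-watermark backpressure with hysteresis.
#[derive(Debug, Clone, Copy)]
pub struct Backpressure {
    high: f64,
    critical: f64,
    blocking: bool,
}

impl Backpressure {
    pub fn new(high: f64, critical: f64) -> Self {
        assert!(high < critical, "high watermark must be below critical");
        Self { high, critical, blocking: false }
    }

    /// Re-evaluate after the dirty count changes.
    pub fn update(&mut self, dirty_blocks: usize, capacity: usize) -> Action {
        if capacity == 0 {
            return Action::Normal;
        }
        let ratio = dirty_blocks as f64 / capacity as f64;
        if ratio >= self.critical {
            self.blocking = true;
        } else if ratio < self.high {
            self.blocking = false;
        }
        // Between the two watermarks the previous blocking state is kept.
        if self.blocking {
            Action::BlockWriters
        } else if ratio >= self.high {
            Action::FlushAll
        } else {
            Action::Normal
        }
    }
}
```

With watermarks at 0.80 and 0.95, a cache that reaches 96% dirty keeps blocking writers at 90% and only releases them once the ratio drops under 80%.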
FlushPinToken provides a coordination mechanism: the MVCC layer pins specific blocks during commit, preventing them from being evicted or flushed until the transaction is fully visible. This ensures that a partially-committed transaction's blocks aren't flushed to disk in an inconsistent order.
FrankenFS supports both ext4 and btrfs from a single binary. Format detection happens at mount time by probing the superblock at format-specific offsets.
ext4: offset 1024, size 1024 bytes, magic 0xEF53 at offset 0x38
btrfs: offset 65536, size 4096 bytes, magic "_BHRfS_M" at offset 0x40
The OpenFs::open() constructor reads both regions, attempts parsing in each format, and returns a FsFlavor::Ext4(superblock) or FsFlavor::Btrfs(superblock). If neither magic matches, DetectionError::UnsupportedImage is returned. The detected flavor determines which code paths are used for every subsequent operation: extent mapping, inode parsing, directory traversal, and journal recovery all dispatch through it.
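The probing step can be illustrated against a raw byte slice. This sketch only checks the two magic values at the offsets listed above; it is not OpenFs::open(), it does not parse the superblocks, and the `Detected` enum and `detect_format` function are hypothetical names.

```rust
/// Result of probing an image (stand-in for the FsFlavor / DetectionError pair).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Detected {
    Ext4,
    Btrfs,
    Unsupported,
}

const EXT4_SB_OFFSET: usize = 1024;
const EXT4_MAGIC_OFFSET: usize = 0x38;
const EXT4_MAGIC: u16 = 0xEF53;

const BTRFS_SB_OFFSET: usize = 65536;
const BTRFS_MAGIC_OFFSET: usize = 0x40;
const BTRFS_MAGIC: &[u8; 8] = b"_BHRfS_M";

/// Probe both superblock locations and report which magic matches.
pub fn detect_format(image: &[u8]) -> Detected {
    // ext4 stores its magic as a little-endian u16.
    let ext4_at = EXT4_SB_OFFSET + EXT4_MAGIC_OFFSET;
    if let Some(bytes) = image.get(ext4_at..ext4_at + 2) {
        if u16::from_le_bytes([bytes[0], bytes[1]]) == EXT4_MAGIC {
            return Detected::Ext4;
        }
    }

    // btrfs stores an 8-byte ASCII magic.
    let btrfs_at = BTRFS_SB_OFFSET + BTRFS_MAGIC_OFFSET;
    if let Some(bytes) = image.get(btrfs_at..btrfs_at + 8) {
        if bytes == BTRFS_MAGIC {
            return Detected::Btrfs;
        }
    }

    Detected::Unsupported
}
```

Using `get` instead of direct indexing means a truncated image falls through to `Unsupported` rather than panicking.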
The ffs-ondisk crate takes &[u8] byte slices and returns typed structures, with zero I/O. This enables:
- Fuzz-friendly parsing: byte slices can be generated by proptest or AFL without needing real images
- Snapshot testing: parse results can be serialized to JSON for golden-file comparison
- Cross-platform unit testing: parsing tests run on any platform, not just Linux with FUSE
Superblock (106 fields), group descriptors (32-bit and 64-bit variants), inodes (mode, timestamps, extent header/entries, inline data flag), extent tree (header + index + leaf entries, up to 4 levels deep), journal superblock (JBD2 header, transaction IDs), feature flags (compat, incompat, RO-compat with named bitfields).
Superblock (including sys_chunk_array, backup roots), B-tree header (node level, generation, owner), leaf items (key + offset + size), chunk items (stripe mapping for single-device), root items (root tree, extent tree, fs tree, checksum tree references), and device items for geometry validation.
FrankenFS implements two journal recovery systems: JBD2 replay for ext4 compatibility-mode images, and a native WAL (Write-Ahead Log) for MVCC transactions.
When mounting an ext4 image with a dirty journal (the needs_recovery flag is set), FrankenFS replays committed transactions from the JBD2 journal area:
- Scan: read journal blocks starting from the journal superblock's s_start offset
- Parse: identify descriptor blocks (which list the target blocks for each transaction), commit blocks (which seal a transaction), and revoke blocks (which cancel earlier writes)
- Apply: for each committed transaction (descriptor + commit pair found), write the journaled data blocks to their target locations on disk
- Revoke: honor revoke records by skipping any target blocks that were later revoked
- Finalize: clear the needs_recovery flag and reset journal sequence numbers
The replay is idempotent; replaying an already-clean journal is a no-op.
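The apply and revoke steps can be modeled on already-parsed transactions. The sketch below is a simplification, not the ffs-journal code: `JournalTxn` and `replay` are hypothetical, the on-disk block formats are skipped entirely, sequence-number wraparound is ignored, and the revoke table is built in a first pass before any block is applied.

```rust
use std::collections::HashMap;

/// A parsed journal transaction (hypothetical in-memory form).
#[derive(Debug, Clone)]
pub struct JournalTxn {
    pub sequence: u32,
    /// (target block number, journaled contents) from descriptor blocks.
    pub writes: Vec<(u64, Vec<u8>)>,
    /// Block numbers listed in revoke blocks.
    pub revokes: Vec<u64>,
    /// True if a matching commit block was found.
    pub committed: bool,
}

/// Replay committed transactions onto `disk`; returns the number of blocks written.
pub fn replay(txns: &[JournalTxn], disk: &mut HashMap<u64, Vec<u8>>) -> usize {
    // The valid log ends at the first transaction without a commit block.
    let committed: Vec<&JournalTxn> = txns.iter().take_while(|t| t.committed).collect();

    // Pass 1: for each revoked block, remember the newest revoking sequence.
    let mut revoked: HashMap<u64, u32> = HashMap::new();
    for t in &committed {
        for &block in &t.revokes {
            let newest = revoked.entry(block).or_insert(t.sequence);
            if t.sequence > *newest {
                *newest = t.sequence;
            }
        }
    }

    // Pass 2: apply writes in log order, skipping any write that is not
    // newer than the revoke record for its target block.
    let mut applied = 0;
    for t in &committed {
        for (block, data) in &t.writes {
            let is_revoked = revoked.get(block).map_or(false, |&r| t.sequence <= r);
            if is_revoked {
                continue;
            }
            disk.insert(*block, data.clone());
            applied += 1;
        }
    }
    applied
}
```

Running the same replay twice leaves the disk map unchanged the second time, which is the idempotence property in miniature.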
The native WAL in ffs-mvcc provides crash recovery for MVCC version-chain state:
- WAL segments: variable-length records with CRC32c integrity, storing committed transaction data (block writes, merge proofs, commit sequences)
- WAL writer: background task that batches pending records and flushes them to a WAL file with configurable sync policy
- WAL replay: on startup, replays WAL records to reconstruct the in-memory MVCC version store, skipping records already applied (idempotent replay with sequence-based deduplication)
- Crash matrix validation: 5 crash points (before record visible, after record before checksum, after checksum before sync, after sync before publish, repeated crash replay) each verified to produce correct recovery
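A checksummed record format with sequence-based deduplication can be sketched as below. The record layout (8-byte sequence, 4-byte length, payload, 4-byte CRC32C, all little-endian) and the function names are assumptions made for the example and are not the ffs-mvcc WAL format; the CRC32C routine is a plain bitwise implementation of the Castagnoli polynomial.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Encode one record: seq (u64 LE) | len (u32 LE) | payload | crc32c (u32 LE).
pub fn encode_record(seq: u64, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(16 + payload.len());
    out.extend_from_slice(&seq.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    let crc = crc32c(&out);
    out.extend_from_slice(&crc.to_le_bytes());
    out
}

fn read_u32(bytes: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(bytes);
    u32::from_le_bytes(buf)
}

fn read_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    u64::from_le_bytes(buf)
}

/// Replay a log: return records newer than `last_applied`, stopping at the
/// first torn or corrupt record (everything after it is untrusted).
pub fn replay_wal(log: &[u8], last_applied: u64) -> Vec<(u64, Vec<u8>)> {
    let mut records = Vec::new();
    let mut pos = 0usize;
    while log.len() - pos >= 16 {
        let seq = read_u64(&log[pos..pos + 8]);
        let len = read_u32(&log[pos + 8..pos + 12]) as usize;
        let body_end = pos + 12 + len;
        if body_end + 4 > log.len() {
            break; // torn tail: the record was only partially written
        }
        let stored = read_u32(&log[body_end..body_end + 4]);
        if crc32c(&log[pos..body_end]) != stored {
            break; // checksum mismatch
        }
        if seq > last_applied {
            records.push((seq, log[pos + 12..body_end].to_vec()));
        }
        pos = body_end + 4;
    }
    records
}
```

Replaying with `last_applied` set to the highest sequence already in the version store yields no records, which is what makes repeated crash-and-replay cycles harmless.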
FrankenFS uses a unified 14-variant error enum (FfsError) that maps cleanly to both POSIX errno values (for FUSE responses) and structured diagnostics (for evidence logging).
| Variant | errno | When it fires |
|---|---|---|
| Io | EIO | OS-level I/O failure |
| Corruption | EIO | Checksum mismatch, invalid metadata at a known block |
| Format | EINVAL | Wrong filesystem type, unsupported format version |
| Parse | EINVAL | On-disk structure doesn't decode |
| UnsupportedFeature | EOPNOTSUPP | Image requests a feature or mode this build still rejects (unknown incompat bits, unsupported mutation modes, unavailable write-side contracts) |
| IncompatibleFeature | EINVAL | Required compat bits missing or unknown incompat bits set |
| UnsupportedBlockSize | EINVAL | Block size outside 1K/2K/4K range |
| InvalidGeometry | EINVAL | Blocks-per-group, inodes-per-group, or other structural parameter out of range |
| MvccConflict | EAGAIN | First-committer-wins conflict (retry with fresh snapshot) |
| Cancelled | EINTR | Cx budget exhaustion or explicit cancel |
| NoSpace | ENOSPC | No free blocks or inodes |
| NotFound | ENOENT | File/directory/object lookup failed |
| PermissionDenied | EACCES | Insufficient permissions |
| ReadOnly | EROFS | Write attempted on read-only mount |
Additional variants handle directory semantics (NotDirectory / IsDirectory / NotEmpty / NameTooLong / Exists), repair failures (RepairFailed), and native-mode boundary violations.
Internal crate-specific errors (e.g., ParseError from ffs-types) are converted to FfsError at crate boundaries via From implementations. This ensures that every public API surface returns the unified type, while internal code can use more specific error types for precision.
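The boundary-conversion pattern looks roughly like this. The sketch covers only a subset of the variants in the table, the payload fields are assumptions, the errno constants are the Linux values written out by hand instead of pulled from the libc crate, and `parse_magic` / `read_magic` are hypothetical helpers that exist only to show the `?` conversion.

```rust
/// Internal parser error (stand-in for the ParseError mentioned above).
#[derive(Debug)]
pub struct ParseError(pub String);

/// Subset of the unified error enum; payload shapes are assumed.
#[derive(Debug)]
pub enum FfsError {
    Io(String),
    Corruption { block: u64 },
    Parse(String),
    MvccConflict,
    Cancelled,
    NoSpace,
    NotFound,
    ReadOnly,
}

impl FfsError {
    /// Map to a Linux errno value for the FUSE reply.
    pub fn to_errno(&self) -> i32 {
        match self {
            FfsError::Io(_) | FfsError::Corruption { .. } => 5, // EIO
            FfsError::Parse(_) => 22,                           // EINVAL
            FfsError::MvccConflict => 11,                       // EAGAIN
            FfsError::Cancelled => 4,                           // EINTR
            FfsError::NoSpace => 28,                            // ENOSPC
            FfsError::NotFound => 2,                            // ENOENT
            FfsError::ReadOnly => 30,                           // EROFS
        }
    }
}

/// Conversion applied at the crate boundary.
impl From<ParseError> for FfsError {
    fn from(e: ParseError) -> Self {
        FfsError::Parse(e.0)
    }
}

/// Internal helper that speaks the precise error type.
fn parse_magic(bytes: &[u8]) -> Result<u16, ParseError> {
    match bytes {
        [lo, hi, ..] => Ok(u16::from_le_bytes([*lo, *hi])),
        _ => Err(ParseError("need at least 2 bytes".to_string())),
    }
}

/// Public surface: `?` converts ParseError into FfsError via `From`.
pub fn read_magic(bytes: &[u8]) -> Result<u16, FfsError> {
    Ok(parse_magic(bytes)?)
}
```

The FUSE layer then only ever needs `to_errno()` on the unified type, regardless of which crate produced the failure.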
FrankenFS uses the asupersync runtime instead of Tokio for all async and concurrent operations. The design requires properties that Tokio cannot provide.
| Requirement | asupersync | Tokio |
|---|---|---|
| Structured concurrency (no orphan tasks) | Scope + region() | Manual JoinSet management |
| Cooperative cancellation via capability context | &Cx threaded through all calls | CancellationToken (opt-in, not universal) |
| Cancel-correct channels (no data loss) | Two-phase reserve()/send() | send() can lose data on cancel |
| Deterministic testing | LabRuntime with virtual time + DPOR | Non-deterministic executor |
| Budget-aware operations | Cx::budget() with poll quotas | No built-in budget mechanism |
Every I/O operation takes &Cx as its first parameter. The Cx carries:
- Budget: remaining poll quota and deadline, enabling cooperative yielding
- Cancellation: checked at every `cx.checkpoint()` call, propagating cancel through the call stack
- Deadline: operations automatically fail if the deadline expires
- Pressure: system pressure feedback for backpressure-aware algorithms
This eliminates ambient authority: no function can perform I/O without proving it has a valid context, and contexts cannot be fabricated without an explicit grant from the runtime.
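To make the checkpoint idea concrete, here is a toy model of a context that carries a poll budget and a cancel flag. It is not the asupersync API: the `Cx` struct below, its `with_budget` constructor, the `Interrupt` enum, and `sum_blocks` are all hypothetical stand-ins, and a real capability context would not be constructible by arbitrary code.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub enum Interrupt {
    Cancelled,
    BudgetExhausted,
}

/// Toy stand-in for a capability context; NOT the asupersync type.
pub struct Cx {
    cancelled: AtomicBool,
    polls_left: AtomicU32,
}

impl Cx {
    /// Hypothetical constructor; a real runtime would grant contexts itself.
    pub fn with_budget(polls: u32) -> Self {
        Cx {
            cancelled: AtomicBool::new(false),
            polls_left: AtomicU32::new(polls),
        }
    }

    pub fn cancel(&self) {
        self.cancelled.store(true, Ordering::Release);
    }

    /// Fails if cancelled; otherwise consumes one unit of budget.
    pub fn checkpoint(&self) -> Result<(), Interrupt> {
        if self.cancelled.load(Ordering::Acquire) {
            return Err(Interrupt::Cancelled);
        }
        self.polls_left
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
            .map(|_| ())
            .map_err(|_| Interrupt::BudgetExhausted)
    }
}

/// An I/O-shaped function: context first, one checkpoint per unit of work.
pub fn sum_blocks(cx: &Cx, blocks: &[u64]) -> Result<u64, Interrupt> {
    let mut total = 0u64;
    for b in blocks {
        cx.checkpoint()?; // cancellation and budget propagate via `?`
        total += b;
    }
    Ok(total)
}
```

Because the context is an explicit parameter, a function without a `&Cx` argument has no way to call `checkpoint()`, which is the property the paragraph above relies on.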
LabRuntime provides a virtual-time executor where:
- Task scheduling is deterministic for a given seed
- DPOR (Dynamic Partial Order Reduction) explores different scheduling interleavings
- Timeouts use virtual time (tests run instantly, not wall-clock)
- Correctness oracles can assert invariants at every scheduling point
This is how FrankenFS stress tests (e.g., the 120-writer merge-proof test) achieve reproducible results: each seed produces the same scheduling interleaving, making concurrency bugs debuggable rather than intermittent.
The allocator (ffs-alloc) manages free space using bitmap-based tracking with goal-directed placement, inspired by ext4's mballoc but reimplemented in safe Rust.
Free/used state is tracked per-block in packed bitmaps (one bit per block per group). Core primitives:
- `bitmap_find_free(bitmap, start)`: scan forward for the first free bit
- `bitmap_find_contiguous(bitmap, start, count)`: find `count` consecutive free bits
- `bitmap_count_free(bitmap)`: population count of zero bits
All operations are O(n) in group size but run on L1-cacheable data (a 32K-block group's bitmap fits in 4KB).
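A minimal sketch of the three primitives over a byte slice follows. The function names match the list above, but the exact signatures and return types are assumptions, as is the LSB-first bit order within each byte (the order ext4 bitmaps use). A set bit means "in use".

```rust
/// True if bit `i` is set (block in use). Bits are LSB-first within each byte.
fn bit_is_set(bitmap: &[u8], i: usize) -> bool {
    bitmap[i / 8] & (1u8 << (i % 8)) != 0
}

/// Scan forward from `start` for the first free (zero) bit.
pub fn bitmap_find_free(bitmap: &[u8], start: usize) -> Option<usize> {
    (start..bitmap.len() * 8).find(|&i| !bit_is_set(bitmap, i))
}

/// Find the first run of `count` consecutive free bits at or after `start`.
pub fn bitmap_find_contiguous(bitmap: &[u8], start: usize, count: usize) -> Option<usize> {
    if count == 0 {
        return None;
    }
    let mut run_start = start;
    let mut run_len = 0;
    for i in start..bitmap.len() * 8 {
        if bit_is_set(bitmap, i) {
            // A used bit breaks the run; the next candidate starts after it.
            run_len = 0;
            run_start = i + 1;
        } else {
            run_len += 1;
            if run_len == count {
                return Some(run_start);
            }
        }
    }
    None
}

/// Population count of zero bits across the whole bitmap.
pub fn bitmap_count_free(bitmap: &[u8]) -> usize {
    bitmap.iter().map(|b| b.count_zeros() as usize).sum()
}
```

This bit-at-a-time version favors clarity; a tuned implementation would likely scan a machine word at a time and skip fully-used words.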
When allocating blocks for a file, the allocator uses a three-tier strategy:
- Goal group/block: if the caller provides an `AllocHint` (e.g., "near block 1000 in group 3"), try that location first
- Nearby groups: scan groups within a distance of 8, alternating +/- direction from the goal group
- Full scan: linear scan through all groups as a last resort
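The three tiers above amount to a visiting order over block groups. The sketch below computes that order; `candidate_groups` and `NEARBY_DISTANCE` are hypothetical names, and trying `+d` before `-d` is an assumption about what "alternating" means here.

```rust
/// Search radius for the "nearby groups" tier, taken from the text.
const NEARBY_DISTANCE: u32 = 8;

/// Order in which block groups would be tried for a given goal group.
pub fn candidate_groups(goal: u32, group_count: u32) -> Vec<u32> {
    // An out-of-range goal degrades to a plain linear scan.
    if goal >= group_count {
        return (0..group_count).collect();
    }
    let mut seen = vec![false; group_count as usize];
    let mut order = Vec::with_capacity(group_count as usize);
    let mut visit = |g: u32| {
        if !seen[g as usize] {
            seen[g as usize] = true;
            order.push(g);
        }
    };

    // Tier 1: the goal group itself.
    visit(goal);

    // Tier 2: groups within the radius, alternating +d / -d.
    for d in 1..=NEARBY_DISTANCE {
        if let Some(g) = goal.checked_add(d).filter(|&g| g < group_count) {
            visit(g);
        }
        if let Some(g) = goal.checked_sub(d) {
            visit(g);
        }
    }

    // Tier 3: everything not yet visited, in linear order.
    for g in 0..group_count {
        visit(g);
    }
    order
}
```

For example, with a goal of group 3 in a 6-group filesystem, the order is 3, 4, 2, 5, 1, 0. An allocator would walk this order and stop at the first group that can satisfy the request.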
This locality-aware placement keeps related file extents physically contiguous, improving sequential read performance.
New directories use the Orlov allocator, which distributes directory inodes across groups to reduce contention. The algorithm biases toward groups with above-average free inode counts and low existing directory density, spreading the namespace tree across the disk.
The allocator enforces that metadata blocks (superblock copies, group descriptor tables, inode tables, bitmap blocks) are never allocated to file data. A two-phase validation marks reserved regions in a temporary bitmap before confirming any allocation.
File data in ext4 (and FrankenFS's native mode) is mapped via an extent tree, a compact B+tree stored in the inode's i_block field with optional overflow blocks.
Each extent maps a contiguous range of logical blocks to physical blocks:
ExtentMapping {
logical_start: u64, // File-relative block offset
physical_start: u64, // Disk-absolute block offset
count: u16, // Number of contiguous blocks (max 32768)
unwritten: bool, // Preallocated but not yet written (reads as zeros)
}
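Given a sorted slice of such mappings (for instance, the entries of one leaf node), resolving a logical block reduces to a binary search plus a range check. The sketch below assumes the struct fields shown above; the `Lookup` enum and `lookup_extent` function are hypothetical names, and a real lookup would first descend through index nodes to reach the right leaf.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ExtentMapping {
    pub logical_start: u64,
    pub physical_start: u64,
    pub count: u16,
    pub unwritten: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Lookup {
    /// The logical block is backed by a physical block.
    Mapped { physical: u64, unwritten: bool },
    /// No extent covers the block; it reads as zeros.
    Hole,
}

/// `extents` must be sorted by `logical_start` and non-overlapping.
pub fn lookup_extent(extents: &[ExtentMapping], logical: u64) -> Lookup {
    // Index of the first extent that starts strictly after `logical`.
    let idx = extents.partition_point(|e| e.logical_start <= logical);
    if idx == 0 {
        return Lookup::Hole; // before the first extent
    }
    let e = &extents[idx - 1];
    let delta = logical - e.logical_start;
    if delta < u64::from(e.count) {
        Lookup::Mapped {
            physical: e.physical_start + delta,
            unwritten: e.unwritten,
        }
    } else {
        Lookup::Hole // falls in the gap after this extent
    }
}
```

Callers would treat both `Hole` and `Mapped { unwritten: true, .. }` as zero-filled on read; only the latter already has disk space reserved.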
| Level | Location | Max entries | Coverage at 4KB blocks |
|---|---|---|---|
| Root | Inode `i_block[0..14]` (60 bytes) | 4 extents | 512 MB |
| Depth 1 | External blocks | 340 per block | ~43 GB |
| Depth 2 | Two-level index | 340 x 340 = 115,600 | ~14.5 TB |
| Depth 3 | Three-level index | ~39 million | ~4.9 PB |
- Lookup (`map_logical_to_physical`): binary search at each tree level, returning either a mapping (physical block found) or a hole (sparse region, reads as zeros)
- Insert (`allocate_extent`): request contiguous physical blocks from `ffs-alloc`, create an extent, insert into the tree with mid-point splitting when nodes overflow
- Delete (`truncate_extents`): remove all extents beyond a logical boundary, free physical blocks, collapse empty index nodes
- Mark written (`mark_written`): clear the unwritten flag on preallocated extents, splitting at range boundaries if the write doesn't cover the entire extent (produces up to 3 replacement extents: left-unwritten + middle-written + right-unwritten)
- Punch hole (`punch_hole`): remove block mappings within a range without changing file size (sparse file creation)
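The three-way split performed when marking part of a preallocated extent as written can be sketched as below. This is an illustration of the splitting rule only: `Extent` and `split_for_write` are hypothetical names, `count` is widened to `u64` to keep the arithmetic free of casts, and tree insertion of the resulting pieces is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Extent {
    pub logical_start: u64,
    pub physical_start: u64,
    pub count: u64, // widened from u16 for simplicity in this sketch
    pub unwritten: bool,
}

/// Split one unwritten extent around the written logical range `[start, end)`.
/// Returns 1 to 3 replacement extents in logical order.
pub fn split_for_write(ext: Extent, start: u64, end: u64) -> Vec<Extent> {
    let ext_end = ext.logical_start + ext.count;
    // Clamp the write to the part that overlaps this extent.
    let s = start.max(ext.logical_start);
    let e = end.min(ext_end);
    if !ext.unwritten || s >= e {
        return vec![ext]; // already written, or no overlap
    }

    // Build a sub-extent covering logical blocks [from, to).
    let piece = |from: u64, to: u64, unwritten: bool| Extent {
        logical_start: from,
        physical_start: ext.physical_start + (from - ext.logical_start),
        count: to - from,
        unwritten,
    };

    let mut out = Vec::with_capacity(3);
    if s > ext.logical_start {
        out.push(piece(ext.logical_start, s, true)); // left, still unwritten
    }
    out.push(piece(s, e, false)); // middle, now written
    if e < ext_end {
        out.push(piece(e, ext_end, true)); // right, still unwritten
    }
    out
}
```

A write that covers the whole extent yields a single written extent, so the flag is cleared in place without growing the tree.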
Directory entries in ext4 use a two-level indexing scheme: a hash tree (htree) provides block-level indexing, while entries within each block are stored as a linked list.
+--------+---------+----------+-----------+------+
| inode | rec_len | name_len | file_type | name |
| (4B) | (2B) | (1B) | (1B) | (var)|
+--------+---------+----------+-----------+------+
Entries are 4-byte aligned. Deleted entries have inode = 0 and their rec_len is coalesced with the previous entry (space reclamation without compaction).
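Walking the linked list inside one directory block follows directly from the layout above: read the 8-byte header, take `name_len` bytes of name, then jump ahead by `rec_len`. The sketch below assumes little-endian fields (as ext4 uses) and ignores the special `rec_len` encoding needed for 64KB blocks; `DirEntry` and `parse_dir_block` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct DirEntry {
    pub inode: u32,
    pub file_type: u8,
    pub name: Vec<u8>,
}

/// Parse the live entries of one linear directory block.
pub fn parse_dir_block(block: &[u8]) -> Result<Vec<DirEntry>, &'static str> {
    let mut entries = Vec::new();
    let mut pos = 0usize;
    while pos + 8 <= block.len() {
        let inode = u32::from_le_bytes([
            block[pos],
            block[pos + 1],
            block[pos + 2],
            block[pos + 3],
        ]);
        let rec_len = u16::from_le_bytes([block[pos + 4], block[pos + 5]]) as usize;
        let name_len = block[pos + 6] as usize;
        let file_type = block[pos + 7];

        // rec_len must cover the header, stay 4-byte aligned, and fit the block.
        if rec_len < 8 || rec_len % 4 != 0 || pos + rec_len > block.len() {
            return Err("invalid rec_len");
        }
        if 8 + name_len > rec_len {
            return Err("name overruns record");
        }
        // inode == 0 marks an unused slot; skip it but still follow rec_len.
        if inode != 0 {
            entries.push(DirEntry {
                inode,
                file_type,
                name: block[pos + 8..pos + 8 + name_len].to_vec(),
            });
        }
        pos += rec_len;
    }
    Ok(entries)
}
```

Validating `rec_len` before advancing matters: a zero or misaligned value in a corrupted block would otherwise cause an infinite loop or an out-of-bounds read.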
For directories with more than ~200 entries, ext4 uses a DX hash tree, a balanced B-tree indexed by a hash of the filename:
- Hash computation: half-MD4 or TEA hash (configurable per filesystem) with a 4-word seed for collision resistance
- Index lookup: binary search over `(hash, block)` pairs to find the leaf block containing entries with matching hash values
- Leaf scan: linear scan within the leaf block for the exact filename match (handles hash collisions)
The htree provides O(log n) lookup for large directories vs O(n) for linear scan.
- `add_entry`: insert a new directory entry, reusing deleted slots when available. If the entry doesn't fit in the target block, the htree (if present) directs placement to the correct hash bucket.
- `remove_entry`: mark the entry deleted (`inode = 0`) and coalesce `rec_len` with the previous entry
- `init_dir_block`: create the initial `.` and `..` entries for a new directory
Extended attributes (xattrs) provide per-file key-value metadata outside the standard POSIX inode fields. FrankenFS supports the full ext4 xattr model.
| Namespace | Prefix | Permission required | Typical use |
|---|---|---|---|
| User | `user.*` | File owner or `CAP_FOWNER` | Application metadata |
| System | `system.*` | `CAP_SYS_ADMIN` | POSIX ACLs |
| Security | `security.*` | `CAP_SYS_ADMIN` | SELinux labels, IMA |
| Trusted | `trusted.*` | `CAP_SYS_ADMIN` | Privileged daemon data |
Kernel-reference coverage now differentially validates
system.posix_acl_access and system.posix_acl_default against debugfs, and
the FUSE E2E suite covers mounted-path list/get behavior for those POSIX ACL
xattrs plus the missing-default absence contract on regular files.
xattrs use a two-tier storage strategy:
- Inline (fast path): stored directly in the inode after `extra_isize`, sharing the inode's block I/O. Limited by remaining inode space (~100-200 bytes for a 256-byte inode).
- External block (overflow): a separate block pointed to by `inode.file_acl`. Used when inline space is exhausted or values are large (up to 64KB per value).
The set operation tries inline first, then spills to external. The get operation checks both locations. This hybrid minimizes I/O for small attributes (the common case) while still supporting large values up to the per-value limit.
xattr set operations support two modes matching the XATTR_CREATE and XATTR_REPLACE flags:
- Create: fails if the attribute already exists (for atomic create-if-absent)
- Replace: fails if the attribute does not exist (for atomic update-if-present)
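The two modes reduce to a presence check before the write. The sketch below models the attribute store as an in-memory map; `SetMode`, `XattrError`, and `set_xattr` are hypothetical names. The error cases mirror Linux `setxattr(2)`, where `XATTR_CREATE` on an existing name yields `EEXIST` and `XATTR_REPLACE` on a missing name yields `ENODATA`.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SetMode {
    /// No flag: create or overwrite.
    Either,
    /// XATTR_CREATE: the attribute must not exist yet.
    Create,
    /// XATTR_REPLACE: the attribute must already exist.
    Replace,
}

#[derive(Debug, PartialEq, Eq)]
pub enum XattrError {
    Exists, // would map to EEXIST
    NoData, // would map to ENODATA
}

pub fn set_xattr(
    store: &mut BTreeMap<String, Vec<u8>>,
    name: &str,
    value: &[u8],
    mode: SetMode,
) -> Result<(), XattrError> {
    let present = store.contains_key(name);
    match (mode, present) {
        (SetMode::Create, true) => Err(XattrError::Exists),
        (SetMode::Replace, false) => Err(XattrError::NoData),
        _ => {
            store.insert(name.to_string(), value.to_vec());
            Ok(())
        }
    }
}
```

In a concurrent setting the check and the insert would have to happen inside the same transaction or snapshot for the modes to be atomic; the single-threaded map here sidesteps that.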
The inode layer (ffs-inode) manages the complete lifecycle of filesystem objects from birth to deletion.
Inodes are stored in per-group inode tables. Given an inode number, the location is computed as:
group = (ino - 1) / inodes_per_group
index = (ino - 1) % inodes_per_group
block = inode_table_block[group] + (index * inode_size) / block_size
offset = (index * inode_size) % block_size
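The same arithmetic as a small function, with a worked case. `InodeLocation` and `locate_inode` are hypothetical names, and the per-group table start blocks are passed in as a slice rather than read from group descriptors.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct InodeLocation {
    pub group: u32,
    pub block: u64,
    pub offset: u32,
}

/// Compute where inode `ino` lives on disk.
/// `inode_table_block[g]` is the first block of group `g`'s inode table.
pub fn locate_inode(
    ino: u32,
    inodes_per_group: u32,
    inode_size: u32,
    block_size: u32,
    inode_table_block: &[u64],
) -> Option<InodeLocation> {
    // Inode numbers are 1-based; 0 is never a valid inode.
    if ino == 0 || inodes_per_group == 0 || block_size == 0 {
        return None;
    }
    let group = (ino - 1) / inodes_per_group;
    let index = (ino - 1) % inodes_per_group;
    let table_start = *inode_table_block.get(group as usize)?;
    let byte_in_table = u64::from(index) * u64::from(inode_size);
    Some(InodeLocation {
        group,
        block: table_start + byte_in_table / u64::from(block_size),
        offset: (byte_in_table % u64::from(block_size)) as u32,
    })
}
```

For example, with 8192 inodes per group, 256-byte inodes, and 4KB blocks, inode 17 has index 16 in group 0, which is byte 4096 of the table: the second table block, offset 0.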
FrankenFS tracks four timestamps per inode with nanosecond precision:
| Field | When updated | ext4 field |
|---|---|---|
| `atime` | File read | `i_atime` + `i_atime_extra` (ns) |
| `mtime` | File data write | `i_mtime` + `i_mtime_extra` (ns) |
| `ctime` | Metadata change | `i_ctime` + `i_ctime_extra` (ns) |
| `crtime` | File creation | `i_crtime` + `i_crtime_extra` (ns) |
The _extra fields provide sub-second precision (nanoseconds) and extended epoch range (additional high bits for post-2038 timestamps).
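Decoding one timestamp pair can be sketched as follows, using the ext4 on-disk convention: the low 2 bits of the `_extra` field extend the epoch and the upper 30 bits hold nanoseconds, while the base field is a signed 32-bit second count. The function name `decode_timestamp` is hypothetical.

```rust
/// Combine a 32-bit seconds field and its `_extra` companion
/// into (seconds since the Unix epoch, nanoseconds).
pub fn decode_timestamp(seconds_lo: u32, extra: u32) -> (i64, u32) {
    // Low 2 bits: epoch extension, each step adds 2^32 seconds.
    let epoch = i64::from(extra & 0b11);
    // Upper 30 bits: nanoseconds within the second.
    let nanos = extra >> 2;
    // The base field is signed, so pre-1970 times stay representable.
    let secs = i64::from(seconds_lo as i32) + (epoch << 32);
    (secs, nanos)
}
```

With `extra = 0`, a base value of `0x8000_0000` decodes to a date in 1901; with the epoch bits set to 1, the same base value decodes to 2,147,483,648 seconds, the first second past the 2038 rollover.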
Each inode includes a CRC32c checksum (i_checksum_lo + i_checksum_hi) computed over the inode's raw bytes with the filesystem UUID as salt. FrankenFS validates this checksum on read and recomputes it on write, detecting single-bit corruption in the inode table.
- Read (`read_inode`): locate in inode table, read containing block, parse at byte offset, validate checksum
- Write (`write_inode`): recompute checksum, read-modify-write the containing block
- Create (`create_inode`): allocate from group's inode bitmap (via `ffs-alloc`), initialize fields, write to table
- Delete: clear inode bitmap bit, zero timestamps, update group free counts
The scrub pipeline (ffs-repair::pipeline) scans block integrity, emits evidence for detected corruption, and orchestrates recovery when it is run in repair-enabled mode. Mount-time background scrub uses the same pipeline in detection-only mode, leaving block repair and repair-symbol refresh to explicit ffs repair / ffs fsck --repair operations.
- Block scan: read every block in the group, compute checksum (CRC32c for compat mode, BLAKE3 for native mode)
- Mismatch detection: compare computed checksum against stored checksum; any difference is flagged as corruption
- Severity classification: single-bit flips vs multi-byte corruption vs unreadable blocks
- Evidence logging: every detection emits a `CorruptionDetected` record with block ID, expected/actual checksums, and severity
When corruption is detected:
- Load repair symbols: read the group's RaptorQ encoded symbols from the repair tail region
- Decode: feed available source blocks + repair symbols into the RaptorQ decoder
- Validate: verify the recovered block's checksum matches
- Write back: replace the corrupted block with the recovered data
- Refresh symbols: re-encode repair symbols with the corrected data (generation number advances)
Every step produces a structured evidence record: RepairAttempted (with source/symbol counts), RepairSucceeded or RepairFailed (with decoder statistics), ScrubCycleComplete (with aggregate stats). The evidence ledger provides a complete forensic timeline of every corruption event and its resolution.
For shared storage (multiple hosts accessing the same image), the repair ownership system uses optimistic lease-based coordination:
- A coordination record (`.<image>.ffs-repair-owner.json`) stores the owning host's UUID, hostname, and lease TTL
- Hosts attempt to claim ownership before performing write-side repair operations
- Expired leases can be taken over with deterministic tie-breaking (UUID comparison)
- Read-only scrub (detection only) does not require ownership
The FUSE layer (ffs-fuse) translates kernel FUSE protocol messages into filesystem operations, managing request scoping, backpressure, and thread dispatch.
Every FUSE callback follows a three-phase lifecycle:
1. begin_request_scope(cx, op) -> acquire MVCC snapshot, check backpressure
2. execute operation -> dispatch to ext4/btrfs handler in ffs-core
3. end_request_scope(cx, scope) -> release snapshot, update metrics
The RequestScope captures the MVCC snapshot at request start, ensuring that all reads within a single FUSE callback see a consistent point-in-time view of the filesystem, even if concurrent writers are committing new versions.
When the system is under pressure (high dirty-cache ratio, long GC pauses, or external memory pressure), the BackpressureGate can shed load by deferring or rejecting non-critical requests:
- Read operations: always proceed (readers never block)
- Write operations: may be delayed when dirty ratio exceeds the high watermark
- Metadata operations: may be shed when system pressure reaches critical levels
The DegradationFsm (finite state machine) tracks pressure level transitions and ensures that degradation decisions are monotonic. The system degrades progressively rather than oscillating between healthy and degraded states.
MountOptions {
read_only: true, // Default: safe read-only mount
allow_other: false, // FUSE allow_other for multi-user access
auto_unmount: true, // Clean up on process exit
worker_threads: 0, // 0 = auto (min(available_parallelism, 8))
}

The worker thread count maps to kernel FUSE queue tuning parameters (`max_background` and `congestion_threshold`), adjusting kernel-side request batching to match daemon capacity.
btrfs uses a fundamentally different on-disk structure than ext4. Instead of fixed-location group descriptors and inode tables, everything is stored in copy-on-write B-trees addressed by logical (virtual) block addresses that must be translated to physical disk offsets via a chunk mapping layer.
btrfs addresses blocks by logical address. To read a block, the logical address must be translated through the chunk map:
- Bootstrap: the superblock embeds a `sys_chunk_array` containing enough chunk entries to locate the chunk tree itself
- Chunk lookup: find the chunk entry whose `[key.offset, key.offset + length)` range contains the target logical address
- Stripe calculation: for single-device images, `physical = stripe.offset + (logical - chunk.key.offset)`
For RAID profiles (single, DUP, RAID0, RAID1, RAID5, RAID6, RAID10), the stripe calculation accounts for stripe width, sub-stripe interleaving, and mirror selection. FrankenFS V1 supports single-device profiles only.
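For the single-device case, the lookup and stripe calculation fit in a few lines. The sketch below uses a flat list of chunks with one stripe each; the `Chunk` struct and `logical_to_physical` function are hypothetical, and RAID profiles are deliberately out of scope.

```rust
/// One chunk with a single stripe (single-device profile only).
pub struct Chunk {
    /// Start of the logical range (the chunk item's key offset).
    pub logical_start: u64,
    /// Length of the logical range in bytes.
    pub length: u64,
    /// Physical byte offset of the stripe on the device.
    pub stripe_offset: u64,
}

/// Translate a logical address to a physical device offset.
/// Returns `None` if no chunk covers the address.
pub fn logical_to_physical(chunks: &[Chunk], logical: u64) -> Option<u64> {
    chunks
        .iter()
        // Half-open range check: [logical_start, logical_start + length).
        .find(|c| logical >= c.logical_start && logical - c.logical_start < c.length)
        .map(|c| c.stripe_offset + (logical - c.logical_start))
}
```

A linear scan is enough for illustration; with many chunks, a map keyed by `logical_start` would allow a range lookup instead.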
walk_tree performs a depth-first traversal of any btrfs B-tree (root tree, extent tree, fs tree, checksum tree):
- Read the node at the given logical address (translate via chunk map)
- Parse the header: level, generation, number of items, owner
- If leaf (level 0): parse all items and collect them
- If internal (level > 0): recursively walk each child pointer in key order
- Cycle detection: maintain an `active_path` set of logical addresses; reject if a cycle is found
- Depth bound: reject trees deeper than 7 levels (btrfs maximum)
- Visit deduplication: a `visited_nodes` set prevents re-reading shared subtrees (COW sharing)
This produces all leaf items in key order, which are then dispatched by item type (inode, dir entry, extent data, xattr, etc.) to build the filesystem's in-memory structures.
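The traversal guards can be sketched over an in-memory node map. This is a simplified model, not the real `walk_tree`: nodes are looked up by address in a `HashMap` instead of being read through the chunk map, items are bare `u64` keys, and depth is counted from 1 at the root, which is an assumption about how the 7-level bound is applied. The `Node` and `WalkError` types are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Depth bound from the text; counting starts at 1 for the root.
const MAX_DEPTH: usize = 7;

pub enum Node {
    Leaf(Vec<u64>),     // item keys, already in key order
    Internal(Vec<u64>), // child addresses, in key order
}

#[derive(Debug, PartialEq, Eq)]
pub enum WalkError {
    MissingNode(u64),
    Cycle(u64),
    TooDeep,
}

/// Depth-first walk collecting all leaf items.
pub fn walk_tree(nodes: &HashMap<u64, Node>, root: u64) -> Result<Vec<u64>, WalkError> {
    let mut items = Vec::new();
    let mut active_path = HashSet::new();
    let mut visited = HashSet::new();
    walk(nodes, root, 1, &mut active_path, &mut visited, &mut items)?;
    Ok(items)
}

fn walk(
    nodes: &HashMap<u64, Node>,
    addr: u64,
    depth: usize,
    active_path: &mut HashSet<u64>,
    visited: &mut HashSet<u64>,
    items: &mut Vec<u64>,
) -> Result<(), WalkError> {
    if depth > MAX_DEPTH {
        return Err(WalkError::TooDeep);
    }
    // A node that is its own ancestor means the tree is corrupt.
    if active_path.contains(&addr) {
        return Err(WalkError::Cycle(addr));
    }
    // A shared subtree that was already read is skipped, not an error.
    if !visited.insert(addr) {
        return Ok(());
    }
    match nodes.get(&addr).ok_or(WalkError::MissingNode(addr))? {
        Node::Leaf(keys) => items.extend_from_slice(keys),
        Node::Internal(children) => {
            active_path.insert(addr);
            for &child in children {
                walk(nodes, child, depth + 1, active_path, visited, items)?;
            }
            active_path.remove(&addr);
        }
    }
    Ok(())
}
```

The order of the two set checks matters: testing `active_path` first distinguishes a genuine cycle (an error) from a subtree that is merely shared between two parents (skipped silently).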
| Key type | Object ID | Item type | What it represents |
|---|---|---|---|
| `INODE_ITEM` | inode number | 1 | Inode metadata (size, mode, timestamps) |
| `DIR_ITEM` | parent inode | 84 | Directory entry (name + child inode) |
| `EXTENT_DATA` | inode number | 108 | File extent (inline data or disk reference) |
| `ROOT_ITEM` | tree ID | 132 | Root of a subvolume or internal tree |
| `CHUNK_ITEM` | logical offset | 228 | Chunk-to-physical mapping |
MVCC version chains grow with every write to a block. Without compression, a hot block with thousands of versions would consume unbounded memory. FrankenFS uses three strategies to keep chains compact.
When a new version's bytes are identical to the previous version (common for metadata blocks that are "touched" but unchanged), the version chain stores an Identical marker with zero data:
Version chain: [Full(4KB), Identical, Identical, Full(4KB), Identical]
Memory: 4096 0 0 4096 0 = 8192 bytes
Without dedup: 4096 4096 4096 4096 4096 = 20480 bytes
Resolution walks backward from the Identical marker to find the nearest Full or compressed version.
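The backward walk and the memory accounting from the diagram can be sketched together. The `Version` enum here carries only the two variants needed for the example, the chain is assumed to be ordered oldest to newest, and `resolve` / `stored_bytes` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Version {
    Full(Vec<u8>),
    Identical,
}

/// Resolve the bytes visible at `index` by walking backward
/// to the nearest version that actually stores data.
pub fn resolve(chain: &[Version], index: usize) -> Option<&[u8]> {
    chain.get(..=index)?.iter().rev().find_map(|v| match v {
        Version::Full(bytes) => Some(bytes.as_slice()),
        Version::Identical => None,
    })
}

/// Payload bytes held by the chain; `Identical` markers cost nothing.
pub fn stored_bytes(chain: &[Version]) -> usize {
    chain
        .iter()
        .map(|v| match v {
            Version::Full(bytes) => bytes.len(),
            Version::Identical => 0,
        })
        .sum()
}
```

For the five-entry chain in the diagram, `stored_bytes` gives 8192, compared with 20480 if every version kept its own 4KB copy. A chain that begins with `Identical` has nothing to resolve against, so `resolve` returns `None` there.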
Versions can be stored as Zstd or Brotli compressed data:
| Variant | Compression | Decompression | Best for |
|---|---|---|---|
| `Full(Vec<u8>)` | None | None | Hot blocks accessed frequently |
| `Zstd(Vec<u8>)` | Zstd | Zstd | Cold blocks in long chains |
| `Brotli(Vec<u8>)` | Brotli | Brotli | Maximum compression ratio |
| `Identical` | None | Walk backward | Metadata blocks touched but unchanged |
The CompressionPolicy configures which algorithm to use and at what chain depth compression kicks in.
Configurable maximum chain length per block. When exceeded, the oldest versions beyond the cap are pruned, but only if no active snapshot still needs them. The GC watermark tracks the oldest active snapshot; versions older than the watermark and beyond the cap are retired via epoch-based reclamation (crossbeam-epoch), ensuring no reader ever sees a dangling reference.
A critical chain length (4x the cap) triggers backpressure: writers to that block are rejected with ChainBackpressure until GC catches up. This prevents unbounded memory growth even under pathological write patterns.
The ffs-harness crate provides the testing infrastructure that validates FrankenFS against real filesystem images and tracks feature parity quantitatively.
Real ext4 and btrfs images are impractical to check into git. FrankenFS uses sparse JSON fixtures instead, files containing only the non-zero byte regions of an image:
{"offset": 1024, "data": "0xEF530001..."} // ext4 superblock
{"offset": 2048, "data": "..."} // group descriptors

`load_sparse_fixture()` reconstructs a full-size byte buffer from these sparse entries, which can then be parsed by `ffs-ondisk` the same way a real image would be. This keeps fixtures under a few KB while covering the full parse surface.
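The reconstruction step itself is a zero-filled buffer plus a copy per entry. The sketch below starts from entries that are already decoded, since JSON and hex parsing would need an external crate; `SparseEntry` and `reconstruct_image` are hypothetical names and not the harness's actual types.

```rust
/// One non-zero region of the image, already decoded to raw bytes.
pub struct SparseEntry {
    pub offset: usize,
    pub data: Vec<u8>,
}

/// Rebuild a full-size image buffer from sparse entries.
/// Everything not covered by an entry stays zero.
pub fn reconstruct_image(image_size: usize, entries: &[SparseEntry]) -> Result<Vec<u8>, String> {
    let mut image = vec![0u8; image_size];
    for e in entries {
        // Reject entries that would write past the end of the image.
        let end = e
            .offset
            .checked_add(e.data.len())
            .filter(|&end| end <= image_size)
            .ok_or_else(|| format!("entry at offset {} overruns image", e.offset))?;
        image[e.offset..end].copy_from_slice(&e.data);
    }
    Ok(image)
}
```

As a usage example, an entry holding the bytes `53 EF` at offset 1080 places the ext4 magic (`0xEF53`, little-endian) at byte 0x38 of a superblock that starts at offset 1024.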
Parse results are serialized to JSON and compared against golden files. If the parse output changes, the test fails with a diff showing exactly which fields changed. This catches regressions from refactoring and ensures that ext4/btrfs compatibility is maintained across code changes.
FEATURE_PARITY.md is both a human-readable document and a machine-parseable data source. The harness reads it to generate a quantitative parity report:
Domain Implemented Total Coverage
ext4 metadata parsing 27 27 100.0%
btrfs metadata parsing 27 27 100.0%
MVCC/COW core 14 14 100.0%
FUSE surface 19 19 100.0%
self-healing durability 10 10 100.0%
Every feature is either implemented (with a test ID), explicitly excluded (with a reason documented in the spec), or tracked as in-progress. No feature can silently fall out of scope.
The runtime metrics system provides three metric types, all based on lock-free atomic operations:
| Type | Operation | Use case |
|---|---|---|
| Counter | `increment(n)` | Total operations, bytes transferred, errors |
| Gauge | `set(val)` / `adjust(delta)` | Current cache size, active snapshots, dirty blocks |
| Histogram | `observe(value)` | Latency distributions with fixed buckets |
Metrics are registered with a global MetricsRegistry that supports enable/disable toggle (zero overhead when disabled), rolling window snapshots for time-series analysis, and JSON export for integration with monitoring systems. A noop_handle provides a zero-cost placeholder for optional metric paths.
FrankenFS exposes several tuning knobs for operators and benchmark authors. All are configured via struct fields (no magic environment variables).
| Parameter | Default | Effect |
|---|---|---|
| `ConflictPolicy` | `SafeMerge` | `Strict` for maximum safety, `Adaptive` for auto-tuning |
| `AdaptivePolicyConfig.ema_alpha` | 0.1 | EMA smoothing (higher = more responsive to contention changes) |
| `AdaptivePolicyConfig.warmup_commits` | 20 | Commits before adaptive policy activates |
| `CompressionPolicy.max_chain_length` | None | Version chain cap (enables GC when set) |
| `CompressionPolicy.dedup_identical` | true | Identical-version deduplication |
| `GcBackpressureConfig.min_poll_quota` | 256 | Budget threshold for GC batch throttling |

| Parameter | Default | Effect |
|---|---|---|
| `ArcWritePolicy` | `WriteThrough` | `WriteBack` for batched I/O (requires flush daemon) |
| `DIRTY_HIGH_WATERMARK` | 0.80 | Dirty ratio triggering aggressive flush |
| `DIRTY_CRITICAL_WATERMARK` | 0.95 | Dirty ratio blocking new writes |
| `FlushDaemonConfig.batch_size` | Configurable | Dirty blocks per flush cycle |
| `FlushDaemonConfig.interval` | Configurable | Sleep between flush cycles |

| Parameter | Default | Effect |
|---|---|---|
| `DurabilityAutopilot.min_overhead` | 0.03 | Minimum repair symbol overhead (3%) |
| `DurabilityAutopilot.max_overhead` | 0.10 | Maximum repair symbol overhead (10%) |
| `DurabilityAutopilot.metadata_multiplier` | 2.0 | Extra overhead for metadata groups |
| `RefreshPolicy` | `Lazy { 30s }` | `Eager` for metadata, `Hybrid` for write-heavy groups |
| `StaleWindowSlo.max_age_ms` | 60,000 | SLO breach threshold (age) |
| `StaleWindowSlo.max_writes` | 5,000 | SLO breach threshold (writes) |
| `StaleWindowSlo.percentile` | 0.95 | Percentile for SLO evaluation |

| Parameter | Default | Effect |
|---|---|---|
| `MountOptions.read_only` | `true` | Safe default; `--rw` for experimental writes |
| `MountOptions.worker_threads` | 0 (auto) | `min(available_parallelism, 8)` |
| `MountOptions.allow_other` | `false` | Multi-user FUSE access |
The porting doctrine is a concrete workflow with traceable artifacts at every step.
Legacy C code (e.g., fs/ext4/extents.c) is read for its behavioral contract, not its implementation. The output is a structured Markdown document (EXISTING_EXT4_BTRFS_STRUCTURE.md, 95KB) that captures:
- What each function does (not how it does it)
- What invariants it maintains
- What error conditions it handles
- What on-disk format constraints it enforces
The behavioral spec is mapped to a Rust crate/module structure (PROPOSED_ARCHITECTURE.md, 18KB). Key decisions:
- Which behaviors become traits vs concrete types
- Where crate boundaries go (parser vs I/O vs policy)
- What the dependency DAG looks like
- What the testing strategy is for each component
Code is written from the spec, not by translating C control flow. This means:
- No `goto` → loop patterns. Rust's `?` operator and `match` replace C's error-handling gotos.
- No manual memory management. `Vec`, `Box`, `Arc` replace `kmalloc`/`kfree`.
- No global state. `&Cx` replaces the kernel's ambient `current` task context.
- Enum-based dispatch replaces function pointer tables for format-specific operations.
The ffs-harness crate validates the implementation against real filesystem images using golden-file comparison. Sparse JSON fixtures capture expected parse results; the harness reads the same image and asserts field-by-field equality. Feature parity is tracked quantitatively in FEATURE_PARITY.md. Every feature is either implemented (with a test), explicitly excluded (with a reason), or marked as in-progress.
This doctrine produces code that is typically 3-5x more concise than the original C while handling the same behavioral surface. For example, the ext4 extent tree implementation handles the full 4-level tree structure in ~300 lines of Rust vs ~3,000 lines of kernel C, because Rust's type system, iterators, and error handling eliminate the boilerplate that dominates kernel code.
FrankenFS's security posture is defined by three hard constraints, not best-effort guidelines.
#![forbid(unsafe_code)] is set at every crate root and enforced as a workspace-level Clippy lint. This means:
- No buffer overflows (bounds-checked indexing)
- No use-after-free (ownership system prevents it)
- No uninitialized memory reads (all values initialized by construction)
- No data races (Rust's `Send`/`Sync` system enforces this at compile time)
The performance cost is negligible: FUSE protocol overhead (~10us per round-trip) dominates any bounds-check overhead (~1ns per access).
Every I/O operation requires an explicit &Cx capability. Code that doesn't have a Cx reference cannot perform I/O, read the clock, or sleep. This prevents:
- Hidden side effects in "pure" code paths
- Resource leaks from forgotten cancel handlers
- Accidental I/O in unit tests (test contexts have explicit budgets)
On mount, FrankenFS validates the image against a strict compatibility contract:
- Required feature flags must be present (`FILETYPE` for ext4)
- All known incompat feature flags are accepted (COMPRESSION, JOURNAL_DEV, ENCRYPT, CASEFOLD, INLINE_DATA, etc.)
- Unknown incompat bits cause rejection
- Geometry parameters must be within supported ranges
- Superblock checksum must validate (ext4 CRC32c)
Images that fail validation are rejected with a specific error variant (UnsupportedFeature, IncompatibleFeature, InvalidGeometry). FrankenFS will not attempt to "best-effort" parse a potentially incompatible image.
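A reduced version of these checks might look like the sketch below. The incompat bit values are the standard ext4 ones, but `KNOWN_INCOMPAT` is an illustrative subset rather than the full mask FrankenFS accepts, and `MountError` / `validate_mount` are hypothetical names. Which error variant each rejection maps to in the real code is not stated here, so the choice below is an assumption.

```rust
// Standard ext4 incompat feature bits (subset).
const INCOMPAT_COMPRESSION: u32 = 0x0001;
const INCOMPAT_FILETYPE: u32 = 0x0002;
const INCOMPAT_JOURNAL_DEV: u32 = 0x0008;
const INCOMPAT_EXTENTS: u32 = 0x0040;
const INCOMPAT_64BIT: u32 = 0x0080;
const INCOMPAT_FLEX_BG: u32 = 0x0200;

/// Illustrative mask only; the real accepted set is larger.
const KNOWN_INCOMPAT: u32 = INCOMPAT_COMPRESSION
    | INCOMPAT_FILETYPE
    | INCOMPAT_JOURNAL_DEV
    | INCOMPAT_EXTENTS
    | INCOMPAT_64BIT
    | INCOMPAT_FLEX_BG;

#[derive(Debug, PartialEq, Eq)]
pub enum MountError {
    /// Carries the offending (missing or unknown) bits.
    IncompatibleFeature(u32),
    UnsupportedBlockSize(u32),
}

/// Fail-fast validation: the first violated rule rejects the image.
pub fn validate_mount(incompat: u32, block_size: u32) -> Result<(), MountError> {
    // Required flag must be present.
    if (incompat & INCOMPAT_FILETYPE) == 0 {
        return Err(MountError::IncompatibleFeature(INCOMPAT_FILETYPE));
    }
    // Any bit outside the known mask is a hard rejection.
    let unknown = incompat & !KNOWN_INCOMPAT;
    if unknown != 0 {
        return Err(MountError::IncompatibleFeature(unknown));
    }
    // Only 1K/2K/4K block sizes are in range.
    if !matches!(block_size, 1024 | 2048 | 4096) {
        return Err(MountError::UnsupportedBlockSize(block_size));
    }
    Ok(())
}
```

Returning the offending bits in the error keeps the rejection specific enough to log, in line with the "specific error variant" requirement above.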
# Requires Rust nightly (managed automatically via rust-toolchain.toml)
git clone https://github.com/Dicklesworthstone/frankenfs.git
cd frankenfs
cargo build --workspace

The `rust-toolchain.toml` pins the nightly channel. Cargo handles the rest.
- Rust nightly (edition 2024, minimum version 1.85)
- Linux (FUSE target; `libfuse-dev` or `fuse3` for mount support)
- FUSE headers: `sudo apt install libfuse-dev` (Debian/Ubuntu) or `sudo dnf install fuse-devel` (Fedora)
# 1. Clone
git clone https://github.com/Dicklesworthstone/frankenfs.git
cd frankenfs
# 2. Build
cargo build --workspace
# 3. Run tests across the workspace
cargo test --workspace
# 4. Inspect a filesystem image
cargo run -p ffs-cli -- inspect /path/to/ext4.img --json
# 5. Run conformance parity report
cargo run -p ffs-harness -- parity

# Inspect ext4 or btrfs image metadata (JSON output)
cargo run -p ffs-cli -- inspect <image-path> --json
# Show MVCC/EBR version-chain statistics
cargo run -p ffs-cli -- mvcc-stats <image-path> --json
# Show filesystem information (superblock + optional sections)
cargo run -p ffs-cli -- info <image-path> --groups --mvcc --repair --journal --json
# Dump low-level metadata structures
cargo run -p ffs-cli -- dump superblock <image-path> --json --hex
cargo run -p ffs-cli -- dump inode 2 <image-path> --json
cargo run -p ffs-cli -- dump extents 12 <image-path> --json
cargo run -p ffs-cli -- dump dir 2 <image-path> --json
# Mount an ext4 or btrfs image via FUSE (default read-only)
cargo run -p ffs-cli -- mount <image-path> <mountpoint>
# Enable experimental read-write mode
cargo run -p ffs-cli -- mount <image-path> <mountpoint> --rw
# Run a read-only scrub over image blocks
cargo run -p ffs-cli -- scrub <image-path> --json
# Run offline filesystem checks
cargo run -p ffs-cli -- fsck <image-path> --repair --json
# Run manual repair workflow
cargo run -p ffs-cli -- repair <image-path> --json
# Show current feature parity report
cargo run -p ffs-cli -- parity --json
# Inspect repair evidence ledger (JSONL) with presets
cargo run -p ffs-cli -- evidence <ledger-path> --json --tail 50
cargo run -p ffs-cli -- evidence <ledger-path> --preset contention
cargo run -p ffs-cli -- evidence <ledger-path> --preset repair-failures
cargo run -p ffs-cli -- evidence <ledger-path> --preset replay-anomalies
cargo run -p ffs-cli -- evidence <ledger-path> --preset pressure-transitions
# Create a new ext4 image (wraps mkfs.ext4 + validation)
cargo run -p ffs-cli -- mkfs <output-image> --size-mb 64 --block-size 4096 --label frankenfs --json

# Validate conformance fixtures against golden data
cargo run -p ffs-harness -- check-fixtures
# Generate feature parity report
cargo run -p ffs-harness -- parity
# Run benchmarks
cargo bench -p ffs-harness

Canonical golden/conformance verification gate:
./scripts/verify_golden.sh

That script is the checked-in gate used by CI for artifact integrity and `ffs-harness` conformance checks. Its cargo-heavy steps are routed through `rch`.
# These four commands must pass before any merge
cargo fmt --check
cargo check --all-targets
cargo clippy --all-targets -- -D warnings
cargo test --workspace

| Crate | Role |
|---|---|
| `asupersync` | Structured async runtime, Cx capability contexts, deterministic lab runtime, RaptorQ codec |
| `ftui` | Terminal UI framework for ffs-tui |
| `crc32c` | ext4-compatible checksums |
| `blake3` | Native-mode integrity checksums |
| `parking_lot` | Fast synchronization primitives |
| `crossbeam-epoch` | Epoch-based reclamation for MVCC version GC |
| `bitflags` | Filesystem flags and mode bits |
| `thiserror` | Error type derivation |
| `criterion` | Benchmark harness |
| `proptest` | Property-based testing for tree invariants and autopilot parameters |
FrankenFS is in early development. The tracked V1 parity matrix is complete (100%), meaning every item in FEATURE_PARITY.md's current denominator has an implemented and tested contract. Ongoing work is focused on operational hardening of three major subsystems that already reached verification-gate maturity:
| Subsystem | Status | Key metric |
|---|---|---|
| Safe-Merge Conflict Arbitration | Verified | 120-writer stress test, SafeMerge expected-loss 9.5x lower than Strict |
| Adaptive Repair Symbol Refresh | Verified | Hybrid policy p95 stale-window reduction 83.3% under heavy writes |
| FUSE Writeback-Cache Barriers | Verified | 12-scenario crash matrix, epoch monotonicity preserved |
| Domain | Coverage |
|---|---|
| ext4 metadata parsing | 100.0% (27/27) |
| btrfs metadata parsing | 100.0% (27/27) |
| MVCC/COW core | 100.0% (14/14) |
| FUSE surface | 100.0% (19/19) |
| self-healing durability policy | 100.0% (10/10) |
| Overall | 100.0% (97/97) |
Rows in the btrfs experimental RW contract can still say partially supported or unsupported without reducing tracked parity coverage when the expected V1 behavior is a deterministic partial-success or explicit rejection path that is implemented and tested.
- ext4: Superblock, inode, extent header/entry, group descriptor, feature flag decoding, mount-time journal recovery, FUSE mount (RO default, experimental RW)
- btrfs: Superblock, B-tree header, leaf item metadata, geometry validation, RAID stripe mapping, FUSE mount (RO default, experimental RW with core mutations)
- MVCC: Snapshot visibility, commit sequencing, first-committer-wins conflict detection, safe-merge proof resolution (AppendOnly, IndependentKeys, NonOverlappingExtents, TimestampOnlyInode, DisjointBlocks), adaptive conflict policy with EMA contention tracking, sharded concurrent store, WAL persistence and crash recovery
- Self-healing: Bayesian durability autopilot, RaptorQ symbol generation/recovery, hybrid refresh policy (age + block-count triggers), stale-window SLO monitoring with percentile-based breach detection, multi-host repair ownership coordination, expected-loss model for policy comparison
- Writeback-cache: Epoch-based commit barriers with per-inode staged/visible/durable tracking, deferred visibility for MVCC isolation, 12-scenario crash consistency proof, benchmark framework for barrier overhead measurement
- Observability: Evidence ledger (23 event types, 5 presets), contention metrics (EMA conflict/merge/abort rates), policy-switch detection, structured logging across all subsystems
- CLI: `inspect`, `mvcc-stats`, `info`, `dump`, `fsck`, `repair`, `mount`, `scrub`, `parity`, `evidence`, `mkfs`
- Testing: 3,591+ tests across 21 crates, including property-based tests, crash matrices, 120-writer stress tests, and verification gates
- Hot-path profiling and optimization (MVCC version chain pruning, extent tree caching, ARC hit rate)
- xfstests conformance baseline and root cause analysis
- CI-compatible FUSE E2E test runner
- Fuzz corpus expansion for WAL/MVCC/extent paths
- btrfs operational hardening for multi-device/subvolume paths and broader write-side coverage
See FEATURE_PARITY.md for the full capability matrix and PLAN_TO_PORT_FRANKENFS_TO_RUST.md for the 9-phase roadmap.
ext4: Single-device images with block sizes 1K/2K/4K. Requires FILETYPE feature flag (EXTENTS optional — indirect block addressing is supported). FUSE mount defaults to read-only; --rw is available but still experimental. All known incompat feature flags are accepted at mount time. COMPRESSION now includes ext4 e2compr read/write support for the implemented gzip/LZO/"none" method-table paths, with rare legacy codecs still rejected deterministically. JOURNAL_DEV images are detected as standalone journal devices; data filesystems that reference an external journal support paired-open replay via --external-journal, with UUID/block-size validation and fail-fast errors when required recovery cannot be performed safely. ENCRYPT shows filenames as raw bytes (nokey mode). CASEFOLD provides case-insensitive directory lookup. INLINE_DATA reads data from inode block area + system.data xattr.
btrfs: Single- and multi-device images with RAID0/1/5/6/10/DUP support. Metadata parsing + validation (superblock, leaf items, sys_chunk_array, chunk tree walking). The tracked V1 FUSE mount/runtime contract is fully covered, but the operator-facing mount path remains explicitly experimental (default read-only, optional --rw). Transparent decompression (ZLIB/LZO/ZSTD), named subvolume/snapshot selection via --subvol and --snapshot, tree-log replay, and send/receive stream parsing are implemented; CLI/operator-path tests now prove selected-root scoping before the FUSE mount stage.
| Operation class | Status | Contract |
|---|---|---|
| Core mutations (create, mkdir, unlink, rmdir, rename, write, setattr, link, symlink, xattrs) | Supported (experimental) | Deterministic success/error behavior under ffs-core + FUSE tests, including explicit xattr mode semantics (Create/Replace) |
| fallocate (mode=0, FALLOC_FL_KEEP_SIZE) | Partially supported | Preallocation paths are supported and validated |
| fallocate (FALLOC_FL_PUNCH_HOLE\|FALLOC_FL_KEEP_SIZE) | Supported (experimental) | Success path zero-fills the requested range while preserving file size and unaffected bytes |
| fallocate (FALLOC_FL_ZERO_RANGE, optional FALLOC_FL_KEEP_SIZE) | Supported (experimental) | Success path zero-fills the requested range; KEEP_SIZE preserves EOF while non-KEEP_SIZE can extend file size |
| fallocate (FALLOC_FL_COLLAPSE_RANGE) | Supported (experimental) | Success path removes the requested aligned range, shifts the tail left, shrinks the file size, and preserves shifted prealloc extents as FIEMAP UNWRITTEN |
| fallocate (FALLOC_FL_INSERT_RANGE) | Supported (experimental) | Success path inserts an aligned hole, shifts the tail right, grows the file size, and preserves shifted prealloc extents as FIEMAP UNWRITTEN |
| fallocate (unknown/extra mode bits) | Unsupported | Must return EOPNOTSUPP (FfsError::UnsupportedFeature) with no partial data/size mutation before rejection |
| Unsupported-path observability | Required | Structured logs include operation_id, scenario_id, outcome, and error_class |
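
The mode handling in the table can be sketched as a pure classification step that runs before any data or size mutation. This is only an illustration: `FallocOp` and `classify_fallocate` are hypothetical names, the flag values are the ones from Linux's `<linux/falloc.h>`, and a raw errno stands in for `FfsError::UnsupportedFeature`.

```rust
// Flag values as defined in Linux's <linux/falloc.h>.
const FALLOC_FL_KEEP_SIZE: u32 = 0x01;
const FALLOC_FL_PUNCH_HOLE: u32 = 0x02;
const FALLOC_FL_COLLAPSE_RANGE: u32 = 0x08;
const FALLOC_FL_ZERO_RANGE: u32 = 0x10;
const FALLOC_FL_INSERT_RANGE: u32 = 0x20;

// Linux errno for "operation not supported".
const EOPNOTSUPP: i32 = 95;

// The flag combinations listed in the table, as match-able constants.
const PUNCH_KEEP: u32 = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
const ZERO_KEEP: u32 = FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE;

/// Hypothetical operation classes, one per supported table row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FallocOp {
    Preallocate { keep_size: bool },
    PunchHole,
    ZeroRange { keep_size: bool },
    CollapseRange,
    InsertRange,
}

/// Classify `mode` without touching any file state, so that an
/// unsupported request is rejected before anything is mutated.
fn classify_fallocate(mode: u32) -> Result<FallocOp, i32> {
    match mode {
        0 => Ok(FallocOp::Preallocate { keep_size: false }),
        FALLOC_FL_KEEP_SIZE => Ok(FallocOp::Preallocate { keep_size: true }),
        PUNCH_KEEP => Ok(FallocOp::PunchHole),
        FALLOC_FL_ZERO_RANGE => Ok(FallocOp::ZeroRange { keep_size: false }),
        ZERO_KEEP => Ok(FallocOp::ZeroRange { keep_size: true }),
        FALLOC_FL_COLLAPSE_RANGE => Ok(FallocOp::CollapseRange),
        FALLOC_FL_INSERT_RANGE => Ok(FallocOp::InsertRange),
        // Unknown or extra mode bits: reject up front.
        _ => Err(EOPNOTSUPP),
    }
}
```

Matching on exact flag combinations, rather than testing individual bits, means any extra bit falls through to the rejection arm.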
See COMPREHENSIVE_SPEC_FOR_FRANKENFS_V1.md for the full normative scope.
- Linux only. FUSE is the sole mount target. No macOS or Windows support planned.
- Nightly Rust required. Edition 2024 features require the nightly toolchain.
- Runtime is still early-stage. Full tracked parity means the current V1 matrix is implemented and tested; it does not mean operational hardening, performance tuning, or future-scope features are finished. Mount/write paths should still be treated as experimental in operational environments.
- Kernel FUSE writeback-cache mode is intentionally unsupported in V1.x. The epoch barrier design is implemented and crash-tested, but `writeback_cache` is not enabled in mount options. `flush` is a non-durability lifecycle hook; `fsync`/`fsyncdir` are the explicit durability boundaries.
- Default CLI mount path does not enable optional backpressure/per-core scheduling hooks. `ffs-cli mount` currently uses the standard `ffs-fuse` mount path without wiring `BackpressureGate` controls.
- Mount background scrub is detection-only in V1.x. `ffs mount` starts `ffs-repair::ScrubDaemon` automatically for default read-only mounts, owns cancellation through the mount lifecycle, and joins the worker on shutdown. Read-write mounts keep the daemon disabled by default; `--background-scrub` can opt into detection-only monitoring, `--no-background-scrub` disables the read-only default, and `--background-scrub-ledger` records evidence JSONL. RaptorQ symbol writes and block repair remain explicit `ffs repair` / `ffs fsck --repair` operations.
- External dependencies. Workspace dependencies currently use crates.io releases (`asupersync = 0.2.5`, `ftui = 0.2.1`); local path overrides can be supplied with Cargo `[patch]` during sibling-repo development.
- Legacy reference corpus is not included. The Linux kernel ext4/btrfs source used for behavioral extraction (~205K lines) is gitignored due to size. The extracted behavioral contracts are fully captured in EXISTING_EXT4_BTRFS_STRUCTURE.md. For the original source, see `git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git` at tag v6.19.
Q: Why reimplement ext4 and btrfs instead of just using them?
A: Kernel filesystems can't be extended with MVCC or self-healing from userspace. FrankenFS is a research vehicle for exploring what ext4/btrfs could look like with modern concurrency control and erasure coding, while remaining mount-compatible with existing images.
Q: Can I mount real ext4/btrfs data with this today?
A: ffs mount supports both ext4 and btrfs images under the fully tracked V1 contract, but the runtime is still experimental operationally. Default behavior is read-only; --rw enables write paths that are still under active hardening. Do not rely on it for production data.
Q: What does "spec-first" mean?
A: Instead of translating C to Rust line by line, we first extract the behavioral contract of each kernel subsystem into specification documents (~400KB of structured Markdown). Then we implement from the spec in idiomatic Rust. This avoids carrying over C-isms and allows architectural improvements.
Q: Why MVCC instead of the existing journal?
A: ext4's JBD2 journal uses a global lock that serializes all writes through a single thread. Block-level MVCC with version chains allows concurrent writers with snapshot isolation. The adaptive conflict policy automatically selects between strict first-committer-wins and safe-merge resolution based on observed contention, minimizing both unnecessary aborts and corruption risk.
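
As a rough illustration of version chains with snapshot isolation, the sketch below keeps every committed version of one block keyed by commit sequence number. `VersionChain`, `read_at`, and `commit` are hypothetical names, and only the strict first-committer-wins rule is shown; FrankenFS's real types and its safe-merge path are not reproduced here.

```rust
use std::collections::BTreeMap;

/// Hypothetical version chain for a single block:
/// commit sequence number -> block contents at that commit.
#[derive(Default)]
struct VersionChain {
    versions: BTreeMap<u64, Vec<u8>>,
}

impl VersionChain {
    /// Snapshot read: return the newest version committed at or
    /// before `snapshot`, ignoring anything committed later.
    fn read_at(&self, snapshot: u64) -> Option<&[u8]> {
        self.versions
            .range(..=snapshot)
            .next_back()
            .map(|(_, data)| data.as_slice())
    }

    /// Strict first-committer-wins: if some other writer committed
    /// this block after our snapshot was taken, abort and report the
    /// conflicting commit sequence. Otherwise append the new version.
    fn commit(&mut self, snapshot: u64, commit_seq: u64, data: Vec<u8>) -> Result<(), u64> {
        if let Some(&latest) = self.versions.keys().next_back() {
            if latest > snapshot {
                return Err(latest);
            }
        }
        self.versions.insert(commit_seq, data);
        Ok(())
    }
}
```

Readers never block writers in this model: a reader holding snapshot 1 keeps seeing the version committed at sequence 1 even after a writer commits sequence 2.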
Q: What are fountain codes / RaptorQ?
A: RaptorQ (RFC 6330) is a fountain code, an erasure coding scheme that generates repair symbols from source data. Given enough symbols, you can recover any lost/corrupted source blocks. FrankenFS stores a configurable overhead of repair symbols per block group (default 5%), enabling automatic corruption recovery without redundant copies. The Bayesian autopilot adjusts overhead based on observed corruption rates.
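
To make the overhead figure concrete, here is a small worked calculation. It assumes one repair symbol per block and rounding up, neither of which is confirmed by the text above; `repair_symbol_count` is a hypothetical helper.

```rust
/// Number of repair symbols for a block group, assuming one symbol
/// per block and rounding up (both are assumptions of this sketch).
fn repair_symbol_count(source_blocks: u64, overhead_percent: u64) -> u64 {
    // Integer ceiling of source_blocks * overhead_percent / 100.
    (source_blocks * overhead_percent + 99) / 100
}
```

For a block group of 32,768 blocks at the default 5% overhead, this gives 1,639 repair symbols (32,768 × 0.05 = 1,638.4, rounded up), compared with 32,768 extra blocks for a full mirror.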
Q: Why forbid(unsafe_code) everywhere?
A: Filesystem bugs in C frequently involve buffer overflows, use-after-free, and uninitialized memory. By forbidding unsafe Rust entirely, we eliminate these categories of bugs at compile time. The performance cost is negligible for a FUSE filesystem (the FUSE protocol is already the bottleneck).
Q: What is "adaptive conflict policy"?
A: When two transactions write the same block, FrankenFS can either abort the later writer (Strict FCW) or merge the writes using a proof that they don't conflict (SafeMerge). The Adaptive policy uses an expected-loss decision model that tracks conflict rate, merge success rate, and abort rate via exponential moving averages, then selects whichever strategy has the lowest expected cost. Under a 120-writer stress test, SafeMerge achieved 9.5x lower expected loss than Strict.
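
A minimal sketch of an expected-loss chooser driven by exponential moving averages follows. The loss formulas, the cost parameters, and the names `Ema`, `Costs`, and `Arbiter` are assumptions made for illustration; the actual FrankenFS decision model may weigh different terms.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Policy {
    Strict,
    SafeMerge,
}

/// Exponential moving average: value = alpha * sample + (1 - alpha) * value.
#[derive(Debug, Clone, Copy)]
struct Ema {
    alpha: f64,
    value: f64,
}

impl Ema {
    fn new(alpha: f64) -> Self {
        Ema { alpha, value: 0.0 }
    }

    fn update(&mut self, sample: f64) {
        self.value = self.alpha * sample + (1.0 - self.alpha) * self.value;
    }
}

/// Hypothetical cost weights, in arbitrary units.
struct Costs {
    abort: f64,         // cost of aborting and retrying a transaction
    merge_attempt: f64, // cost of attempting a merge proof
}

struct Arbiter {
    conflict_rate: Ema,
    merge_success_rate: Ema,
    costs: Costs,
}

impl Arbiter {
    /// Record one committed transaction. `merge_succeeded` is `None`
    /// when no merge was attempted.
    fn observe(&mut self, conflicted: bool, merge_succeeded: Option<bool>) {
        self.conflict_rate.update(if conflicted { 1.0 } else { 0.0 });
        if let Some(ok) = merge_succeeded {
            self.merge_success_rate.update(if ok { 1.0 } else { 0.0 });
        }
    }

    /// Assumed model: Strict pays an abort on every conflict; SafeMerge
    /// pays for a merge attempt and aborts only when the merge fails.
    fn expected_loss(&self, policy: Policy) -> f64 {
        let p = self.conflict_rate.value;
        match policy {
            Policy::Strict => p * self.costs.abort,
            Policy::SafeMerge => {
                let fail = 1.0 - self.merge_success_rate.value;
                p * (self.costs.merge_attempt + fail * self.costs.abort)
            }
        }
    }

    /// Pick the policy with the lower expected loss (ties go to Strict).
    fn choose(&self) -> Policy {
        if self.expected_loss(Policy::SafeMerge) < self.expected_loss(Policy::Strict) {
            Policy::SafeMerge
        } else {
            Policy::Strict
        }
    }
}
```

Under this model SafeMerge wins whenever merges succeed often enough to offset the cost of attempting them, and the chooser falls back to Strict when merges rarely succeed.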
Q: How does self-healing refresh work?
A: Repair symbols become stale when source blocks are modified. FrankenFS supports three refresh triggers: age-only (time since last refresh), block-count (writes since last refresh), and hybrid (whichever fires first). The RefreshLossModel compares all three using expected-loss calculations across workload profiles, and the StaleWindowSlo monitors percentile staleness with configurable breach detection.
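
The three triggers can be sketched as a single enum with one decision method. `RefreshTrigger` and `should_refresh` are hypothetical names chosen for this example, and the thresholds are illustrative rather than FrankenFS defaults.

```rust
use std::time::Duration;

/// Hypothetical refresh trigger, mirroring the three modes described above.
enum RefreshTrigger {
    /// Refresh once the symbols are older than this.
    Age(Duration),
    /// Refresh once this many block writes have accumulated.
    BlockCount(u64),
    /// Refresh when either threshold is reached, whichever fires first.
    Hybrid { max_age: Duration, max_writes: u64 },
}

impl RefreshTrigger {
    /// `age` is the time since the last refresh; `writes` is the number
    /// of block writes since the last refresh.
    fn should_refresh(&self, age: Duration, writes: u64) -> bool {
        match self {
            RefreshTrigger::Age(max_age) => age >= *max_age,
            RefreshTrigger::BlockCount(max_writes) => writes >= *max_writes,
            RefreshTrigger::Hybrid { max_age, max_writes } => {
                age >= *max_age || writes >= *max_writes
            }
        }
    }
}
```

The hybrid mode bounds staleness in both dimensions: a write burst triggers a refresh before the age limit is reached, and an idle block group still gets refreshed once its symbols are old enough.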
| Document | Size | What it covers |
|---|---|---|
| COMPREHENSIVE_SPEC_FOR_FRANKENFS_V1.md | 242KB | Canonical specification, 24 sections covering every subsystem |
| EXISTING_EXT4_BTRFS_STRUCTURE.md | 95KB | Behavioral extraction from Linux kernel ext4/btrfs source |
| PLAN_TO_PORT_FRANKENFS_TO_RUST.md | 69KB | 9-phase porting roadmap with scope and acceptance criteria |
| PROPOSED_ARCHITECTURE.md | 18KB | 21-crate architecture, trait hierarchy, data flow |
| FEATURE_PARITY.md | 3KB | Quantitative implementation coverage tracking |
| AGENTS.md | 10KB | Guidelines for AI coding agents working in this codebase |
| Document | What it covers |
|---|---|
| design-writeback-cache-mvcc.md | FUSE writeback-cache reordering model, 6 formal invariants, epoch fence state machine, expected-loss decision matrix |
| design-safe-merge-taxonomy.md | Safe-merge proof obligations for concurrent block writes |
| design-adaptive-refresh.md | Expected-loss model for age-only vs block-count vs hybrid refresh triggers |
| design-multi-host-repair.md | Optimistic lease-based repair ownership for shared storage |
FrankenFS was designed by extracting behavior from Linux kernel v6.19 filesystem source (~205K lines of C):
- ext4: superblock, inode, extent tree, journal (JBD2), block allocation (mballoc)
- btrfs: B-tree, transaction, delayed refs, scrub, extent allocation
The original kernel source is not included (gitignored due to size); for reference, use git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git at tag v6.19. All extracted behavioral contracts are captured in EXISTING_EXT4_BTRFS_STRUCTURE.md (95KB).
Please don't take this the wrong way, but I do not accept outside contributions for any of my projects. I simply don't have the mental bandwidth to review anything, and it's my name on the thing, so I'm responsible for any problems it causes; thus, the risk-reward is highly asymmetric from my perspective. I'd also have to worry about other "stakeholders," which seems unwise for tools I mostly make for myself for free. Feel free to submit issues, and even PRs if you want to illustrate a proposed fix, but know I won't merge them directly. Instead, I'll have Claude or Codex review submissions via gh and independently decide whether and how to address them. Bug reports in particular are welcome. Sorry if this offends, but I want to avoid wasted time and hurt feelings. I understand this isn't in sync with the prevailing open-source ethos that seeks community contributions, but it's the only way I can move at this velocity and keep my sanity.
