Block-based immutable list implementation (GSoC proposal) #809

Open
Zayd-R wants to merge 5 commits into typelevel:master from Zayd-R:immutable-blocked-list-proposal
Zayd-R commented Mar 23, 2026

Summary

This is an early-stage implementation of the block-based immutable list
proposed in #634, submitted as part of GSoC work. The goal of this PR is to share the implementation and benchmark data to explore the design space before committing to a final approach.

Two implementations explored

BlockedList — copy-on-write
Every prepend into dead space copies the valid portion of the block before
writing. Fully persistent and safe for branching use cases.

FastBlockedList — write-direct
Prepend writes directly into dead space (offset - 1) without copying,
since that slot has never been pointed at by any existing node and is
invisible to all observers. When the block is full, a fresh block is
allocated and the current node becomes the tail.

Both implementations store BlockSize per node so that the benchmarks can
try several block sizes without recompilation.
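
To make the difference between the two prepend strategies concrete, here is a minimal sketch. It is not the code in this PR: `PrependSketch`, `Node`, `single`, `prependCow`, and `prependDirect` are hypothetical names, elements are fixed to `Int`, and a node is assumed to view the slots from `offset` to the end of its block.

```scala
// Hypothetical sketch, not the PR's classes.
object PrependSketch {
  // A node views block(offset until block.length); slots below offset are dead space.
  final case class Node(block: Array[Int], offset: Int, tail: Option[Node]) {
    def head: Int = block(offset)
    def toList: List[Int] =
      block.slice(offset, block.length).toList ++ tail.fold(List.empty[Int])(_.toList)
  }

  // A fresh block holding one element in its last slot.
  def single(x: Int, blockSize: Int): Node = {
    val b = new Array[Int](blockSize)
    b(blockSize - 1) = x
    Node(b, blockSize - 1, None)
  }

  // Copy-on-write: copy the valid portion into a new array, then write the new head.
  def prependCow(x: Int, n: Node): Node =
    if (n.offset == 0) single(x, n.block.length).copy(tail = Some(n))
    else {
      val b = new Array[Int](n.block.length)
      System.arraycopy(n.block, n.offset, b, n.offset, n.block.length - n.offset)
      b(n.offset - 1) = x
      Node(b, n.offset - 1, n.tail)
    }

  // Write-direct: write into the shared array at offset - 1, with no copy.
  def prependDirect(x: Int, n: Node): Node =
    if (n.offset == 0) single(x, n.block.length).copy(tail = Some(n))
    else {
      n.block(n.offset - 1) = x
      Node(n.block, n.offset - 1, n.tail)
    }
}
```

In this sketch the copy-on-write version pays one `System.arraycopy` per prepend, while the write-direct version does a single array store. Note that the sketch has no guard recording whether a dead slot was already claimed, so two `prependDirect` calls on the same node write the same slot and the second overwrites the first. Whether FastBlockedList guards against that is not stated in this description.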

Results

All times in ns/op. Lower is better.

prepend (build a list of 10k elements from empty)

blockSize   BlockedList (cow)   FastBlockedList   scala.List
4           138,526 ± 4,810     52,328 ± 218      40,419 ± 114
8           167,084 ± 1,601     54,333 ± 228      40,460 ± 296
16          193,820 ± 6,885     56,385 ± 341      39,556 ± 100
32          246,475 ± 2,368     54,803 ± 337      39,600 ± 311
64          378,280 ± 74,722    53,403 ± 226      39,592 ± 164

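Copy-on-write prepend gets slower as blockSize grows (about 2.7x from
blockSize=4 to blockSize=64) because every prepend pays an arraycopy over
the filled part of the block. Write-direct prepend is flat across block
sizes and roughly 30 to 40% slower than scala.List.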

foreach (visit every element)

blockSize   BlockedList (cow)   FastBlockedList   scala.List
4           15,149 ± 652        13,001 ± 808      22,114 ± 259
8           9,592 ± 283         10,498 ± 504      22,076 ± 322
16          8,442 ± 154         8,379 ± 137       21,762 ± 147
32          6,960 ± 128         6,864 ± 231       21,880 ± 91
64          5,786 ± 44          5,688 ± 40        21,685 ± 157

Both implementations beat scala.List by about 3.8x at blockSize=64. This
is the cache locality benefit the proposal predicted: larger blocks mean
longer tight array loops with fewer pointer jumps.
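
A rough illustration of why larger blocks reduce pointer chasing, under the assumption that a list is a chain of nodes each holding one array. `TraversalSketch`, `Node`, `build`, `foreach`, and `blockCount` are hypothetical names, not the PR's API, and `build` fills every block from slot 0, whereas a list built by prepending would typically have a partially filled first block.

```scala
// Hypothetical sketch, not the PR's classes.
object TraversalSketch {
  final case class Node(block: Array[Int], offset: Int, tail: Option[Node])

  // Chunk the input into blocks of at most blockSize elements.
  def build(xs: Vector[Int], blockSize: Int): Option[Node] =
    xs.grouped(blockSize).toList.foldRight(Option.empty[Node]) { (chunk, tail) =>
      Some(Node(chunk.toArray, 0, tail))
    }

  def foreach(list: Option[Node])(f: Int => Unit): Unit = {
    var cur = list
    while (cur.isDefined) {
      val n = cur.get
      var i = n.offset
      // Tight loop over contiguous array slots.
      while (i < n.block.length) { f(n.block(i)); i += 1 }
      // One pointer jump per block rather than one per element.
      cur = n.tail
    }
  }

  // Number of nodes, i.e. the number of pointer jumps a full traversal makes.
  def blockCount(list: Option[Node]): Int = {
    var cur = list
    var count = 0
    while (cur.isDefined) { count += 1; cur = cur.get.tail }
    count
  }
}
```

For 10,000 elements at blockSize=64 this layout has 157 nodes to follow, compared with 10,000 cons cells in a singly linked list.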

foldLeft (sum all elements)

blockSize   BlockedList (cow)   FastBlockedList   scala.List
4           39,273 ± 654        32,715 ± 2,828    28,207 ± 972
8           38,107 ± 941        29,910 ± 4,056    27,148 ± 174
16          36,720 ± 302        32,043 ± 3,591    27,969 ± 589
32          36,689 ± 445        30,048 ± 3,890    27,263 ± 294
64          34,953 ± 457        28,918 ± 4,981    28,865 ± 261

FastBlockedList.foldLeft ties scala.List at blockSize=64
(28,918 vs 28,865 ns/op), while the copy-on-write BlockedList stays roughly
20 to 40% slower. The larger error margins for FastBlockedList suggest JIT
variance; more iterations would tighten these numbers.

uncons (element-by-element traversal)

blockSize   BlockedList (cow)   FastBlockedList   scala.List
4           82,950 ± 6,042      70,404 ± 488      16,328 ± 202
8           84,576 ± 8,152      76,493 ± 545      16,232 ± 97
16          82,985 ± 14,101     74,295 ± 313      16,284 ± 144
32          83,485 ± 17,600     73,708 ± 1,606    16,470 ± 171
64          83,291 ± 15,568     72,358 ± 409      16,255 ± 208

uncons is roughly 4 to 5x slower than scala.List, as expected: each call
allocates one Some and one Tuple2. As noted in the proposal, uncons is not
the intended traversal API. The foreach/foldLeft results above are
the relevant comparison.
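
The allocation cost described above can be seen in a small sketch of an uncons-driven traversal. The names `UnconsSketch`, `Node`, `uncons`, and `sumViaUncons` are hypothetical, and the extra `Node` view allocated while staying inside a block is an assumption of this sketch, not something the description states about the PR's code.

```scala
import scala.annotation.tailrec

// Hypothetical sketch, not the PR's classes.
object UnconsSketch {
  final case class Node(block: Array[Int], offset: Int, tail: Option[Node])

  // Each successful call allocates a Some and a Tuple2. In this sketch it also
  // allocates a new Node view whenever the rest stays inside the same block.
  def uncons(list: Option[Node]): Option[(Int, Option[Node])] =
    list.map { n =>
      val rest =
        if (n.offset + 1 < n.block.length) Some(Node(n.block, n.offset + 1, n.tail))
        else n.tail
      (n.block(n.offset), rest)
    }

  // Element-by-element traversal: one uncons call per element.
  def sumViaUncons(list: Option[Node]): Long = {
    @tailrec
    def go(cur: Option[Node], acc: Long): Long = uncons(cur) match {
      case Some((h, t)) => go(t, acc + h)
      case None         => acc
    }
    go(list, 0L)
  }
}
```

A foreach or foldLeft over the same structure can walk each array in place, so it avoids these per-element allocations, which fits the gap between the uncons table and the foreach table.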

map (apply a function to every element)

blockSize   BlockedList        scala.List
4           57,963 ± 1,036     55,880 ± 2,542
8           46,509 ± 9,182     58,043 ± 3,081
16          40,872 ± 5,839     58,697 ± 686
32          32,573 ± 357       58,086 ± 1,057
64          30,755 ± 831       58,367 ± 534

BlockedList.map takes ~47% less time than scala.List at blockSize=64 (30,755 ns vs 58,367 ns).
The improvement grows with blockSize, since larger blocks mean more elements processed
per block, which is consistent with the cache locality advantage.
scala.List is flat across all rows, as expected, since it has no block structure.

Key findings

  • foreach validates the proposal's cache locality claim: about 3.8x faster
    than scala.List at blockSize=64 for both implementations
  • FastBlockedList.foldLeft ties scala.List at blockSize=64; the
    copy-on-write version is roughly 20 to 40% slower
  • Copy-on-write prepend is not practical at large block sizes because
    its arraycopy cost grows with blockSize
  • Write-direct prepend is flat across block sizes and roughly 30 to 40%
    slower than scala.List
  • blockSize=32 or 64 appears optimal for bulk traversal operations
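
The ratios behind these findings can be rechecked directly from the table means. The snippet below is only arithmetic on the numbers reported above; `RatioSketch` and `slowdown` are names made up for this illustration, and error margins are ignored.

```scala
// Arithmetic on the reported means (ns/op); error margins are ignored.
object RatioSketch {
  // How many times larger a is than b.
  def slowdown(a: Double, b: Double): Double = a / b

  // foreach at blockSize=64: scala.List time divided by each implementation's time.
  val foreachVsCow: Double  = slowdown(21685, 5786)
  val foreachVsFast: Double = slowdown(21685, 5688)

  // prepend: FastBlockedList time divided by scala.List time, for blockSize 4 to 64.
  val prependFastVsList: Seq[Double] = Seq(
    slowdown(52328, 40419),
    slowdown(54333, 40460),
    slowdown(56385, 39556),
    slowdown(54803, 39600),
    slowdown(53403, 39592)
  )

  // map at blockSize=64: fraction of scala.List's time that BlockedList saves.
  val mapTimeSaved: Double = 1.0 - 30755.0 / 58367.0
}
```

Evaluating these gives a foreach speed-up a little under 4x, a write-direct prepend slowdown between about 1.3x and 1.4x, and a map saving of about 47%.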

Benchmark methodology

Tool: JMH (Java Microbenchmark Harness) via sbt-jmh plugin
Mode: Average time (AverageTime)
Units: nanoseconds per operation (ns/op) — lower is better
Warmup: 5 iterations
Measurement: 10 iterations
Forks: 1
Threads: 1

Environment:
JVM: OpenJDK 25.0.2 (2026-01-20, LTS)
CPU: Intel Core 5 210H
RAM: 16 GB
OS: Ubuntu 22.04

Lists are pre-built in @Setup(Level.Trial) so construction cost is
excluded from traversal measurements. The benchmark suite is included
in bench/src/main/scala/cats/bench/BlockedListBenchmark.scala and
can be reproduced with:

sbt "bench/jmh:run -i 10 -wi 5 -f 1 -t 1 .*BlockedList.*"

Questions

  • Is the write-direct approach in FastBlockedList acceptable, or should only the copy-on-write version be pursued?

Transparency note

English is not my first language. I used an LLM to help
with grammar and formatting in this PR description, and to generate the
initial benchmark boilerplate code. All implementation decisions, the
identification of bugs, the analysis of benchmark results, and the core
data structure logic were worked out by me. The AI was used as a writing
and tooling aid, not as a substitute for understanding.

Introduces BlockedList (copy-on-write) and BlockedLostCopy (write-direct)
as proposed in typelevel#634. Includes JMH benchmarks comparing both
implementations against scala.List across prepend, uncons, foldLeft,
and foreach.
@Zayd-R Zayd-R marked this pull request as ready for review March 23, 2026 15:20
Zayd-R (Author) commented Mar 23, 2026

I just noticed I named the implementation that writes directly with a Copy suffix. The name was only meant to differentiate it from my original copy-on-write implementation. Sorry for the confusion.

gemelen (Collaborator) commented Mar 24, 2026

@Zayd-R thank you for working on this.

There are a few things that I'd like you to fix in your changeset:

  • revisit your description about the PR, fix the typos and misnames (like BloackedLoistCopy, etc)
  • provide a description of the benchmarks: what tools you used, what the methodology is, how to repeat the measurements, and what the units are in the results you provided (time, space, op/s, etc.)
  • please, fix the issue raised by the CI on the missing headers
