Skip to content

Explorations

0trust3r edited this page Apr 3, 2025 · 5 revisions

This page houses various architecture explorations.

DB

We need to settle on a data architecture and database engine. Our most complex requirement is that we must be able to reconstruct the state for any given timeslot in the past 24 hours, and track multiple states across several forks.

PVM Invocations by Activity

Activity is defined here as a role that our validator plays in the network. During normal validation operations, we can expect to play all of the following roles defined below. All of these roles span both the conductor & pvm, although the contexts in which the pvm is invoked will vary. The invocation associations are detailed in the table below.

Producing blocks doesn't use accumulate or on-transfer contexts, since we use the same importing blocks capability regardless of if we produced the block

Activity is-authorized refinement accumulate on-transfer
importing blocks X X
producing blocks X X
refining X X
auditing X X

Data Access by Activity

State Item importing blocks producing blocks refining auditing
recent block details rw r r
safrole rw r
service accounts rw r r r
entropy rw r r
future validators rw
current validators rw
previous validators rw
pending reports rw r
auth pool rw r r r
slot rw r
auth queue rw r
disputes rw
privileged services rw r
validator activity stats
accumulation queue rw r
accumulation history rw r r
root rw r r
reporters rw
available reports rw r
accumulation root rw

State Accessor

We can use a state accessor as a sidecar to our activity processes. The accessor provides a way to read & write to state from our DB (and fallback to reading from one of our DA systems on a db-miss). When we receive an item from a DA sysem, we write the item to the db after passing it back to the caller. Since updates to state happen as part of block importing, we should be able to avoid race conditions by using the timeslot state as a locking mechanism.

Architecture

DB Comparison

Our DB must be fast, have excellent support for both python & rust clients, be battle tested, and expected to stand the test of time. Ideally it will support some form of delta-based versioning as a first-class citizen, which would significantly reduce the overhead required to support following multiple forks. Below is a comparison of our top db choices, along with a ranking for which ones we will pursue for the state accessor poc.

DB rollback/versioning? design language contributors (non-trivial) gh-stars ranking
rocks checkpoints lsm C++ 12 29k 1
redis snapshot & AOF in-mem structs C 12 68k 2
dragonfly snapshot but no AOF in-mem structs C++ 11 27k NA
foundation backup & snapshot b-tree C++ 20 15k NA
lmdb copy file b-tree C ? NA ?
libmdbx copy file b-tree C/C++ 2 1k ?
sled ? b-tree Rust 3 8k ?

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

pebble was evaluated but not considered after some initial reseach. It's missing the checkpoints capability that rocksdb offers. It also seems opinionated towards cockroachdbs needs.

leveldb was evaluated but not considered since it's "receiving very limited maintenance"

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. This repository is receiving very limited maintenance. We will only review the following types of changes

redis seems like a natural choice, but its AOF is limited, and has more overhead when compared to rocks snapshotting capabilities... LSM naturally enables more advanced snapshot abilities.

Redis is often referred to as a data structures server. What this means is that Redis provides access to mutable data structures via a set of commands, which are sent using a server-client model with TCP sockets and a simple protocol. So different processes can query and modify the same data structures in a shared way.

Faster than Redis when using multiple cores, but doesn't support AOF.

Fully compatible with Redis and Memcached APIs, Dragonfly requires no code changes to adopt. Compared to legacy in-memory datastores, Dragonfly delivers 25X more throughput, higher cache hit rates with lower tail latency, and can run on up to 80% less resources for the same sized workload.

Foundations b-tree architecture seems like it would add significant overhead compared to LSMs. It also doesn't support delta-based snapshotting.

FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers. It organizes data as an ordered key-value store and employs ACID transactions for all operations. It is especially well-suited for read/write workloads but also has excellent performance for write-intensive workloads. Users interact with the database using API language binding.

LMDB with heed

  • simple & performant embedded db based on the legendary bdb
  • lightning-fast b+ tree & native mmap implementation
  • resilient & doesn't need tuning
  • fork of LMDB with more features/enhancements
  • more actively maintained
  • LSM tree-like write performance with b+ tree-like read performance
  • less actively maintained but looks solid
  • unstable api (still in beta)

RocksDB POC

There are many deprecated python clients for rocksdb, but thankfully there's one library that seems really solid- rocksdict. It supports everything we need, including a bytes only mode. There are two main aspects tested in this exploration: how performant is rocksdict under load, and how does it handle forking from checkpoints? We explore these two concerns below. The code for these tests can be found under the explorations project in the tram repo.

Performance

This exloration aims to develop an understanding of how rocksdict performs under load. We simulate load by performing writes across a number of slots in a simulated slot series. We simulate 20k writes per slot, each one storing a random 1kB blob. This nets Approximately 20MB of random blob data being written per slot. In order to keep parity with our checkpoint usage, we create a checkpoint after each slot. After the load test, we iterate back through the checkpoints to verify none of the data was corrupted or lost. The average time per slot is around 1s. The db-internal file sizes are listed below (smaller files are kept out for brevity). In total, about 140 files were created for 10 slots, and the file sizes per slot line up approximately with our 20MB write throughput. The rocks perf exploration code can be found here

8.0K	./state/9/OPTIONS-000102
8.0K	./state/genesis/MANIFEST-000005
20K	./rocks-file-sizes.log
24K	./state/0/LOG
24K	./state/1/LOG
24K	./state/2/LOG
24K	./state/3/LOG
24K	./state/4/LOG
24K	./state/5/LOG
24K	./state/6/LOG
24K	./state/7/LOG
24K	./state/8/LOG
24K	./state/9/LOG
324K	./state/genesis/LOG
1.4M	./state/0/000015.sst
1.4M	./state/1/000024.sst
1.4M	./state/2/000033.sst
1.4M	./state/3/000042.sst
1.4M	./state/4/000051.sst
1.4M	./state/5/000060.sst
1.4M	./state/6/000069.sst
1.4M	./state/7/000078.sst
1.4M	./state/8/000087.sst
1.4M	./state/9/000096.sst
19M	./state/0/000009.sst
19M	./state/0/000011.sst
19M	./state/0/000013.sst
19M	./state/1/000018.sst
19M	./state/1/000020.sst
19M	./state/1/000022.sst
19M	./state/2/000027.sst
19M	./state/2/000029.sst
19M	./state/2/000031.sst
19M	./state/3/000036.sst
19M	./state/3/000038.sst
19M	./state/3/000040.sst
19M	./state/4/000045.sst
19M	./state/4/000047.sst
19M	./state/4/000049.sst
19M	./state/5/000054.sst
19M	./state/5/000056.sst
19M	./state/5/000058.sst
19M	./state/6/000063.sst
19M	./state/6/000065.sst
19M	./state/6/000067.sst
19M	./state/7/000072.sst
19M	./state/7/000074.sst
19M	./state/7/000076.sst
19M	./state/8/000081.sst
19M	./state/8/000083.sst
19M	./state/8/000085.sst
19M	./state/9/000090.sst
19M	./state/9/000092.sst
19M	./state/9/000094.sst
58M	./state/1/000016.sst
58M	./state/2/000025.sst
58M	./state/3/000034.sst
58M	./state/4/000043.sst
58M	./state/5/000052.sst
58M	./state/6/000061.sst
58M	./state/8/000079.sst
58M	./state/9/000088.sst
58M	./state/genesis/000097.sst
59M	./state/7/000070.sst

Checkpoints

This exploration aims to explore how rocksdb checkpoints can be used to support our forking requirements. We need to be able to follow multiple forks, each with distinct states. For simulating this, we load a tree into rocks. The tree structure is outlined below. Arrows are numbered indicating the order of node creation. A key finding of this exploration is that rocks holds a db lock on the original db that a client was opened against. This means we can't have more than one fork sourced from any single block. As a workaround, we should be able to walk back from the locked block until we find an unlocked block, and re-process blocks until we reach & follow the fork. It's worth noting that for this to occurr in practice, it would require three validators to produce a block building on the same parent. The checkpoint exploration code can be found here.

processing benny
keys: [b'alfred', b'benny']

processing chase
keys: [b'alfred', b'benny', b'chase']

processing dian
keys: [b'alfred', b'benny', b'chase', b'dian']

processing elly
keys: [b'alfred', b'benny', b'chase', b'dian', b'elly']

processing boris
keys: [b'alfred', b'boris']

processing claire
keys: [b'alfred', b'boris', b'claire']

processing chris
keys: [b'alfred', b'boris', b'chris']

processing diaz
keys: [b'alfred', b'boris', b'chris', b'diaz']

processing cory
IO error: lock hold by current process, acquire time 1741790751 acquiring thread 8631324480: ./tree/boris/LOCK: No locks available

Clone this wiki locally