-
Notifications
You must be signed in to change notification settings - Fork 0
Explorations
This page houses various architecture explorations.
We need to settle on a data architecture and database engine. Our most complex requirement is that we must be able to reconstruct the state for any given timeslot in the past 24 hours, and track multiple states across several forks.
Activity is defined here as a role that our validator plays in the network. During normal validation operations, we can expect to play all of the following roles defined below. All of these roles span both the conductor & pvm, although the contexts in which the pvm is invoked will vary. The invocation associations are detailed in the table below.
Producing blocks doesn't use accumulate or on-transfer contexts, since we use the same importing blocks capability regardless of if we produced the block
| Activity | is-authorized | refinement | accumulate | on-transfer |
|---|---|---|---|---|
| importing blocks | X | X | ||
| producing blocks | X | X | ||
| refining | X | X | ||
| auditing | X | X |
| State Item | importing blocks | producing blocks | refining | auditing |
|---|---|---|---|---|
| recent block details | rw | r | r | |
| safrole | rw | r | ||
| service accounts | rw | r | r | r |
| entropy | rw | r | r | |
| future validators | rw | |||
| current validators | rw | |||
| previous validators | rw | |||
| pending reports | rw | r | ||
| auth pool | rw | r | r | r |
| slot | rw | r | ||
| auth queue | rw | r | ||
| disputes | rw | |||
| privileged services | rw | r | ||
| validator activity stats | ||||
| accumulation queue | rw | r | ||
| accumulation history | rw | r | r | |
| root | rw | r | r | |
| reporters | rw | |||
| available reports | rw | r | ||
| accumulation root | rw |
We can use a state accessor as a sidecar to our activity processes. The accessor provides a way to read & write to state from our DB (and fallback to reading from one of our DA systems on a db-miss). When we receive an item from a DA sysem, we write the item to the db after passing it back to the caller. Since updates to state happen as part of block importing, we should be able to avoid race conditions by using the timeslot state as a locking mechanism.

Our DB must be fast, have excellent support for both python & rust clients, be battle tested, and expected to stand the test of time. Ideally it will support some form of delta-based versioning as a first-class citizen, which would significantly reduce the overhead required to support following multiple forks. Below is a comparison of our top db choices, along with a ranking for which ones we will pursue for the state accessor poc.
| DB | rollback/versioning? | design | language | contributors (non-trivial) | gh-stars | ranking |
|---|---|---|---|---|---|---|
| rocks | checkpoints | lsm | C++ | 12 | 29k | 1 |
| redis | snapshot & AOF | in-mem structs | C | 12 | 68k | 2 |
| dragonfly | snapshot but no AOF | in-mem structs | C++ | 11 | 27k | NA |
| foundation | backup & snapshot | b-tree | C++ | 20 | 15k | NA |
| lmdb | copy file | b-tree | C | ? | NA | ? |
| libmdbx | copy file | b-tree | C/C++ | 2 | 1k | ? |
| sled | ? | b-tree | Rust | 3 | 8k | ? |
This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.
pebble was evaluated but not considered after some initial reseach. It's missing the checkpoints capability that rocksdb offers. It also seems opinionated towards cockroachdbs needs.
leveldb was evaluated but not considered since it's "receiving very limited maintenance"
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. This repository is receiving very limited maintenance. We will only review the following types of changes
redis seems like a natural choice, but its AOF is limited, and has more overhead when compared to rocks snapshotting capabilities... LSM naturally enables more advanced snapshot abilities.
Redis is often referred to as a data structures server. What this means is that Redis provides access to mutable data structures via a set of commands, which are sent using a server-client model with TCP sockets and a simple protocol. So different processes can query and modify the same data structures in a shared way.
Faster than Redis when using multiple cores, but doesn't support AOF.
Fully compatible with Redis and Memcached APIs, Dragonfly requires no code changes to adopt. Compared to legacy in-memory datastores, Dragonfly delivers 25X more throughput, higher cache hit rates with lower tail latency, and can run on up to 80% less resources for the same sized workload.
Foundations b-tree architecture seems like it would add significant overhead compared to LSMs. It also doesn't support delta-based snapshotting.
FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers. It organizes data as an ordered key-value store and employs ACID transactions for all operations. It is especially well-suited for read/write workloads but also has excellent performance for write-intensive workloads. Users interact with the database using API language binding.
- simple & performant embedded db based on the legendary bdb
- lightning-fast b+ tree & native mmap implementation
- resilient & doesn't need tuning
- fork of LMDB with more features/enhancements
- more actively maintained
- LSM tree-like write performance with b+ tree-like read performance
- less actively maintained but looks solid
- unstable api (still in beta)
There are many deprecated python clients for rocksdb, but thankfully there's one library that seems really solid- rocksdict. It supports everything we need, including a bytes only mode. There are two main aspects tested in this exploration: how performant is rocksdict under load, and how does it handle forking from checkpoints? We explore these two concerns below. The code for these tests can be found under the explorations project in the tram repo.
This exloration aims to develop an understanding of how rocksdict performs under load. We simulate load by performing writes across a number of slots in a simulated slot series. We simulate 20k writes per slot, each one storing a random 1kB blob. This nets Approximately 20MB of random blob data being written per slot. In order to keep parity with our checkpoint usage, we create a checkpoint after each slot. After the load test, we iterate back through the checkpoints to verify none of the data was corrupted or lost. The average time per slot is around 1s. The db-internal file sizes are listed below (smaller files are kept out for brevity). In total, about 140 files were created for 10 slots, and the file sizes per slot line up approximately with our 20MB write throughput. The rocks perf exploration code can be found here
8.0K ./state/9/OPTIONS-000102
8.0K ./state/genesis/MANIFEST-000005
20K ./rocks-file-sizes.log
24K ./state/0/LOG
24K ./state/1/LOG
24K ./state/2/LOG
24K ./state/3/LOG
24K ./state/4/LOG
24K ./state/5/LOG
24K ./state/6/LOG
24K ./state/7/LOG
24K ./state/8/LOG
24K ./state/9/LOG
324K ./state/genesis/LOG
1.4M ./state/0/000015.sst
1.4M ./state/1/000024.sst
1.4M ./state/2/000033.sst
1.4M ./state/3/000042.sst
1.4M ./state/4/000051.sst
1.4M ./state/5/000060.sst
1.4M ./state/6/000069.sst
1.4M ./state/7/000078.sst
1.4M ./state/8/000087.sst
1.4M ./state/9/000096.sst
19M ./state/0/000009.sst
19M ./state/0/000011.sst
19M ./state/0/000013.sst
19M ./state/1/000018.sst
19M ./state/1/000020.sst
19M ./state/1/000022.sst
19M ./state/2/000027.sst
19M ./state/2/000029.sst
19M ./state/2/000031.sst
19M ./state/3/000036.sst
19M ./state/3/000038.sst
19M ./state/3/000040.sst
19M ./state/4/000045.sst
19M ./state/4/000047.sst
19M ./state/4/000049.sst
19M ./state/5/000054.sst
19M ./state/5/000056.sst
19M ./state/5/000058.sst
19M ./state/6/000063.sst
19M ./state/6/000065.sst
19M ./state/6/000067.sst
19M ./state/7/000072.sst
19M ./state/7/000074.sst
19M ./state/7/000076.sst
19M ./state/8/000081.sst
19M ./state/8/000083.sst
19M ./state/8/000085.sst
19M ./state/9/000090.sst
19M ./state/9/000092.sst
19M ./state/9/000094.sst
58M ./state/1/000016.sst
58M ./state/2/000025.sst
58M ./state/3/000034.sst
58M ./state/4/000043.sst
58M ./state/5/000052.sst
58M ./state/6/000061.sst
58M ./state/8/000079.sst
58M ./state/9/000088.sst
58M ./state/genesis/000097.sst
59M ./state/7/000070.sst
This exploration aims to explore how rocksdb checkpoints can be used to support our forking requirements. We need to be able to follow multiple forks, each with distinct states. For simulating this, we load a tree into rocks. The tree structure is outlined below. Arrows are numbered indicating the order of node creation. A key finding of this exploration is that rocks holds a db lock on the original db that a client was opened against. This means we can't have more than one fork sourced from any single block. As a workaround, we should be able to walk back from the locked block until we find an unlocked block, and re-process blocks until we reach & follow the fork. It's worth noting that for this to occurr in practice, it would require three validators to produce a block building on the same parent. The checkpoint exploration code can be found here.

processing benny
keys: [b'alfred', b'benny']
processing chase
keys: [b'alfred', b'benny', b'chase']
processing dian
keys: [b'alfred', b'benny', b'chase', b'dian']
processing elly
keys: [b'alfred', b'benny', b'chase', b'dian', b'elly']
processing boris
keys: [b'alfred', b'boris']
processing claire
keys: [b'alfred', b'boris', b'claire']
processing chris
keys: [b'alfred', b'boris', b'chris']
processing diaz
keys: [b'alfred', b'boris', b'chris', b'diaz']
processing cory
IO error: lock hold by current process, acquire time 1741790751 acquiring thread 8631324480: ./tree/boris/LOCK: No locks available