Explorations

This page houses various architecture explorations.

DB

We need to settle on a data architecture and database engine. Our most complex requirement is that we must be able to reconstruct the state for any given timeslot in the past 24 hours, and track multiple states across several forks.

PVM Invocations by Activity

Activity is defined here as a role that our validator plays in the network. During normal validation operations, we can expect to play all of the following roles defined below. All of these roles span both the conductor & pvm, although the contexts in which the pvm is invoked will vary. The invocation associations are detailed in the table below.

Producing blocks doesn't use accumulate or on-transfer contexts, since we use the same importing blocks capability regardless of if we produced the block

Activity	is-authorized	refinement	accumulate	on-transfer
importing blocks			X	X
producing blocks	X	X
refining	X	X
auditing	X	X

Data Access by Activity

State Item	importing blocks	producing blocks	refining	auditing
recent block details	rw	r		r
safrole	rw	r
service accounts	rw	r	r	r
entropy	rw	r		r
future validators	rw
current validators	rw
previous validators	rw
pending reports	rw	r
auth pool	rw	r	r	r
slot	rw	r
auth queue	rw	r
disputes	rw
privileged services	rw	r
validator activity stats
accumulation queue	rw	r
accumulation history	rw	r		r
root	rw	r		r
reporters	rw
available reports	rw	r
accumulation root	rw

State Accessor

We can use a state accessor as a sidecar to our activity processes. The accessor provides a way to read & write to state from our DB (and fallback to reading from one of our DA systems on a db-miss). When we receive an item from a DA sysem, we write the item to the db after passing it back to the caller. Since updates to state happen as part of block importing, we should be able to avoid race conditions by using the timeslot state as a locking mechanism.

Architecture

DB Comparison

Our DB must be fast, have excellent support for both python & rust clients, be battle tested, and expected to stand the test of time. Ideally it will support some form of delta-based versioning as a first-class citizen, which would significantly reduce the overhead required to support following multiple forks. Below is a comparison of our top db choices, along with a ranking for which ones we will pursue for the state accessor poc.

DB	rollback/versioning?	design	language	contributors (non-trivial)	gh-stars	ranking
rocks	checkpoints	lsm	C++	12	29k	1
redis	snapshot & AOF	in-mem structs	C	12	68k	2
dragonfly	snapshot but no AOF	in-mem structs	C++	11	27k	NA
foundation	backup & snapshot	b-tree	C++	20	15k	NA
lmdb	copy file	b-tree	C	?	NA	?
libmdbx	copy file	b-tree	C/C++	2	1k	?
sled	?	b-tree	Rust	3	8k	?

Rocks

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Pebble

pebble was evaluated but not considered after some initial reseach. It's missing the checkpoints capability that rocksdb offers. It also seems opinionated towards cockroachdbs needs.

Level

leveldb was evaluated but not considered since it's "receiving very limited maintenance"

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. This repository is receiving very limited maintenance. We will only review the following types of changes

Redis

redis seems like a natural choice, but its AOF is limited, and has more overhead when compared to rocks snapshotting capabilities... LSM naturally enables more advanced snapshot abilities.

Redis is often referred to as a data structures server. What this means is that Redis provides access to mutable data structures via a set of commands, which are sent using a server-client model with TCP sockets and a simple protocol. So different processes can query and modify the same data structures in a shared way.

Dragonfly

Faster than Redis when using multiple cores, but doesn't support AOF.

Fully compatible with Redis and Memcached APIs, Dragonfly requires no code changes to adopt. Compared to legacy in-memory datastores, Dragonfly delivers 25X more throughput, higher cache hit rates with lower tail latency, and can run on up to 80% less resources for the same sized workload.

Foundation

Foundations b-tree architecture seems like it would add significant overhead compared to LSMs. It also doesn't support delta-based snapshotting.

FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers. It organizes data as an ordered key-value store and employs ACID transactions for all operations. It is especially well-suited for read/write workloads but also has excellent performance for write-intensive workloads. Users interact with the database using API language binding.

LMDB with heed

simple & performant embedded db based on the legendary bdb
lightning-fast b+ tree & native mmap implementation
resilient & doesn't need tuning

libmdbx

fork of LMDB with more features/enhancements
more actively maintained

sled

LSM tree-like write performance with b+ tree-like read performance
less actively maintained but looks solid
unstable api (still in beta)

RocksDB POC

There are many deprecated python clients for rocksdb, but thankfully there's one library that seems really solid- rocksdict. It supports everything we need, including a bytes only mode. There are two main aspects tested in this exploration: how performant is rocksdict under load, and how does it handle forking from checkpoints? We explore these two concerns below. The code for these tests can be found under the explorations project in the tram repo.

Performance

This exloration aims to develop an understanding of how rocksdict performs under load. We simulate load by performing writes across a number of slots in a simulated slot series. We simulate 20k writes per slot, each one storing a random 1kB blob. This nets Approximately 20MB of random blob data being written per slot. In order to keep parity with our checkpoint usage, we create a checkpoint after each slot. After the load test, we iterate back through the checkpoints to verify none of the data was corrupted or lost. The average time per slot is around 1s. The db-internal file sizes are listed below (smaller files are kept out for brevity). In total, about 140 files were created for 10 slots, and the file sizes per slot line up approximately with our 20MB write throughput. The rocks perf exploration code can be found here

8.0K	./state/9/OPTIONS-000102
8.0K	./state/genesis/MANIFEST-000005
20K	./rocks-file-sizes.log
24K	./state/0/LOG
24K	./state/1/LOG
24K	./state/2/LOG
24K	./state/3/LOG
24K	./state/4/LOG
24K	./state/5/LOG
24K	./state/6/LOG
24K	./state/7/LOG
24K	./state/8/LOG
24K	./state/9/LOG
324K	./state/genesis/LOG
1.4M	./state/0/000015.sst
1.4M	./state/1/000024.sst
1.4M	./state/2/000033.sst
1.4M	./state/3/000042.sst
1.4M	./state/4/000051.sst
1.4M	./state/5/000060.sst
1.4M	./state/6/000069.sst
1.4M	./state/7/000078.sst
1.4M	./state/8/000087.sst
1.4M	./state/9/000096.sst
19M	./state/0/000009.sst
19M	./state/0/000011.sst
19M	./state/0/000013.sst
19M	./state/1/000018.sst
19M	./state/1/000020.sst
19M	./state/1/000022.sst
19M	./state/2/000027.sst
19M	./state/2/000029.sst
19M	./state/2/000031.sst
19M	./state/3/000036.sst
19M	./state/3/000038.sst
19M	./state/3/000040.sst
19M	./state/4/000045.sst
19M	./state/4/000047.sst
19M	./state/4/000049.sst
19M	./state/5/000054.sst
19M	./state/5/000056.sst
19M	./state/5/000058.sst
19M	./state/6/000063.sst
19M	./state/6/000065.sst
19M	./state/6/000067.sst
19M	./state/7/000072.sst
19M	./state/7/000074.sst
19M	./state/7/000076.sst
19M	./state/8/000081.sst
19M	./state/8/000083.sst
19M	./state/8/000085.sst
19M	./state/9/000090.sst
19M	./state/9/000092.sst
19M	./state/9/000094.sst
58M	./state/1/000016.sst
58M	./state/2/000025.sst
58M	./state/3/000034.sst
58M	./state/4/000043.sst
58M	./state/5/000052.sst
58M	./state/6/000061.sst
58M	./state/8/000079.sst
58M	./state/9/000088.sst
58M	./state/genesis/000097.sst
59M	./state/7/000070.sst

Checkpoints

This exploration aims to explore how rocksdb checkpoints can be used to support our forking requirements. We need to be able to follow multiple forks, each with distinct states. For simulating this, we load a tree into rocks. The tree structure is outlined below. Arrows are numbered indicating the order of node creation. A key finding of this exploration is that rocks holds a db lock on the original db that a client was opened against. This means we can't have more than one fork sourced from any single block. As a workaround, we should be able to walk back from the locked block until we find an unlocked block, and re-process blocks until we reach & follow the fork. It's worth noting that for this to occurr in practice, it would require three validators to produce a block building on the same parent. The checkpoint exploration code can be found here.

processing benny
keys: [b'alfred', b'benny']

processing chase
keys: [b'alfred', b'benny', b'chase']

processing dian
keys: [b'alfred', b'benny', b'chase', b'dian']

processing elly
keys: [b'alfred', b'benny', b'chase', b'dian', b'elly']

processing boris
keys: [b'alfred', b'boris']

processing claire
keys: [b'alfred', b'boris', b'claire']

processing chris
keys: [b'alfred', b'boris', b'chris']

processing diaz
keys: [b'alfred', b'boris', b'chris', b'diaz']

processing cory
IO error: lock hold by current process, acquire time 1741790751 acquiring thread 8631324480: ./tree/boris/LOCK: No locks available

Contents

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explorations

DB

PVM Invocations by Activity

Data Access by Activity

State Accessor

Architecture

DB Comparison

Rocks

Pebble

Level

Redis

Dragonfly

Foundation

LMDB with heed

libmdbx

sled

RocksDB POC

Performance

Checkpoints

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally