Experiment: PebbleIndex #155

Draft
joamaki wants to merge 3 commits into main from pr/joamaki/pebble-index

@joamaki joamaki commented Mar 16, 2026

This is a thought exercise in implementing support for on-disk indexes using the Pebble library. It was implemented 100% by Codex, with minor input from me to come up with a reasonable plan (plan-pebble-index.md) and tweaks to avoid overly large changes to existing code. As such, I do not consider this anywhere near production quality; it is meant solely as an experiment to gauge the feasibility of on-disk indexes.

PebbleIndex can be used as either the primary or a secondary index. A PebbleIndex is always unique. To restore tables with Pebble indexes, db.Restore must be called before the database is used.

Benchmarks with pebble.NoSync committing of the batch:

BenchmarkDB_WriteTxn_1-8                         1114802              1075 ns/op            930177 objects/sec      1064 B/op         17 allocs/op
BenchmarkDB_WriteTxn_10-8                        2676404               471.6 ns/op         2120280 objects/sec       526 B/op          8 allocs/op
BenchmarkDB_WriteTxn_100-8                       3023170               399.4 ns/op         2503613 objects/sec       537 B/op          7 allocs/op
BenchmarkDB_WriteTxn_100_Pebble-8                 395737              2855 ns/op            350215 objects/sec      1279 B/op         24 allocs/op
BenchmarkDB_RandomLookup-8                         33684             34905 ns/op          28649147 objects/sec         0 B/op          0 allocs/op
BenchmarkDB_RandomLookup_Pebble-8                   1236            956300 ns/op           1045697 objects/sec    512791 B/op       9001 allocs/op
BenchmarkDB_SequentialLookup-8                     44026             27245 ns/op          36703490 objects/sec         0 B/op          0 allocs/op
BenchmarkDB_SequentialLookup_Pebble-8               1278            936724 ns/op           1067550 objects/sec    512878 B/op       9001 allocs/op

With pebble.Sync for actually durable commits:

BenchmarkDB_WriteTxn_100_Pebble-8                  19905             61252 ns/op             16326 objects/sec      1291 B/op         24 allocs/op
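
To make the figures above easier to compare, here is a small sketch that normalizes them to a per-object cost and computes the slowdown of the Pebble variants relative to the in-memory ones. The helper names (`objectsPerOp`, `nsPerObject`, `slowdown`) are made up for this illustration and are not part of StateDB; the input numbers are copied from the benchmark lines quoted above.

```go
package main

import "fmt"

// objectsPerOp recovers how many objects a single benchmark iteration
// processed, using the ns/op and objects/sec columns of one output line.
func objectsPerOp(nsPerOp, objectsPerSec float64) float64 {
	return nsPerOp * objectsPerSec / 1e9
}

// nsPerObject converts the objects/sec column into a per-object cost.
func nsPerObject(objectsPerSec float64) float64 {
	return 1e9 / objectsPerSec
}

// slowdown is the per-object cost of a variant relative to a baseline.
func slowdown(variantNs, baselineNs float64) float64 {
	return variantNs / baselineNs
}

func main() {
	// objects/sec values copied from the benchmark output above.
	memWrite := nsPerObject(2503613)    // WriteTxn_100
	pebbleNoSync := nsPerObject(350215) // WriteTxn_100_Pebble, pebble.NoSync
	pebbleSync := nsPerObject(16326)    // WriteTxn_100_Pebble, pebble.Sync
	memLookup := nsPerObject(28649147)  // RandomLookup
	pebbleLookup := nsPerObject(1045697) // RandomLookup_Pebble

	fmt.Printf("objects per lookup op: %.0f\n", objectsPerOp(956300, 1045697))
	fmt.Printf("write slowdown (NoSync): %.1fx\n", slowdown(pebbleNoSync, memWrite))
	fmt.Printf("write slowdown (Sync):   %.1fx\n", slowdown(pebbleSync, memWrite))
	fmt.Printf("random lookup slowdown:  %.1fx\n", slowdown(pebbleLookup, memLookup))
}
```

With these inputs, each lookup iteration covers about 1000 objects, so the 9001 allocs/op in the Pebble lookup benchmarks is about 9 allocations per lookup. Writes come out roughly 7x slower with `pebble.NoSync` and roughly 150x slower with `pebble.Sync`, and random lookups roughly 27x slower.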

To be considered:

  • Export the tableIndex etc. interfaces and implement this as a separate Go module so that we don't expose these extra dependencies to all users. Depends on "Proposal: Split into multiple Go modules" (#154). Pebble is about 160k lines and is not yet imported by cilium/cilium.
  • This is currently more tightly coupled to the StateDB core than I would like (WithPebble, Restore, changes to the Commit path, Err*Pebble*, ...).
  • Using this to persist objects received from the k8s api-server and avoid a full re-list on restart may not be worth it. The default "historical record" is 5 minutes (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes), so if an agent restart takes longer than that before it tries to Watch again to continue syncing a StateDB table, it will need to re-list and throw the old data away. Watch may also return objects out of order with respect to ResourceVersion, so we would likely need to persist the highest ResourceVersion seen so far separately from PebbleIndex. The alternative is a full scan of the table to find the highest ResourceVersion, although ResourceVersion is supposed to be an opaque string and is not considered comparable.
  • Some of the use-cases for pkg/wal could likely be replaced with this. A new 160k-line dependency for niche uses is probably not worth it for cilium/cilium at this time, though. If we want persistence but always with in-memory indexing, we could instead consider hooks for using something like pkg/wal with the tables. The benefit of PebbleIndex is that we would not need to hold all objects in memory, provided that either all indexes are PebbleIndex (including the internal RevisionIndex) or the in-memory indexes hold only primary keys and we look up the object from disk via the PebbleIndex.
  • The main issue with persistence is the need for stable formats. With Cilium we can somewhat sidestep that for some use-cases by persisting in a way that only survives restarts of the same Cilium version and always starting from scratch on upgrade. That, however, makes it unsuitable for cases where we need the persisted state to, e.g., clean up stale kernel state left over from an earlier run. If we go with persistence for objects, we really should use e.g. protobuf with backwards-compatibility checks in place, and also think about how to orchestrate migration. A can of worms we might not want to open.

Signed-off-by: Jussi Maki <jussi.maki@isovalent.com>
github-actions bot commented Mar 16, 2026

$ make
go build ./...
go: downloading go.yaml.in/yaml/v3 v3.0.3
go: downloading github.com/cilium/hive v1.0.0
go: downloading golang.org/x/time v0.5.0
go: downloading github.com/spf13/cobra v1.8.0
go: downloading github.com/spf13/pflag v1.0.5
go: downloading github.com/cilium/stream v0.0.0-20240209152734-a0792b51812d
go: downloading github.com/cockroachdb/pebble v1.1.5
go: downloading github.com/liggitt/tabwriter v0.0.0-20181228230101-89fcab3d43de
go: downloading github.com/spf13/viper v1.18.2
go: downloading go.uber.org/dig v1.17.1
go: downloading golang.org/x/term v0.16.0
go: downloading github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc
go: downloading github.com/mitchellh/mapstructure v1.5.0
go: downloading golang.org/x/sys v0.18.0
go: downloading golang.org/x/tools v0.17.0
go: downloading github.com/fsnotify/fsnotify v1.7.0
go: downloading github.com/sagikazarmark/slog-shim v0.1.0
go: downloading github.com/spf13/afero v1.11.0
go: downloading github.com/spf13/cast v1.6.0
go: downloading github.com/subosito/gotenv v1.6.0
go: downloading github.com/hashicorp/hcl v1.0.0
go: downloading gopkg.in/ini.v1 v1.67.0
go: downloading github.com/magiconair/properties v1.8.7
go: downloading github.com/pelletier/go-toml/v2 v2.1.0
go: downloading gopkg.in/yaml.v3 v3.0.1
go: downloading golang.org/x/text v0.14.0
go: downloading github.com/cockroachdb/errors v1.11.3
go: downloading github.com/cockroachdb/fifo v0.0.0-20240606204812-0bbfbd93a7ce
go: downloading github.com/cockroachdb/redact v1.1.5
go: downloading github.com/cockroachdb/tokenbucket v0.0.0-20230807174530-cc333fc44b06
go: downloading github.com/prometheus/client_golang v1.15.0
go: downloading golang.org/x/exp v0.0.0-20240119083558-1b970713d09a
go: downloading github.com/pkg/errors v0.9.1
go: downloading github.com/DataDog/zstd v1.4.5
go: downloading github.com/cespare/xxhash/v2 v2.2.0
go: downloading github.com/golang/snappy v0.0.4
go: downloading github.com/cockroachdb/logtags v0.0.0-20230118201751-21c54148d20b
go: downloading github.com/getsentry/sentry-go v0.27.0
go: downloading github.com/beorn7/perks v1.0.1
go: downloading github.com/prometheus/client_model v0.3.0
go: downloading github.com/prometheus/common v0.42.0
go: downloading github.com/prometheus/procfs v0.9.0
go: downloading google.golang.org/protobuf v1.33.0
go: downloading github.com/gogo/protobuf v1.3.2
go: downloading github.com/kr/pretty v0.3.1
go: downloading github.com/golang/protobuf v1.5.3
go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.4
go: downloading github.com/kr/text v0.2.0
go: downloading github.com/rogpeppe/go-internal v1.11.0
STATEDB_VALIDATE=1 go test ./... -cover -vet=all -test.count 1
go: downloading github.com/stretchr/testify v1.9.0
go: downloading go.uber.org/goleak v1.3.0
go: downloading github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2
ok  	github.com/cilium/statedb	404.173s	coverage: 76.1% of statements
ok  	github.com/cilium/statedb/index	0.006s	coverage: 33.7% of statements
ok  	github.com/cilium/statedb/internal	0.017s	coverage: 42.9% of statements
ok  	github.com/cilium/statedb/lpm	4.576s	coverage: 77.9% of statements
ok  	github.com/cilium/statedb/part	66.781s	coverage: 87.3% of statements
ok  	github.com/cilium/statedb/reconciler	0.213s	coverage: 91.9% of statements
	github.com/cilium/statedb/reconciler/benchmark		coverage: 0.0% of statements
	github.com/cilium/statedb/reconciler/example		coverage: 0.0% of statements
go test -race ./... -test.count 1
ok  	github.com/cilium/statedb	36.899s
ok  	github.com/cilium/statedb/index	1.017s
ok  	github.com/cilium/statedb/internal	1.028s
ok  	github.com/cilium/statedb/lpm	2.861s
ok  	github.com/cilium/statedb/part	35.939s
ok  	github.com/cilium/statedb/reconciler	1.390s
?   	github.com/cilium/statedb/reconciler/benchmark	[no test files]
?   	github.com/cilium/statedb/reconciler/example	[no test files]
go test ./... -bench . -benchmem -test.run xxx
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkDB_WriteTxn_1-4                      	  654338	      1714 ns/op	    583404 objects/sec	    1064 B/op	      17 allocs/op
BenchmarkDB_WriteTxn_10-4                     	 1621365	       737.5 ns/op	   1355952 objects/sec	     526 B/op	       8 allocs/op
BenchmarkDB_WriteTxn_100-4                    	 1877116	       719.7 ns/op	   1389516 objects/sec	     537 B/op	       7 allocs/op
BenchmarkDB_WriteTxn_100_Pebble-4             	  133813	      8919 ns/op	    112115 objects/sec	    1278 B/op	      24 allocs/op
BenchmarkDB_WriteTxn_1000-4                   	 1583676	       762.4 ns/op	   1311595 objects/sec	     522 B/op	       7 allocs/op
BenchmarkDB_WriteTxn_100_SecondaryIndex-4     	  757837	      1494 ns/op	    669356 objects/sec	    1101 B/op	      20 allocs/op
BenchmarkDB_WriteTxn_CommitOnly_100Tables-4   	  865263	      1323 ns/op	    1112 B/op	       5 allocs/op
BenchmarkDB_WriteTxn_CommitOnly_1Table-4      	 1595713	       751.8 ns/op	     224 B/op	       5 allocs/op
BenchmarkDB_NewWriteTxn-4                     	 1835118	       653.1 ns/op	     200 B/op	       4 allocs/op
BenchmarkDB_WriteTxnCommit100-4               	  930816	      1309 ns/op	    1096 B/op	       5 allocs/op
BenchmarkDB_NewReadTxn-4                      	549719881	         2.187 ns/op	       0 B/op	       0 allocs/op
BenchmarkDB_Modify-4                          	    1383	    838211 ns/op	   1193017 objects/sec	  546201 B/op	    8095 allocs/op
BenchmarkDB_GetInsert-4                       	    1332	    894528 ns/op	   1117908 objects/sec	  530172 B/op	    8095 allocs/op
BenchmarkDB_RandomInsert-4                    	    1568	    770287 ns/op	   1298217 objects/sec	  522119 B/op	    7095 allocs/op
BenchmarkDB_RandomReplace-4                   	     457	   2610829 ns/op	    383020 objects/sec	 2073441 B/op	   29147 allocs/op
BenchmarkDB_SequentialInsert-4                	    1533	    761746 ns/op	   1312774 objects/sec	  522121 B/op	    7095 allocs/op
BenchmarkDB_SequentialInsert_Prefix-4         	     426	   2835621 ns/op	    352656 objects/sec	 3564807 B/op	   45542 allocs/op
BenchmarkDB_Changes_Baseline-4                	    1377	    871057 ns/op	   1148031 objects/sec	  582358 B/op	    9187 allocs/op
BenchmarkDB_Changes-4                         	     822	   1461904 ns/op	    684039 objects/sec	  783912 B/op	   12339 allocs/op
BenchmarkDB_RandomLookup-4                    	   21722	     54822 ns/op	  18240786 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_RandomLookup_Pebble-4             	     790	   1490695 ns/op	    670828 objects/sec	  512372 B/op	    9001 allocs/op
BenchmarkDB_SequentialLookup-4                	   26359	     45524 ns/op	  21966312 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_SequentialLookup_Pebble-4         	     805	   1484670 ns/op	    673550 objects/sec	  512435 B/op	    9001 allocs/op
BenchmarkDB_Prefix_SecondaryIndex-4           	    6367	    182070 ns/op	   5492407 objects/sec	  124920 B/op	    1025 allocs/op
BenchmarkDB_FullIteration_All-4               	     979	   1230629 ns/op	  81259255 objects/sec	     104 B/op	       4 allocs/op
BenchmarkDB_FullIteration_Prefix-4            	     790	   1604146 ns/op	  62338485 objects/sec	     136 B/op	       5 allocs/op
BenchmarkDB_FullIteration_Get-4               	     206	   5763986 ns/op	  17349105 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_FullIteration_Get_Secondary-4     	     108	  10923903 ns/op	   9154237 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_FullIteration_ReadTxnGet-4        	     216	   5563518 ns/op	  17974238 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_PropagationDelay-4                	  557546	      2038 ns/op	        15.00 50th_µs	        21.00 90th_µs	       136.0 99th_µs	    1134 B/op	      20 allocs/op
BenchmarkDB_WriteTxn_100_LPMIndex-4           	  488535	      2433 ns/op	    410949 objects/sec	    1826 B/op	      37 allocs/op
BenchmarkDB_WriteTxn_1_LPMIndex-4             	  127623	     14394 ns/op	     69475 objects/sec	   15843 B/op	      82 allocs/op
BenchmarkDB_LPMIndex_Get-4                    	     398	   3040163 ns/op	   3289297 objects/sec	       0 B/op	       0 allocs/op
BenchmarkWatchSet_4-4                         	 2072787	       570.7 ns/op	     296 B/op	       4 allocs/op
BenchmarkWatchSet_16-4                        	  662799	      1792 ns/op	    1096 B/op	       5 allocs/op
BenchmarkWatchSet_128-4                       	   77611	     15536 ns/op	    8904 B/op	       5 allocs/op
BenchmarkWatchSet_1024-4                      	    7832	    154434 ns/op	   73744 B/op	       5 allocs/op
PASS
ok  	github.com/cilium/statedb	49.130s
PASS
ok  	github.com/cilium/statedb/index	0.004s
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb/internal
cpu: AMD EPYC 7763 64-Core Processor                
Benchmark_SortableMutex-4   	 6201354	       192.7 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/cilium/statedb/internal	1.200s
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb/lpm
cpu: AMD EPYC 7763 64-Core Processor                
Benchmark_txn_insert/batchSize=1-4         	    1872	    637338 ns/op	   1569025 objects/sec	  838409 B/op	   13975 allocs/op
Benchmark_txn_insert/batchSize=10-4        	    3032	    394090 ns/op	   2537491 objects/sec	  385195 B/op	    6668 allocs/op
Benchmark_txn_insert/batchSize=100-4       	    3231	    370148 ns/op	   2701625 objects/sec	  345613 B/op	    6027 allocs/op
Benchmark_txn_delete/batchSize=1-4         	    1573	    784234 ns/op	   1275130 objects/sec	 1286471 B/op	   13976 allocs/op
Benchmark_txn_delete/batchSize=10-4        	    2973	    387196 ns/op	   2582671 objects/sec	  372417 B/op	    5769 allocs/op
Benchmark_txn_delete/batchSize=100-4       	    3458	    347768 ns/op	   2875477 objects/sec	  286753 B/op	    5038 allocs/op
Benchmark_LPM_Lookup-4                     	    8014	    149238 ns/op	   6700711 objects/sec	       0 B/op	       0 allocs/op
Benchmark_LPM_All-4                        	  143037	      8922 ns/op	 112080851 objects/sec	      32 B/op	       1 allocs/op
Benchmark_LPM_Prefix-4                     	  136282	      9175 ns/op	 108987228 objects/sec	      32 B/op	       1 allocs/op
Benchmark_LPM_LowerBound-4                 	  256156	      4774 ns/op	 104737269 objects/sec	     288 B/op	       2 allocs/op
PASS
ok  	github.com/cilium/statedb/lpm	12.130s
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb/part
cpu: AMD EPYC 7763 64-Core Processor                
Benchmark_Uint64Map_Random-4                  	    1594	    750310 ns/op	   1332783 items/sec	 2525292 B/op	    6024 allocs/op
Benchmark_Uint64Map_Sequential-4              	    1929	    631103 ns/op	   1584527 items/sec	 2216722 B/op	    5754 allocs/op
Benchmark_Uint64Map_Sequential_Insert-4       	    2109	    594991 ns/op	   1680697 items/sec	 2208717 B/op	    4753 allocs/op
Benchmark_Uint64Map_Sequential_Txn_Insert-4   	   10000	    104222 ns/op	   9594860 items/sec	   86352 B/op	    2028 allocs/op
Benchmark_Uint64Map_Random_Insert-4           	    1771	    685116 ns/op	   1459606 items/sec	 2517723 B/op	    5029 allocs/op
Benchmark_Uint64Map_Random_Txn_Insert-4       	    7075	    167553 ns/op	   5968277 items/sec	  119815 B/op	    2418 allocs/op
Benchmark_Insert_RootOnlyWatch-4              	    9547	    110844 ns/op	   9021723 objects/sec	   71504 B/op	    2033 allocs/op
Benchmark_Insert-4                            	    7424	    158828 ns/op	   6296127 objects/sec	  186937 B/op	    3060 allocs/op
Benchmark_Modify-4                            	   12657	     94740 ns/op	  10555243 objects/sec	   58224 B/op	    1007 allocs/op
Benchmark_GetInsert-4                         	    9579	    123571 ns/op	   8092501 objects/sec	   58224 B/op	    1007 allocs/op
Benchmark_Replace-4                           	31612095	        37.48 ns/op	  26680404 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Replace_RootOnlyWatch-4             	32013414	        37.40 ns/op	  26741339 objects/sec	       0 B/op	       0 allocs/op
Benchmark_txn_1-4                             	 6226521	       191.4 ns/op	   5224718 objects/sec	     168 B/op	       3 allocs/op
Benchmark_txn_10-4                            	10359091	       114.7 ns/op	   8717758 objects/sec	      86 B/op	       2 allocs/op
Benchmark_txn_100-4                           	12027903	        98.65 ns/op	  10137183 objects/sec	      80 B/op	       2 allocs/op
Benchmark_txn_1000-4                          	10654519	       111.1 ns/op	   9001134 objects/sec	      65 B/op	       2 allocs/op
Benchmark_txn_delete_1-4                      	 5001459	       246.1 ns/op	   4064024 objects/sec	     664 B/op	       4 allocs/op
Benchmark_txn_delete_10-4                     	11026455	       108.5 ns/op	   9218395 objects/sec	     106 B/op	       1 allocs/op
Benchmark_txn_delete_100-4                    	12343825	        96.21 ns/op	  10393835 objects/sec	      47 B/op	       1 allocs/op
Benchmark_txn_delete_1000-4                   	14895100	        80.88 ns/op	  12364387 objects/sec	      24 B/op	       1 allocs/op
Benchmark_Get-4                               	   45277	     26467 ns/op	  37782899 objects/sec	       0 B/op	       0 allocs/op
Benchmark_All-4                               	  122899	     10399 ns/op	  96162402 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Iterator_All-4                      	  153076	      9555 ns/op	 104652795 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Iterator_Next-4                     	  160516	      7465 ns/op	 133963675 objects/sec	     896 B/op	       1 allocs/op
Benchmark_Hashmap_Insert-4                    	   15088	     79498 ns/op	  12578968 objects/sec	   74264 B/op	      20 allocs/op
Benchmark_Hashmap_Get_Uint64-4                	  132286	      9089 ns/op	 110028251 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Hashmap_Get_Bytes-4                 	  109214	     10986 ns/op	  91023694 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Delete_Random-4                     	      84	  14374996 ns/op	   6956524 objects/sec	 2111868 B/op	  102363 allocs/op
Benchmark_find16-4                            	202104613	         5.933 ns/op	       0 B/op	       0 allocs/op
Benchmark_findIndex16-4                       	100000000	        12.41 ns/op	       0 B/op	       0 allocs/op
Benchmark_find48-4                            	385225512	         3.115 ns/op	       0 B/op	       0 allocs/op
Benchmark_findIndex48_hit-4                   	427897045	         2.803 ns/op	       0 B/op	       0 allocs/op
Benchmark_findIndex48_miss-4                  	384239005	         3.120 ns/op	       0 B/op	       0 allocs/op
Benchmark_find4-4                             	424181649	         2.829 ns/op	       0 B/op	       0 allocs/op
Benchmark_findIndex4-4                        	320768385	         3.752 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/cilium/statedb/part	43.392s
PASS
ok  	github.com/cilium/statedb/reconciler	0.007s
?   	github.com/cilium/statedb/reconciler/benchmark	[no test files]
?   	github.com/cilium/statedb/reconciler/example	[no test files]
go run ./reconciler/benchmark -quiet
1000000 objects reconciled in 1.99 seconds (batch size 1000)
Throughput 502054.27 objects per second
888MB total allocated, 6013717 in-use objects, 338MB bytes in use

joamaki added 2 commits March 16, 2026 12:08
Signed-off-by: Jussi Maki <jussi.maki@isovalent.com>
Signed-off-by: Jussi Maki <jussi.maki@isovalent.com>
@joamaki joamaki changed the title from "Proposal: PebbleIndex" to "Experiment: PebbleIndex" Mar 16, 2026