Blocked in use#17

Open
harrylin98 wants to merge 72 commits into JimB123:jims-forkless from harrylin98:blockedInUse

Conversation

@harrylin98

No description provided.

harrylin98 and others added 9 commits January 14, 2026 12:28
### This PR adds GoogleTest (gtest) support to Valkey to enable writing
modern unit tests.

**Motivation**: 
GoogleTest provides richer assertions, test fixtures, mocking support,
and improved diagnostics, helping improve test coverage and
maintainability over time.

**Summary**:
The change is limited to test infrastructure, and existing C unit tests
remain unchanged.

This PR focuses only on integrating the framework and includes a small
set of example tests to demonstrate usage.

For more details, see `src/gtest/README.md`.

---------

Signed-off-by: Harry Lin <harrylhl@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Harry Lin <49881386+harrylin98@users.noreply.github.com>
Co-authored-by: Harry Lin <harrylhl@amazon.com>
Co-authored-by: Jim Brunner <brunnerj@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Jacob Murphy <jkmurphy@google.com>
This PR fixes the issue where changes to src files did not trigger a
rebuild of
libvalkey.a when rerunning `make test-gtest`.

It also resolves the problem where unchanged .cpp test files were
unnecessarily recompiled
during make test-gtest.

Now:

If nothing changed:
- Nothing rebuilds, no relinking occurs.

If only C++ test files changed:
- Both libvalkey.a and C++ test files recompile (because release.o
changes)
- Test executable relinks

If only src code changed:
- Only libvalkey.a rebuilds
- C++ test files do not recompile
- Test executable relinks

If both changed:
- Both libvalkey.a and C++ test files rebuild
- Test executable relinks

Signed-off-by: Harry Lin <harrylhl@amazon.com>
Co-authored-by: Harry Lin <harrylhl@amazon.com>
Resolved conflicts by merging both gtest and libbacktrace features:
- ci.yml: Combined gtest dependencies with libbacktrace support
- Makefile: Added libbacktrace configuration alongside gtest-parallel
- daily.yml: Merged g++-multilib with libbacktrace build steps

All build configurations now support both USE_LIBBACKTRACE and gtest unit tests.
…alkey-io#3217)

## Add `ALLOW_BUSY` flag to `SELECT` command

### Motivation
When a client decides to switch databases (e.g., from DB 0 to DB 1), it
must issue the `SELECT` command.
However, it is possible that when the command was sent, the server was
running a very long script, causing it to respond with a `-BUSY` error.
This leads to several issues:
1. Pipelining clients can face inconsistency when the `SELECT` is
followed by a pipeline of commands that write to this database. If only
the `SELECT` gets the `-BUSY` error and some of the remaining commands
are processed AFTER the long script finishes, these commands will be
processed in the wrong database context.
2. Although clients can maintain logic to handle the prior issue, it
complicates client logic and introduces a "head of line" delay, since
the client needs to wait for the `SELECT` reply before it can continue
pipelining commands, potentially forcing it to allocate a different
connection per database. Also, this is probably NOT handled by any
known client library at the moment.
3. With the introduction of multi-database support in cluster mode,
clients managing connections across multiple shards need to maintain a
consistent database selection across all their connections. When a
client decides to switch databases (e.g., from DB 0 to DB 1), it must
issue `SELECT` to every shard in the cluster.
Currently, if any shard is busy (e.g., executing a long-running Lua
script or module command), the `SELECT` call to that shard will be
rejected with a `-BUSY` error. This creates a **split-brain scenario**
where some connections have switched to the new database while others
remain on the old one.
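
The pipelining hazard from item 1 can be illustrated with a toy model. This is only a sketch and not Valkey code: `toy_conn`, `toy_cmd` and `toy_run_pipeline` are hypothetical names, and "busy" is modeled as simply rejecting the first few pipelined commands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not Valkey code: every name here is hypothetical. */
typedef enum { CMD_SELECT, CMD_SET } toy_cmd_type;

typedef struct {
    toy_cmd_type type;
    int db; /* target DB index for CMD_SELECT; unused for CMD_SET */
} toy_cmd;

typedef struct {
    int current_db;         /* DB the connection currently points at */
    bool select_allow_busy; /* models the allow_busy flag on SELECT */
} toy_conn;

/* Process a pipeline. The server is "busy" for the first busy_for commands,
 * during which any command that is not allowed while busy gets -BUSY.
 * Returns the DB index on which the first CMD_SET executed, or -1 if no
 * CMD_SET was executed. */
int toy_run_pipeline(toy_conn *c, const toy_cmd *cmds, size_t n, size_t busy_for) {
    assert(c != NULL && cmds != NULL);
    int first_set_db = -1;
    for (size_t i = 0; i < n; i++) {
        bool busy = i < busy_for;
        if (cmds[i].type == CMD_SELECT) {
            if (busy && !c->select_allow_busy) continue; /* -BUSY: DB unchanged */
            c->current_db = cmds[i].db;
        } else {
            if (busy) continue; /* writes are still rejected while busy */
            if (first_set_db == -1) first_set_db = c->current_db;
        }
    }
    return first_set_db;
}
```

With the pipeline `SELECT 1` followed by a write, and the server busy only for the first command, the write lands on DB 0 when `select_allow_busy` is false and on DB 1 when it is true.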

### Why `ALLOW_BUSY` is safe for `SELECT`

`SELECT` is a **connection-local, metadata-only operation**. It simply
changes which database index the current client connection points to.
It:

- **Does not read or write any keys** - there is no data conflict with a
running script.
- **Is already flagged as `fast`** - it runs in O(1) and will not
contribute to or extend any busy condition.
- **Is already flagged as `loading`** - the server already recognizes
that `SELECT` is safe to execute during sensitive server states (dataset
loading from disk), establishing a precedent that this command is
harmless to run when other commands are being rejected.
- **Is already flagged as `stale`** - similarly, `SELECT` is allowed on
stale replicas, further confirming it is treated as a safe, non-data
operation.

### Changes

- Added `allow_busy` to the `command_flags` for `SELECT` in
`src/commands/select.json`.

Signed-off-by: Gabi Ganam <ggabi@amazon.com>
Co-authored-by: Gabi Ganam <ggabi@amazon.com>
Nikhil-Manglore and others added 17 commits February 19, 2026 20:34
The `test_quicklistCompressAndDecompressQuicklistListpackNode` unit test
was failing with ASan in CI with OOM errors. The test was allocating and
compressing up to 1GB of data (32 iterations × 32MB), which exceeded
available memory in CI environments. The test is skipped for ASan
builds.

This PR addresses the unit test failures in
`test-sanitizer-address-large-memory` for both GCC and Clang.

Resolves valkey-io#3221

---------

Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>
## Overview:

This PR converts the existing C unit tests to GoogleTest, building on
the introduction of the GoogleTest framework in valkey-io#2956, as
mentioned in valkey-io#2878.

## Details:
1. Kept the previous test logic as much as possible, also kept all
original comments from C unit tests.
2. Deleted C unit tests.
3. Changed headers: include "generated_wrappers.hpp" and remove all
header files already in "generated_wrappers.hpp".
4. Used extern "C" to wrap the C include files to prevent name mangling.
5. Added Test Fixture with Setup/Teardown.

## Notes:
1. `ustime` is included in `util.h`, so deleted some duplicated
`ustime`.
2. A lot of C tests include `.c` files directly to use static functions,
but we don't want to copy static functions from `.c` to `.cpp`, so I
added wrapper functions (prefix: `testOnly`) to `.c` to use in `.cpp`
file.
3. Some C tests have shared state which is incompatible with
gtest-parallel, so I changed them to isolated tests. (e.g. in
`test_dict.cpp`, I made `_dict` an instance variable created fresh in
SetUp() for each test, and added back the necessary setup code to each
test so they can run independently in any order or in parallel).


## Testing:
* GTEST: 253 tests + 32 disabled tests (285 tests in total) all pass

---------

Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
Signed-off-by: Harry Lin <harrylhl@amazon.com>
Signed-off-by: Harry Lin <49881386+harrylin98@users.noreply.github.com>
Signed-off-by: Alina Liu <alinalq@dev-dsk-alinalq-2b-2db84246.us-west-2.amazon.com>
Co-authored-by: Harry Lin <harrylhl@amazon.com>
Co-authored-by: Harry Lin <49881386+harrylin98@users.noreply.github.com>
Co-authored-by: Alina Liu <alinalq@dev-dsk-alinalq-2b-2db84246.us-west-2.amazon.com>
Signed-off-by: Harry Lin <harrylhl@amazon.com>
Signed-off-by: Harry Lin <49881386+harrylin98@users.noreply.github.com>
Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
Signed-off-by: Alina Liu <alinalq@dev-dsk-alinalq-2b-2db84246.us-west-2.amazon.com>
Co-authored-by: Harry Lin <harrylhl@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Alina Liu <liusalisa6363@gmail.com>
Co-authored-by: Alina Liu <alinalq@dev-dsk-alinalq-2b-2db84246.us-west-2.amazon.com>
Signed-off-by: Harry Lin <harrylhl@amazon.com>
Update deps/libvalkey to version 0.4.0

Squashed 'deps/libvalkey/' changes from b012f8e85..45c2ed15c

45c2ed15c Release 0.4.0 (valkey-io#286)
40d6590d7 Implement runtime dynamic loading for RDMA libraries (valkey-io#284)
62e757d17 Release 0.3.0 (valkey-io#283)
a554f0942 Fix potential uint32_t underflow issue (valkey-io#280)
8f9051ae0 Correcting command parser bug (valkey-io#277)
29023eb36 Add valkey-json, valkey-bloom, valkey-search to cmddef.h
ae756bc89 Update cmddef.h to Valkey 9.0.0
21abd737e Replace problematic alloca() with fixed stack alloc
38191079c Fix compilation on Solaris with Sun/Solaris Studio
ef5de0312 Make libvalkey initialization thread-safe
ae341dea5 Support slotmap updates using CLUSTER NODES in RESP3 (valkey-io#262)
36f6e2292 Fix the long-blocking read for Valkey RDMA. (valkey-io#233)
c090c28be Use a uintptr_t hop for casting pointers to ints
daa7f11ac Avoid heap buffer overflow in valkeyAsyncFormattedCommand (valkey-io#245)
15974930d Add option to select a logical database (valkey-io#244)
983d67e4f Install the macosx adapter on Apple platforms only
...

git-subtree-dir: deps/libvalkey
git-subtree-split: 45c2ed15cab9fa0ea1a6cabc8460f5eea6240de5

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
…onfigured (valkey-io#2846)

When dual-channel-replication is enabled, and replica-announce-ip is
set, the RDB/AOF channel does not announce itself at this endpoint. This
defaults to the IP address behind the NAT, or the Kubernetes Pod IP in
our case.

This means that if Sentinel is polling the primary for connected
replicas, it will first see the ephemeral pod IP, then revert to the
announce-ip - leaving behind the pod IP as a down replica.

This PR configures the RDB/AOF channel to also announce itself at the
announce-ip to prevent the stale replica.

## Testing

I evaluated writing unit tests for this, but I am not sure of a way we
can test an IP address different to localhost (127.0.0.1) that would
fail without the fix. I did test on Kubernetes against 9.0 tag and
verified the fix there too.

### Status quo

On 9.0 image tag:

```
$ kubectl get pods -n valkey-baseline -o custom-columns=NAME:.metadata.name,POD-IP:.status.podIP
NAME                              POD-IP
valkey-primary-5bd78c8566-llb6k   10.244.0.25
valkey-replica-0                  10.244.0.17
valkey-replica-1                  10.244.0.13

$ kubectl get services -n valkey-baseline -o custom-columns=NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP
NAME               CLUSTER-IP
valkey-primary     10.96.147.28
valkey-replica-0   10.96.66.233
valkey-replica-1   10.96.57.230
```

Logs below show that pod IP for valkey-primary-5bd78c8566-llb6k
`10.244.0.25:6379` is being used for dual-channel replication. This
should be its cluster IP `10.96.147.28` as this is what is set in
replica-announce-ip.

```
1:M 14 Nov 2025 17:57:51.750 * Replica 10.96.147.28:6379 asks for synchronization
1:M 14 Nov 2025 17:57:51.751 * Replica 10.244.0.25:6379 asks for synchronization
1:M 14 Nov 2025 17:57:56.135 * Dual channel replication: Sending to replica 10.244.0.25:6379 RDB end offset 1763269 and client-id 35
1:M 14 Nov 2025 17:57:56.140 * Replica 10.96.147.28:6379 asks for synchronization
```

### This fix

```
$ kubectl get pods -n valkey-test -o custom-columns=NAME:.metadata.name,POD-IP:.status.podIP
NAME                              POD-IP
valkey-primary-594c9597b5-qqvdk   10.244.0.26
valkey-replica-0                  10.244.0.10
valkey-replica-1                  10.244.0.18

$ kubectl get services -n valkey-test -o custom-columns=NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP
NAME               CLUSTER-IP
valkey-primary     10.96.125.142
valkey-replica     None
valkey-replica-0   10.96.155.74
valkey-replica-1   10.96.64.111
valkey-sentinel    None
```

Logs show that the Cluster IP is now being used for dual-channel
replication.

```
1:M 14 Nov 2025 17:57:49.923 * Replica 10.96.125.142:6379 asks for synchronization
1:M 14 Nov 2025 17:57:49.924 * Replica 10.96.125.142:6379 asks for synchronization
1:M 14 Nov 2025 17:57:54.913 * Dual channel replication: Sending to replica 10.96.125.142:6379 RDB end offset 1771247 and client-id 36
1:M 14 Nov 2025 17:57:54.916 * Replica 10.96.125.142:6379 asks for synchronization
```

Fixes valkey-io#2338

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>
…ast key (valkey-io#3197)

This issue was encountered while processing valkey-io#3121. Currently, in
all our commands with KSPEC_FK_KEYNUM, the key step is 1, so this bug
does not affect any core commands.

If we have commands with different key step values, calculating the last
key here will cause problems since we are not including the step in the
calculation.
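
As a minimal sketch of why the step matters (this is not the actual patch, and both helper names are hypothetical), compare a step-unaware last-key computation with one that accounts for the step:

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not the Valkey implementation. */

/* Step-unaware: only correct when consecutive keys are 1 argument apart. */
int toy_last_key_naive(int first, int num_keys) {
    return first + num_keys - 1;
}

/* Step-aware: num_keys keys starting at argv index first, step apart. */
int toy_last_key_with_step(int first, int num_keys, int step) {
    assert(num_keys >= 1 && step >= 1);
    return first + (num_keys - 1) * step;
}
```

For an imaginary command laid out as `CMD 3 k1 v1 k2 v2 k3 v3`, the keys sit at argv indices 2, 4 and 6. The step-aware helper returns 6, while the naive one returns 4 and would stop before `k3`.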

Signed-off-by: Binbin <binloveplay1314@qq.com>
… message

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: harrylin98 <harrylin980107@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Remove `-encoding binary` from `fconfigure`, as it is no longer
supported in Tcl 9 (Fedora Rawhide and Daily / test-fedoralatest-jemalloc
(pull_request)). `-translation binary` alone is sufficient.
```
https://github.com/valkey-io/valkey/actions/runs/22324297258/job/64590589777?pr=3225
===== Start of server log (pid 17617) =====

### Starting server for test 
17617:M 23 Feb 2026 21:21:47.846 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see jemalloc/jemalloc#1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
17617:M 23 Feb 2026 21:21:47.846 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
17617:M 23 Feb 2026 21:21:47.846 * Valkey version=9.0.3, bits=64, commit=00000000, modified=0, pid=17617, just started
17617:M 23 Feb 2026 21:21:47.846 * Configuration loaded
17617:M 23 Feb 2026 21:21:47.847 * monotonic clock: POSIX clock_gettime
                .+^+.                                                
            .+#########+.                                            
        .+########+########+.           Valkey 9.0.3 (00000000/0) 64 bit
    .+########+'     '+########+.                                    
 .########+'     .+.     '+########.    Running in cluster mode
 |####+'     .+#######+.     '+####|    Port: 25121
 |###|   .+###############+.   |###|    PID: 17617                     
 |###|   |#####*'' ''*#####|   |###|                                 
 |###|   |####'  .-.  '####|   |###|                                 
 |###|   |###(  (@@@)  )###|   |###|          https://valkey.io/      
 |###|   |####.  '-'  .####|   |###|                                 
 |###|   |#####*.   .*#####|   |###|                                 
 |###|   '+#####|   |#####+'   |###|                                 
 |####+.     +##|   |#+'     .+####|                                 
 '#######+   |##|        .+########'                                 
    '+###|   |##|    .+########+'                                    
        '|   |####+########+'                                        
             +#########+'                                            
                '+v+'                                                

17617:M 23 Feb 2026 21:21:47.849 * No cluster configuration found, I'm 5501916ccdf76dee6b652cab10402cab3a8f9152
17617:M 23 Feb 2026 21:21:47.852 * Server initialized
17617:M 23 Feb 2026 21:21:47.852 * Ready to accept connections tcp
17617:M 23 Feb 2026 21:21:47.852 * Ready to accept connections unix
17617:M 23 Feb 2026 21:21:47.963 - Accepted 127.0.0.1:37791
17617:M 23 Feb 2026 21:21:47.963 - Client closed connection id=2 addr=127.0.0.1:37791 laddr=127.0.0.1:25121 fd=12 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=37504 events=r cmd=ping user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=7 tot-net-out=7 tot-cmds=1
17617:M 23 Feb 2026 21:21:47.969 - Accepted 127.0.0.1:39251
17617:M 23 Feb 2026 21:21:47.970 * configEpoch set to 1 via CLUSTER SET-CONFIG-EPOCH
17617:M 23 Feb 2026 21:21:47.975 # Missing implement of connection type tls
17617:M 23 Feb 2026 21:21:47.976 # DEBUG LOG: ========== I am primary 0 ==========
17617:M 23 Feb 2026 21:21:49.872 * Cluster state changed: ok
### Starting test Packet with missing gossip messages don't cause invalid read in tests/unit/cluster/packet.tcl
17617:M 23 Feb 2026 21:21:49.879 - Accepting cluster node connection from 127.0.0.1:33701
17617:M 23 Feb 2026 21:21:49.880 - Client closed connection id=3 addr=127.0.0.1:39251 laddr=127.0.0.1:25121 fd=12 name= age=2 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=1024 rbp=426 obl=0 oll=0 omem=0 tot-mem=22144 events=r cmd=cluster|info user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=1363 tot-net-out=17324 tot-cmds=47
17617:signal-handler (1771881709) Received SIGTERM scheduling shutdown...
17617:M 23 Feb 2026 21:21:49.973 * User requested shutdown...
17617:M 23 Feb 2026 21:21:49.973 * Removing the pid file.
17617:M 23 Feb 2026 21:21:49.973 * Saving the cluster configuration file before exiting.
17617:M 23 Feb 2026 21:21:49.978 * Removing the unix socket file.
17617:M 23 Feb 2026 21:21:49.978 # Valkey is now ready to exit, bye bye...
===== End of server log (pid 17617) =====


===== Start of server stderr log (pid 17617) =====


===== End of server stderr log (pid 17617) =====

[exception]: Executing test client: unknown encoding "binary": No longer supported.
	please use either "-translation binary" or "-encoding iso8859-1".
unknown encoding "binary": No longer supported.
	please use either "-translation binary" or "-encoding iso8859-1"
    while executing
"fconfigure $sock -translation binary -encoding binary -buffering none -blocking 1"
    ("uplevel" body line 18)
    invoked from within
"uplevel 1 $code"
    (procedure "test" line 62)
    invoked from within
"test "Packet with missing gossip messages don't cause invalid read" {
        set base_port [srv 0 port]
        set cluster_port [expr {$base_port + ..."
    ("uplevel" body line 2)
    invoked from within
"uplevel 1 $code"
    (procedure "cluster_setup" line 41)
    invoked from within
"cluster_setup 1 0 1 continuous_slot_allocation default_replica_allocation {
    test "Packet with missing gossip messages don't cause invalid read" {
..."
    ("uplevel" body line 1)
    invoked from within
"uplevel 1 $code "
    (procedure "start_server" line 2)
    invoked from within
"start_server {overrides {cluster-enabled yes cluster-ping-interval 100 cluster-node-timeout 3000 cluster-databases 16 cluster-slot-stats-enabled yes} ..."
    ("uplevel" body line 1)
    invoked from within
"uplevel 1 $code"
    (procedure "start_multiple_servers" line 5)
    invoked from within
"start_multiple_servers $node_count $options $code"
    (procedure "start_cluster" line 17)
    invoked from within
"start_cluster 1 0 {tags {external:skip cluster tls:skip}} {
    test "Packet with missing gossip messages don't cause invalid read" {
        set base..."
    (file "tests/unit/cluster/packet.tcl" line 84)
    invoked from within
"source $path"
    (procedure "execute_test_file" line 4)
    invoked from within
"execute_test_file $data"
    (procedure "test_client_main" line 10)
    invoked from within
"test_client_main $::test_server_port "
Killing still running Valkey server 10700
Killing still running Valkey server 11659
Killing still running Valkey server 11705
Killing still running Valkey server 12735
Killing still running Valkey server 12775
Killing still running Valkey server 12808
Killing still running Valkey server 12858
Killing still running Valkey server 12901
Killing still running Valkey server 12940
Killing still running Valkey server 12972
Killing still running Valkey server 13004
Killing still running Valkey server 13036
Killing still running Valkey server 13073
Killing still running Valkey server 14952
Killing still running Valkey server 16631
Killing still running Valkey server 16670
Killing still running Valkey server 16686
Killing still running Valkey server 16702
Killing still running Valkey server 16718
Killing still running Valkey server 16734
Killing still running Valkey server 16753
Killing still running Valkey server 16771
Killing still running Valkey server 17094
Killing still running Valkey server 17112
Killing still running Valkey server 17133
Killing still running Valkey server 17262
Killing still running Valkey server 17278
Killing still running Valkey server 17316
Killing still running Valkey server 17349
Killing still running Valkey server 17379
Killing still running Valkey server 17410
Killing still running Valkey server 17430
Killing still running Valkey server 17443
Killing still running Valkey server 17459
Killing still running Valkey server 17478
Killing still running Valkey server 17491
Killing still running Valkey server 17507
Killing still running Valkey server 17526
Killing still running Valkey server 17545
Killing still running Valkey server 17561
Killing still running Valkey server 17577
Killing still running Valkey server 17716
Killing still running Valkey server 17778
Killing still running Valkey server 17825
Killing still running Valkey server 17848
Killing still running Valkey server 17896
Killing still running Valkey server 17931
Killing still running Valkey server 17953
Killing still running Valkey server 17973
Killing still running Valkey server 17991
Killing still running Valkey server 18010
Killing still running Valkey server 18064
Killing still running Valkey server 18082
Killing still running Valkey server 18134
Killing still running Valkey server 18118
Killing still running Valkey server 18150
Killing still running Valkey server 18166
Killing still running Valkey server 18188
Killing still running Valkey server 18209
```

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
…3192)

When a cluster reset is performed on a replica node, a new shard ID is generated
because the node is about to become an empty primary node, see valkey-io#2283.

However, the log added in valkey-io#2510 caused some confusion. In clusterSetNodeAsPrimary
we print:
```
serverLog(LL_NOTICE, "Reconfiguring node %.40s (%s) as primary for shard %.40s", n->name, humanNodename(n), n->shard_id);
```

In clusterReset, we first call clusterSetNodeAsPrimary and then generate a new
shard ID, which causes us to log the old, incorrect shard ID first.

For example, when a replica node performs a cluster reset, we print:
```
xxx * Cluster reset (user request from 'xxx').
xxx * Reconfiguring node af76a3e0ffcd77bd14fa47ce4d07ab2bdc78702f (xxx) as primary for shard ea528667634af8beed83adac2b9af8360769a1b4
```

But the node shard id is actually:
```
xxx> cluster myshardid
"52ede26d1554dd203161ba09011af14574b2cc84"
```

Now we print a log after a new shard ID is generated, and we also move the
call to clusterSetNodeAsPrimary after the new shard ID is generated, so that
it logs the right one. After this PR:
```
xxx * Cluster reset (user request from 'xxx').
xxx * Moving myself to a new shard bd31870ce73f5977084e6a46e337a4a1ad38fc66.
xxx * Reconfiguring node 1d54b904efd30cd9d7d1abbfd63c8fafbb62e1c8 (xxx) as primary for shard bd31870ce73f5977084e6a46e337a4a1ad38fc66
```

This is part of valkey-io#2989, but I guess we won't merge the extension fix any
time soon, so I am extracting it separately as a log fix (or improvement).

Signed-off-by: Binbin <binloveplay1314@qq.com>
…lkey-io#3091)

## Summary

Fixes valkey-io#2620

Skip loading expired hash fields when a non-preamble RDB is being loaded
on a primary server. Propagate `HDEL` to replicas when expired fields
are skipped.

## Changes

- Updated `rdbLoadObject` signature to accept `rdbflags` and `now`
parameters
- Added logic to skip expired hash fields during RDB load on primary
- Propagate `HDEL` to replicas when `RDBFLAGS_FEED_REPL` is set
- Updated all callers of `rdbLoadObject`
- Added unit test
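
A simplified sketch of the skip-and-propagate idea follows. It is not the `rdbLoadObject` code: `toy_field` and `toy_filter_expired` are hypothetical, and treating a field as expired when its expiry is strictly earlier than `now_ms` is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a hash field read from an RDB file. */
typedef struct {
    const char *name;
    long long expire_at_ms; /* absolute expiry in ms; -1 means no expiry */
} toy_field;

/* Compacts fields in place, keeping those not expired at now_ms.
 * Names of skipped fields are written to expired_out (capacity >= n) so the
 * caller can propagate a deletion (e.g. HDEL) for them.
 * Returns the number of fields kept; *expired_count receives the number
 * of fields skipped. */
size_t toy_filter_expired(toy_field *fields, size_t n, long long now_ms,
                          const char **expired_out, size_t *expired_count) {
    assert(expired_out != NULL && expired_count != NULL);
    size_t kept = 0;
    *expired_count = 0;
    for (size_t i = 0; i < n; i++) {
        /* Assumption: expired means strictly earlier than now_ms. */
        bool expired = fields[i].expire_at_ms != -1 && fields[i].expire_at_ms < now_ms;
        if (expired) {
            expired_out[(*expired_count)++] = fields[i].name;
        } else {
            fields[kept++] = fields[i];
        }
    }
    return kept;
}
```

A caller on a primary would load only the first `kept` fields and, when replication feeding is requested, send one deletion covering the names in `expired_out`.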

---------

Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
…rink rehashing (valkey-io#3175)

During hashtable shrinking, all keys are inserted into ht1.
This PR adds a mechanism to the hashtable: when there are severe hash
collisions with the new ht1 during the shrink rehashing process, the
hashtable will stop shrinking by swapping ht0 and ht1 to avoid excessive
performance degradation of the new hashtable during this period.

In extreme cases, for example, if ht0 is very large and is reduced to
only one entry, and the new ht1 is very small after the resize, the ht0
rehash process is very slow since we have a lot of empty buckets. If a
large number of elements are added into ht1 at this time, it will lead
to severe hash collisions in ht1.

When adding an entry while the hashtable is shrinking, we check the fill
percentage of ht1. If it exceeds MAX_FILL_PERCENT_HARD, we swap ht0 and
ht1, abort the shrinking process, set rehash_idx back to 0, and restart
the rehash.

This PR also added a new debug hashtable-can-abort-shrink subcommand to
control this behavior.
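
The check described above could look roughly like the sketch below. It is a toy model rather than the real hashtable code: the struct layout, the `toy_` names, and the two constants (entries per bucket and the hard fill limit) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration only; not taken from the PR text. */
#define TOY_ENTRIES_PER_BUCKET 7
#define TOY_MAX_FILL_PERCENT_HARD 500

/* Hypothetical, simplified two-table hashtable state. */
typedef struct {
    size_t num_buckets[2]; /* [0] = table being drained, [1] = rehash target */
    size_t used[2];        /* number of entries in each table */
    long rehash_idx;       /* -1 when not rehashing */
} toy_hashtable;

static bool toy_is_shrinking(const toy_hashtable *ht) {
    return ht->rehash_idx >= 0 && ht->num_buckets[1] < ht->num_buckets[0];
}

/* Called on insert. If the shrink target is over the hard fill limit, swap
 * the tables so the large table becomes the rehash target again, and restart
 * rehashing from index 0. Returns true if the shrink was aborted. */
bool toy_maybe_abort_shrink(toy_hashtable *ht) {
    assert(ht != NULL);
    if (!toy_is_shrinking(ht)) return false;
    size_t capacity = ht->num_buckets[1] * TOY_ENTRIES_PER_BUCKET;
    if (ht->used[1] * 100 <= capacity * TOY_MAX_FILL_PERCENT_HARD) return false;

    size_t tmp = ht->num_buckets[0];
    ht->num_buckets[0] = ht->num_buckets[1];
    ht->num_buckets[1] = tmp;
    tmp = ht->used[0];
    ht->used[0] = ht->used[1];
    ht->used[1] = tmp;
    ht->rehash_idx = 0;
    return true;
}
```

After an abort, the small over-full table is the one being drained and the large table receives the entries, so the rehash effectively becomes an expansion.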

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
…y test in crash report (valkey-io#3029)

The memory test was commented out in valkey-io#2858 and should have been
reenabled. On further investigation I found that the server hangs during
shutdown inside the `bioDrainWorker(BIO_LAZY_FREE)` call. This causes a
deadlock because the GIL is held during shutdown, but pending lazy free
jobs require the GIL too:

- main thread: `serverCron()` acquires GIL via `afterSleep()` then calls
`finishShutdown()`, which eventually calls our script module unload code
that calls `bioDrainWorker()`.
- bio threads: Pending lazy free jobs such as `lazyFreeEvalScripts()`
call `scriptingEngineCallFreeFunction()` which requires the GIL.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: harrylin98 <harrylin980107@gmail.com>
harrylin98 force-pushed the blockedInUse branch 2 times, most recently from a6666f2 to 0e42a61 on February 24, 2026 22:19
…tablished psync test (valkey-io#3242)

Fix valkey-io#3231: Increase wait timeout for RDB load log detection in test

Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
rainsupreme and others added 9 commits March 3, 2026 15:56
…ey-io#3294)

The test "Blocking keyspace notification with pipelining hset after
hget" was recently failing intermittently with two different errors:

1. `Expected [expr {114 * 10 < 1114}]` - timing assertion failed under Valgrind
2. `Timeout waiting for blocked clients` - race condition on normal runs

The test used wall-clock timing to verify that hget (non-blocking)
completed faster than hset (blocking). This is unreliable because:
- Valgrind slows execution 10-50x, making timing ratios meaningless
- Fast systems may complete both operations in <10ms, causing ratio failures

This fix replaces timing assertions with blocked client count checks,
which directly verify the blocking mechanism rather than inferring it
from timing. The test now confirms hget's response is available before
hset blocks, then waits for the blocked client count to transition
through the expected states.

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Closes valkey-io#3077

### Overview 
A URI in the SAN is used to represent client identities in modern mTLS
deployments where the CN may be empty or deprecated. See the comment on
valkey-io#3077 for more details.
When `tls-auth-clients-user URI` is configured, during the TLS
handshake, the server iterates through the URIs in the client
certificate and authenticates the client as the first enabled user whose
name matches one of those URIs.

### Implementation
- Introduced a new value `URI` for `tls-auth-clients-user`
- Added new function `getCertSanUri` that:
  - Extracts URI entries from the certificate's SAN extension
  - Checks each URI against existing Valkey users
  - Returns the first URI that matches an enabled user
- Renamed `getCertFieldByName` → `getCertSubjectFieldByName` for clarity
- Modified `tlsGetPeerUsername` to support both CN and URI
authentication modes
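
The matching rule (the first certificate URI that names an enabled user wins) can be sketched without any TLS library. The `toy_user` type and `toy_first_matching_uri` function are hypothetical and stand in for the real certificate parsing and ACL lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified user record; not the Valkey ACL structure. */
typedef struct {
    const char *name;
    bool enabled;
} toy_user;

/* Walks the URIs in certificate order and returns the first one whose name
 * matches an enabled user, or NULL if no such user exists. */
const char *toy_first_matching_uri(const char *const *uris, size_t num_uris,
                                   const toy_user *users, size_t num_users) {
    assert(uris != NULL && users != NULL);
    for (size_t i = 0; i < num_uris; i++) {
        for (size_t j = 0; j < num_users; j++) {
            if (users[j].enabled && strcmp(uris[i], users[j].name) == 0) {
                return uris[i];
            }
        }
    }
    return NULL;
}
```

A `NULL` result corresponds to the case where the client is not auto-authenticated and stays on the default user.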

### Example behavior
Common setup
``` 
# client certificate X509v3 Subject Alternative Name
URI:urn:valkey:user:first, URI:urn:valkey:user:second

# valkey.conf
tls-auth-clients-user URI
hide-user-data-from-log no
```
Use case 1: multiple enabled users
```
user urn:valkey:user:first on >clientpass allcommands allkeys
user urn:valkey:user:second on >clientpass allcommands allkeys

39762:M 26 Jan 2026 22:06:25.122 - TLS: Auto-authenticated client as urn:valkey:user:first
```
Use case 2: first URI disabled, second enabled
```
user urn:valkey:user:first off >clientpass allcommands allkeys
user urn:valkey:user:second on >clientpass allcommands allkeys

39792:M 26 Jan 2026 22:07:08.006 - TLS: Auto-authenticated client as urn:valkey:user:second
```
Use case 3: all matching users disabled or no matching user
```
user urn:valkey:user:first off >clientpass allcommands allkeys
user urn:valkey:user:second off >clientpass allcommands allkeys

39812:M 26 Jan 2026 22:07:34.174 * TLS: No matching user found in certificate SAN URI fields
127.0.0.1:6379> acl whoami
"default"
127.0.0.1:6379> acl log
1)  1) "count"
    2) (integer) 1
    3) "reason"
    4) "tls-cert"
    5) "context"
    6) "toplevel"
    7) "object"
    8) ""
    9) "username"
   10) "urn:valkey:user:second"
   11) "age-seconds"
   12) "17.381"
   13) "client-info"
   14) "id=3 addr=127.0.0.1:57236 laddr=127.0.0.1:6379 fd=8 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=17024 events=r cmd=NULL user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=0 tot-net-out=0 tot-cmds=0"
   15) "entry-id"
   16) (integer) 0
   17) "timestamp-created"
   18) (integer) 1771963041866
   19) "timestamp-last-updated"
   20) (integer) 1771963041866
127.0.0.1:6379>
```

---------

Signed-off-by: Yang Zhao <zymy701@gmail.com>
…o#3310)

The CodeQL workflow is currently throwing a deprecation warning
regarding use of v3.

> CodeQL Action v3 will be deprecated in December 2026. Please update
all occurrences of the CodeQL Action in your workflow files to v4.

This PR introduces the following changes:
* References to CodeQL v3 have been updated to the SHA of the latest
CodeQL release, [v4.32.5].

Signed-off-by: Kurt McKee <contactme@kurtmckee.org>
…r daily tests (valkey-io#3303)

`SSL_get0_peer_certificate()` was introduced in OpenSSL 3.0. The recent
commit 7e110ae (Support TLS authentication using SAN URI) used it in
`tlsGetPeerUser()` without a version guard, breaking builds against
OpenSSL 1.1.x.

Use `SSL_get_peer_certificate()` on OpenSSL < 3.0 with the corresponding
`X509_free()` since the older API increments the reference count.

Fixes build failure: implicit declaration of function
`SSL_get0_peer_certificate` [-Werror=implicit-function-declaration]

Also fixes the version mismatch for almalinux 9 daily tests.
Closes valkey-io#3304.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
As part of valkey-io#2508 we introduced a defrag optimization to avoid
matching the replaced defrag entry.
Even though defrag replacements do not impact the skip list order,
when scores are equal we MUST compare elements lexicographically to
maintain correct skip list ordering.
Otherwise, we might fail to locate the entry.
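
A minimal sketch of that ordering rule is shown below. `toy_entry` and `toy_entry_cmp` are hypothetical, and `strcmp` is used here only for simplicity; it is an assumption that stands in for whatever binary-safe string comparison the real skip list uses.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical skip list entry for illustration. */
typedef struct {
    double score;
    const char *ele;
} toy_entry;

/* Orders by score first; ties are broken lexicographically by element.
 * Returns -1, 0 or 1. */
int toy_entry_cmp(const toy_entry *a, const toy_entry *b) {
    assert(a != NULL && b != NULL);
    if (a->score < b->score) return -1;
    if (a->score > b->score) return 1;
    int c = strcmp(a->ele, b->ele);
    return (c > 0) - (c < 0);
}
```

Comparing by score alone would treat every entry with an equal score as a match, so a search could walk past or stop before the entry it is looking for.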

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
This PR fixes a Codecov workflow misconfiguration introduced when
upgrading codecov/codecov-action from v4 to v5 (in valkey-io#3185).
In v5, the action expects `files` (plural), but the workflow still used
`file`.

The coverage shown is 0 right now:
https://app.codecov.io/gh/valkey-io/valkey


Documentation from -
https://github.com/codecov/codecov-action/tree/v5?tab=readme-ov-file#arguments

```
The following arguments have been changed

    file (this has been deprecated in favor of files)
    plugin (this has been deprecated in favor of plugins)

```

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Before we added LUA as a module, we had logic to NOT register the
luaHook when the `busy-reply-threshold` config was set to 0.

Now we ALWAYS register the hook, and to stay aligned with the
old behavior we let the script continue executing from
the interrupt hook when `busy-reply-threshold` is set to 0.

Also, valkey.conf says the config can be negative, but in
fact its minimum is 0, so fix valkey.conf as well.
```
# The default is 5 seconds. It is possible to set it to 0 or a negative value
# to disable this mechanism (uninterrupted execution)
```

It was introduced in valkey-io#2858.

Signed-off-by: Alon Arenberg <alonare@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
In valkey-io#3260 we tried to deflake it by using an assert_range, but the upper
limit of the range is too low, so the test is still flaky. Increase the upper
limit to 200ms more than the expected latency, e.g. from 550 to 700 when
the expected latency is 500.

```
*** [err]: LATENCY GRAPH can output the event graph in tests/unit/latency-monitor.tcl
Expected '625' to be between to '450' and '550' (context: type source line 143 file /Users/runner/work/valkey/valkey/tests/unit/latency-monitor.tcl cmd {assert_range $high 450 550} proc ::test)
```

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
bjosv and others added 3 commits March 6, 2026 12:37
…io#3317)

Since there is a mismatch between the `ar` tool preinstalled on
a macOS runner
and Clang 22 installed by brew, let's use the brew-installed `llvm-ar`.

Expected to fix the issue in CI job `build-macos-latest`.

---------

Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
Replace integer mono_ticksPerMicrosecond with fixed-point arithmetic in
x86 TSC monotonic clock for better accuracy (and performance because of
multiplication + shift instead of division).
Calibration now uses double precision to compute multiplier.

**Comparison with TSC @ 2400.5 ticks/us**

**Old Method:**
Calibration: 2400 (truncated from 2400.5)
Converting 1 second of ticks: 2,400,500,000 / 2400 = 1,000,208 us
Error: +208 us per second

**New Method:**
Calibration: sample_ticks_per_us = 2400.5 (double stores exactly in this
case)
Multiplier: 2^24 / 2400.5 = 6989.051 -> stored as 6989 in uint64_t
Converting 1 second: (2,400,500,000 * 6989) >> 24 = 999,992 us
Error: -8 us per second
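
The comparison can be reproduced with a small sketch. The function names and the `FIXED_POINT_SHIFT` macro are hypothetical and only mirror the arithmetic in the example; this is not the code from the patch. The sketch also assumes `ticks * mult` fits in 64 bits, which holds for the one-second sample here but not for arbitrarily large tick counts.

```c
#include <stdint.h>

/* Assumed shift, matching the 2^24 used in the worked example. */
#define FIXED_POINT_SHIFT 24

/* Old method: integer ticks-per-microsecond, fraction already truncated. */
uint64_t ticks_to_us_div(uint64_t ticks, uint64_t ticks_per_us) {
    return ticks / ticks_per_us;
}

/* New method, calibration: derive a fixed-point multiplier from the
 * double precision tick rate. The cast truncates toward zero. */
uint64_t make_multiplier(double ticks_per_us) {
    return (uint64_t)((double)(1ULL << FIXED_POINT_SHIFT) / ticks_per_us);
}

/* New method, conversion: multiply and shift instead of dividing.
 * Only valid while ticks * mult does not overflow uint64_t. */
uint64_t ticks_to_us_mul(uint64_t ticks, uint64_t mult) {
    return (ticks * mult) >> FIXED_POINT_SHIFT;
}
```

With 2,400,500,000 ticks, `ticks_to_us_div(ticks, 2400)` gives 1,000,208 and `ticks_to_us_mul(ticks, make_multiplier(2400.5))` gives 999,992, matching the figures above.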

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
@harrylin98 harrylin98 force-pushed the blockedInUse branch 2 times, most recently from dc1510e to 829f5a6 Compare March 7, 2026 04:48
enjoy-binbin and others added 3 commits March 8, 2026 00:01
Currently, after changing cluster-replica-no-failover, we
call clusterUpdateMyselfFlags to trigger CLUSTER_TODO_SAVE_CONFIG.
But in gossip, when the sender's NOFAILOVER flag changes, we don't
trigger the save, which causes nodes.conf to not save the latest
nofailover flag.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Implemented CLUSTERSCAN command for topology-aware scanning

Unlike `SCAN` which is local to a single node, `CLUSTERSCAN` provides a
mechanism that helps clients iterate across slot boundaries and handles
`MOVED` redirections.

**Key details**

* Global cluster iteration via `fingerprint-{hashtag}-cursor`
* Scan one slot at a time
* Start the CLUSTERSCAN with 0
* SLOT argument for parallel scanning of multiple slots
* Re-use scanGenericCommand for the response

**Cursor format:** `fingerprint-{hashtag}-localcursor`
 - Fingerprint is a hash of the node's DB seed that identifies the
   current memory layout. On mismatch, scan restarts from cursor 0
   rather than returning an error.
 - Fingerprint 0 indicates a cross slot cursor (e.g., initial cursor
   or slot transition) where validation is skipped.
 - Hashtag encodes the target slot
 - Local cursor tracks position within the slot

**Usage:**

```
CLUSTERSCAN <cursor> [MATCH pattern] [COUNT count] [TYPE type] [SLOT number]
```

```
  CLUSTERSCAN 0                    # Start scanning from slot 0
  CLUSTERSCAN <cursor>             # Continue from cursor
  CLUSTERSCAN 0 SLOT 1000          # Start scanning specific slot
  CLUSTERSCAN <cursor> MATCH user:* COUNT 100
```
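
To illustrate the cursor layout, here is a rough parsing sketch. It rests on several assumptions that the description does not confirm: that the fingerprint and local cursor are unsigned decimal integers, that the hashtag is wrapped in literal braces and is shorter than 16 bytes, and that "restart from cursor 0" refers to the local cursor. `cluster_cursor_t`, `parse_cluster_cursor` and `resume_cursor` are hypothetical names, not part of the patch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint64_t fingerprint;  /* 0 means cross slot cursor: validation skipped */
    char hashtag[16];      /* text between the braces, encodes the target slot */
    uint64_t local_cursor; /* position within the slot */
} cluster_cursor_t;

/* Returns 1 on success, 0 if the string does not match the assumed layout. */
int parse_cluster_cursor(const char *s, cluster_cursor_t *out) {
    /* A bare "0" is the initial cursor that starts the scan. */
    if (strcmp(s, "0") == 0) {
        out->fingerprint = 0;
        out->hashtag[0] = '\0';
        out->local_cursor = 0;
        return 1;
    }
    int consumed = 0;
    int n = sscanf(s, "%" SCNu64 "-{%15[^}]}-%" SCNu64 "%n",
                   &out->fingerprint, out->hashtag, &out->local_cursor,
                   &consumed);
    /* All three fields must match and nothing may trail the cursor. */
    return n == 3 && s[consumed] == '\0';
}

/* Picks the local cursor to resume from: a zero fingerprint skips
 * validation, a mismatch restarts at 0 instead of failing. */
uint64_t resume_cursor(const cluster_cursor_t *c, uint64_t node_fingerprint) {
    if (c->fingerprint == 0) return c->local_cursor;
    if (c->fingerprint != node_fingerprint) return 0;
    return c->local_cursor;
}
```

Restarting on a fingerprint mismatch trades some repeated keys for never surfacing an error to the client when the memory layout changed between calls.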

---------

Signed-off-by: nmvk <r@nmvk.com>
Signed-off-by: Raghav <r@nmvk.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Before this change, if you built `valkey-server` (e.g. `make
valkey-server`) the LUA module was not built. With this change, the LUA
module is now a direct dependency of `valkey-server`
(unless `BUILD_LUA=no` is passed).

Added some colors to the Lua module & Lua lib `Makefile`s so they blend
nicely in the build output.

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
curious-george-rk and others added 2 commits March 9, 2026 20:03
…time_t (valkey-io#3252)

Addresses issue valkey-io#2350

As noted in the issue, much of the expire code uses raw long long for
timestamps, which provides no semantic meaning about the unit or purpose
of the value. Valkey already defines mstime_t (milliseconds) and
ustime_t (microseconds) typedefs — this PR replaces bare long long
declarations with the appropriate typedef wherever the value represents
an expiration timestamp or time duration.

This PR only fixes a small subset in the codebase, but it is an
incremental step toward fully replacing the bare long long references.
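
As an illustration of the kind of change involved, consider the sketch below. The two typedefs are re-declared so the snippet is self-contained; `key_is_expired` and `us_to_ms` are hypothetical helpers, not functions from the Valkey codebase, and treating a negative timestamp as "no expiration" is an assumption of this sketch.

```c
typedef long long mstime_t; /* millisecond time type */
typedef long long ustime_t; /* microsecond time type */

/* Before: int key_is_expired(long long when, long long now);
 * After: the signature itself documents that both values are
 * millisecond timestamps. */
int key_is_expired(mstime_t expire_at, mstime_t now) {
    /* Negative expire_at is assumed to mean "never expires". */
    return expire_at >= 0 && now >= expire_at;
}

/* The unit conversion is visible in the types, not only in a comment. */
mstime_t us_to_ms(ustime_t us) {
    return us / 1000;
}
```

Because both typedefs alias `long long`, the compiler does not enforce the distinction; the benefit is readability at call sites and in signatures.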

---------

Signed-off-by: curious-george-rk <r.ebu@gmail.com>
Co-authored-by: curious-george-rk <r.ebu@gmail.com>
Signed-off-by: harrylin98 <harrylin980107@gmail.com>
@harrylin98 harrylin98 force-pushed the blockedInUse branch 2 times, most recently from c5682fb to a48c3dd Compare March 9, 2026 19:22
roshkhatri and others added 3 commits March 9, 2026 22:48
Now we will be able to add a `run-cluster-benchmark` label to run a
benchmark against a cluster-mode-enabled valkey-server.

It will use the config
https://github.com/valkey-io/valkey/blob/unstable/.github/benchmark_configs/benchmark-config-arm.json
modified for cluster mode, with a single cluster-mode-enabled instance of
valkey.

It uses the same single instance for the benchmark as for run-benchmark.

If both labels are used, they run sequentially in the same concurrency group `group:
ec2-al-2023-pr-benchmarking-arm64`.

---------

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
…ey-io#2746)

This PR introduces a new config `cluster-message-gossip-perc` which
allows an operator to modify the amount of gossip node information
sent per ping/pong/meet message. It can be modified dynamically (related
to valkey-io#2291). The default value
is 10%, i.e. information about 10% of peer nodes is gossiped along with
each ping/pong/meet packet. Users can tune this configuration: setting
the value higher allows faster information dissemination, whereas setting
it lower leads to direct PING messages if no information was
received about a node within the `server.cluster_node_timeout/2` period.

Note: the behavior for partially failed gossip nodes still remains
intact where all the `pfail` nodes are part of the message for faster
propagation of information and faster transition of `PFAIL` to `FAIL`.
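
A rough sketch of how such a percentage could translate into a per-message entry count is shown here. `gossip_entries_wanted` is a hypothetical helper, and both the minimum of 3 entries and the exclusion of the sender and receiver from the candidate set are assumptions of this sketch rather than details stated in this PR.

```c
#include <stddef.h>

/* Hypothetical helper: how many gossip entries to attach to one
 * ping/pong/meet message in a cluster of node_count nodes. */
size_t gossip_entries_wanted(size_t node_count, unsigned perc) {
    /* Candidates exclude the sender and the receiver (assumption). */
    size_t candidates = node_count > 2 ? node_count - 2 : 0;

    size_t wanted = node_count * perc / 100;
    if (wanted < 3) wanted = 3;                   /* assumed floor */
    if (wanted > candidates) wanted = candidates; /* cannot exceed candidates */
    return wanted;
}
```

In this sketch a 100-node cluster sends 10 entries per message at the default 10% and 50 entries at 50%. Entries for `pfail` nodes would be added on top of this count, per the note above.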

---------

Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Signed-off-by: harrylin98 <harrylin980107@gmail.com>
harrylin98 and others added 3 commits March 9, 2026 15:29
…o#3336)

This fixes a flaky dual-channel replication integration test:
https://github.com/valkey-io/valkey/actions/runs/22810251608/job/66165776198#step:8:7701

`INFO memory` field `used_memory_overhead` and `MEMORY STATS` field
`overhead.total` can change during dual-channel sync if the replica's
pending replication buffer is still changing. This is probably more
visible in slower environments.

The test now collects `INFO` and `MEMORY STATS` in a single `MULTI/EXEC`
on both the primary and replica, so the compared values come from the
same snapshot.

Passing here:
https://github.com/sarthakaggarwal97/valkey/actions/runs/22864585326/job/66327772967

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>