Skip to content

feat: tasks 9-12 — mndSnapSend module, vnodeGetSnapSendProgress, dnode handler, mndMain/Show registration#35306

Merged
guanshengliang merged 11 commits into
3.3.6from
feature/snap-send-progress
May 13, 2026
Merged

feat: tasks 9-12 — mndSnapSend module, vnodeGetSnapSendProgress, dnode handler, mndMain/Show registration#35306
guanshengliang merged 11 commits into
3.3.6from
feature/snap-send-progress

Conversation

@xiao-77
Copy link
Copy Markdown
Contributor

@xiao-77 xiao-77 commented May 11, 2026

  • mndSnapSend.h/c: mnode snap-send progress cache (pHash), timer pullup, retrieve handlers for ins_snap_send_vnodes/ins_snap_send_filesets
  • vnodeSnapshot.c: vnodeGetSnapSendProgress() copies tsdb progress under read-lock
  • vnodeInt.h: declare vnodeGetSnapSendProgress()
  • vmHandle.c: vmProcessDnodeQuerySnapSendProgressReq() iterates vnodes, builds SDnodeQuerySnapSendProgressRsp, registers TDMT_DND_QUERY_SNAP_SEND_PROGRESS
  • vmWorker.c: dispatch TDMT_DND_QUERY_SNAP_SEND_PROGRESS to mgmt queue
  • mmHandle.c: route TDMT_DND_QUERY_SNAP_SEND_PROGRESS_RSP to mnode read queue
  • mndMain.c: register mndInitSnapSend/mndCleanupSnapSend step; call mndSnapSendPullup in timer
  • mndShow.c: route ins_snap_send_vnodes/ins_snap_send_filesets table names to enum types
  • syncMain.c: move syncSnapshotSending() out of BUILD_NO_CALL guard so it links

Description

Issue(s)

  • Close/close/Fix/fix/Resolve/resolve: Issue Link

Checklist

Please check the items in the checklist if applicable.

  • Is the user manual updated?
  • Are the test cases passed and automated?
  • Is there no significant decrease in test coverage?

xiao-77 added 3 commits May 10, 2026 12:39
Tasks 1-4: define SSnapSendFileSetStat/SSnapSendVnodeStat structs in tsdb.h,
add pSnapStat+snapStatLock to STsdb (init/destroy in tsdbOpen/Close),
instrument tsdbSnapshot.c (ROW path) and tsdbSnapshotRAW.c (RAW path)
to track totalSize, sentSize, startTime and finishedFileSets.
…e handler, mndMain/Show registration

- mndSnapSend.h/c: mnode snap-send progress cache (pHash), timer pullup,
  retrieve handlers for ins_snap_send_vnodes/ins_snap_send_filesets
- vnodeSnapshot.c: vnodeGetSnapSendProgress() copies tsdb progress under read-lock
- vnodeInt.h: declare vnodeGetSnapSendProgress()
- vmHandle.c: vmProcessDnodeQuerySnapSendProgressReq() iterates vnodes,
  builds SDnodeQuerySnapSendProgressRsp, registers TDMT_DND_QUERY_SNAP_SEND_PROGRESS
- vmWorker.c: dispatch TDMT_DND_QUERY_SNAP_SEND_PROGRESS to mgmt queue
- mmHandle.c: route TDMT_DND_QUERY_SNAP_SEND_PROGRESS_RSP to mnode read queue
- mndMain.c: register mndInitSnapSend/mndCleanupSnapSend step; call mndSnapSendPullup in timer
- mndShow.c: route ins_snap_send_vnodes/ins_snap_send_filesets table names to enum types
- syncMain.c: move syncSnapshotSending() out of BUILD_NO_CALL guard so it links
Copilot AI review requested due to automatic review settings May 11, 2026 01:47
@xiao-77 xiao-77 requested review from a team, dapan1121 and guanshengliang as code owners May 11, 2026 01:47
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a monitoring system for snapshot transfer progress, adding system tables ins_snap_send_vnodes and ins_snap_send_filesets. The implementation involves new message types for mnode-dnode communication, internal TSDB statistics for tracking fileset transfers, and updates to vgroup state management. Review feedback highlights a critical heap corruption risk in the dnode response handler and potential memory leaks in the TSDB layer. Further improvements were suggested to optimize hash iteration efficiency, handle uninitialized timestamps in elapsed time calculations, and replace magic numbers with defined constants.

Comment thread source/dnode/mgmt/mgmt_vnode/src/vmHandle.c Outdated
Comment thread source/dnode/vnode/src/tsdb/tsdbSnapshot.c
Comment thread source/dnode/mnode/impl/src/mndSnapSend.c Outdated
Comment thread source/dnode/mnode/impl/src/mndSnapSend.c
Comment thread source/dnode/mnode/impl/src/mndSnapSend.c Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a “snapshot-send progress” reporting pipeline from tsdb/vnode → dnode mgmt RPC → mnode cache, and exposes it via new ins_snap_send_vnodes / ins_snap_send_filesets system tables.

Changes:

  • Track snapshot-send progress inside the tsdb snapshot readers (fileset totals/sent bytes/timestamps) and expose it via vnodeGetSnapSendProgress().
  • Add a new dnode RPC (TDMT_DND_QUERY_SNAP_SEND_PROGRESS) plus mnode-side pullup + in-memory cache + show-table retrieval handlers.
  • Extend status heartbeat serialization to include SVnodeLoad.snapshotSending, and wire it through mnode vgroup state (snapRestoring) to decide which vgroups to query.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
source/libs/sync/src/syncMain.c Moves syncSnapshotSending() out of BUILD_NO_CALL guard; adjusts snapshot send/recv helpers.
source/dnode/vnode/src/vnd/vnodeSnapshot.c Adds vnodeGetSnapSendProgress() to copy tsdb-layer progress under lock.
source/dnode/vnode/src/vnd/vnodeQuery.c Reports snapshotSending in SVnodeLoad using syncSnapshotSending().
source/dnode/vnode/src/tsdb/tsdbSnapshotRAW.c Builds/updates/clears per-fileset progress stats for RAW snapshot reads.
source/dnode/vnode/src/tsdb/tsdbSnapshot.c Builds/updates/clears per-fileset progress stats for ranged snapshot reads.
source/dnode/vnode/src/tsdb/tsdbOpen.c Initializes/destroys snapStatLock in STsdb.
source/dnode/vnode/src/inc/vnodeInt.h Declares vnodeGetSnapSendProgress().
source/dnode/vnode/src/inc/tsdb.h Adds internal structs and fields (pSnapStat, snapStatLock) for progress tracking.
source/dnode/mnode/impl/src/mndVgroup.c Bumps vgroup raw version and adds snapRestoring encode/decode/update behavior.
source/dnode/mnode/impl/src/mndSnapSend.c New mnode module: pullup querying, response handling, and systable retrieval.
source/dnode/mnode/impl/src/mndShow.c Routes new ins_snap_send_* names to show retrieve types.
source/dnode/mnode/impl/src/mndMain.c Registers snap-send module init/cleanup and triggers pullup from the timer.
source/dnode/mnode/impl/src/mndDnode.c Populates snapRestoring from status heartbeat snapshotSending.
source/dnode/mnode/impl/inc/mndSnapSend.h New header for snap-send module init/cleanup/pullup.
source/dnode/mnode/impl/inc/mndDef.h Adds snapRestoring to SVgObj (ephemeral snapshot-send flag).
source/dnode/mgmt/mgmt_vnode/src/vmWorker.c Dispatches TDMT_DND_QUERY_SNAP_SEND_PROGRESS into the vnode mgmt queue.
source/dnode/mgmt/mgmt_vnode/src/vmHandle.c Implements vmProcessDnodeQuerySnapSendProgressReq() and registers the handler.
source/dnode/mgmt/mgmt_mnode/src/mmHandle.c Routes snap-send progress RSP into the mnode read queue.
source/common/src/systable.c Adds schema/meta entries for ins_snap_send_vnodes and ins_snap_send_filesets.
source/common/src/msg/tmsg.c Extends SStatusReq ser/de and adds snap-send progress RSP ser/de/free.
include/common/tmsgdef.h Adds new message type TDMT_DND_QUERY_SNAP_SEND_PROGRESS.
include/common/tmsg.h Adds new show table enums, SVnodeLoad.snapshotSending, and snap-send progress structs.
include/common/systable.h Adds systable name macros for ins_snap_send_vnodes/filesets.
.claude/settings.local.json Adds a local tool settings file (permissions for gh pr *).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1059 to +1062
rsp.pVnodeInfos = (rsp.numOfVnodes > 0)
? (SSnapSendVnodeInfo *)taosArrayGet(pInfoArray, 0)
: NULL;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: vmHandle.c now uses manual cleanup — it iterates pInfoArray to free each pFileSetInfos, then destroys the array directly. tFreeSDnodeQuerySnapSendProgressRsp is never called on the sending side (only appropriate for deserialized responses).

Comment on lines +8319 to +8326
void tFreeSDnodeQuerySnapSendProgressRsp(SDnodeQuerySnapSendProgressRsp *pRsp) {
if (pRsp == NULL) return;
if (pRsp->pVnodeInfos != NULL) {
for (int32_t i = 0; i < pRsp->numOfVnodes; i++) {
taosMemoryFreeClear(pRsp->pVnodeInfos[i].pFileSetInfos);
}
taosMemoryFreeClear(pRsp->pVnodeInfos);
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — the ownership rule is now documented in both the serializer's header comment and the vmHandle.c code comment: tFreeSDnodeQuerySnapSendProgressRsp is only for deserialized responses (caller owns a heap-allocated pVnodeInfos). The sending side never calls it; it manually frees each nested pFileSetInfos and destroys the array.

Comment on lines +114 to +123
// Evict old entry (free its pFileSetInfos)
SSnapSendVnodeInfo *pOld = taosHashGet(gSnapSendMgmt.pHash, &pSrc->vgId, sizeof(int32_t));
if (pOld) snapSendFreeVnodeInfo(pOld);

// Upsert
if (taosHashPut(gSnapSendMgmt.pHash, &pSrc->vgId, sizeof(int32_t), &copy, sizeof(copy)) != 0) {
taosMemoryFree(pFsCopy);
code = TSDB_CODE_OUT_OF_MEMORY;
break;
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: the upsert path now saves pOld->pFileSetInfos to pOldFs before calling taosHashPut, then frees pOldFs only after a successful put. If taosHashPut fails, the existing hash entry retains its original (valid) pFileSetInfos pointer.

Comment thread source/dnode/mnode/impl/src/mndVgroup.c Outdated
pNew->compStorage = pOld->compStorage;
pNew->pointsWritten = pOld->pointsWritten;
pNew->compact = pOld->compact;
pOld->snapRestoring = pNew->snapRestoring;
Comment on lines +949 to +951
* 0 — progress data available, *pInfo populated (caller
* must taosMemoryFree(pInfo->pFileSetInfos))
* TSDB_CODE_NOT_FOUND — no active snapshot send on this vnode
Comment on lines 119 to 126
tsdbCloseFS(&(*pTsdb)->pFS);
tsdbCloseCache(*pTsdb);
#ifdef TD_ENTERPRISE
tsdbCloseCompMonitor(*pTsdb);
#endif
(void)taosThreadMutexDestroy(&(*pTsdb)->mutex);
(void)taosThreadRwlockDestroy(&(*pTsdb)->snapStatLock);
taosMemoryFreeClear(*pTsdb);
Comment thread source/libs/sync/src/syncMain.c Outdated
Comment on lines +588 to +599
#ifdef BUILD_NO_CALL
bool syncSnapshotSending(int64_t rid) {
bool syncSnapshotRecving(int64_t rid) {
SSyncNode* pSyncNode = syncNodeAcquire(rid);
if (pSyncNode == NULL) {
return false;
}

bool b = syncNodeSnapshotSending(pSyncNode);
bool b = syncNodeSnapshotRecving(pSyncNode);
syncNodeRelease(pSyncNode);
return b;
}
#endif
Comment on lines +359 to +363
// transfer_type: "raw" or "row"
const char *typeStr = (pFs->transferType == 14) ? "raw" : "row";
char varType[8 + VARSTR_HEADER_SIZE];
STR_TO_VARSTR(varType, typeStr);
pColInfo = taosArrayGet(pBlock->pDataBlock, cols++);
Comment thread .claude/settings.local.json Outdated
Comment on lines +3 to +5
"allow": [
"Bash(gh pr *)"
]
xiao-77 added 2 commits May 11, 2026 14:01
vmHandle.c — vmProcessDnodeQuerySnapSendProgressReq:
  rsp.pVnodeInfos aliased pInfoArray's internal buffer.
  Calling tFreeSDnodeQuerySnapSendProgressRsp freed that buffer,
  then taosArrayDestroy freed it again (double-free / heap corruption).
  Fix: consolidate cleanup into a single _exit block that manually
  frees each element's pFileSetInfos, then destroys pInfoArray.
  Never call tFreeSDnodeQuerySnapSendProgressRsp on a non-owned buffer.

mndSnapSend.c — mndSnapSendPullup stale-entry removal:
  taosHashRemove inside taosHashIterate then restarting the iterator
  with NULL causes iterator corruption (re-visits / skips entries).
  Fix: collect stale vgIds into a temporary SArray while iterating,
  then remove all of them after the walk completes.
…snap-send-progress pullup interval config; fix hash-put race in mndSnapSend; add information_schema test cases

- SSnapSendFileSetStat/SSnapSendFileSetInfo: rename sttCount→fileCount
  (now counts all physical files: HEAD/DATA/SMA/STT); add finishedFileCount
  field; rename sentSize→readSize for clarity
- tsdbSnapshot.c / tsdbSnapshotRAW.c: count data files in fileCount loop;
  set finishedFileCount=fileCount on RangeEnd; track RAW finishedFileCount
  via dataIter->idx per chunk; rename sentSize→readSize accumulation
- mndSnapSend.c: fix race: save pOldFs, do taosHashPut first, then free
  pOldFs — prevents NULL pFileSetInfos with fileSetCount>0 on put failure;
  use tsSnapSendPullupInterval instead of hard-coded 10s
- tglobal.h/c: add tsSnapSendPullupInterval config (default 10s, range 1-10000)
- mndMain.c: separate snapSendPullup from compactPullup, use own interval
- syncSnapshot.c: add 30s debug sleep in snapshotSenderStart for testing
- tests/system-test/0-others/information_schema.py: add ins_snap_send_check()
  verifying both tables exist, have correct columns, and return 0 rows when
  no snapshot is active
@xiao-77 xiao-77 requested a review from zitsen as a code owner May 12, 2026 09:45
xiao-77 added 3 commits May 12, 2026 17:51
…ing guide

- zh/en 22-meta.md: add INS_SNAP_SEND_VNODES and INS_SNAP_SEND_FILESETS
  table definitions with all 6 / 11 columns, types and descriptions
- zh/en 04-maintenance.md: add '监控 Snapshot 发送进度 / Monitoring Snapshot
  Send Progress' section with example queries for vnode-level progress,
  fileset-level drill-down, detecting stalled transfers, and usage notes
  (pull interval, elapsed calculation, RAW vs ROW read_size semantics)
Copilot AI review requested due to automatic review settings May 12, 2026 10:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 17 comments.

Comment thread source/dnode/mnode/impl/src/mndVgroup.c Outdated
pNew->compStorage = pOld->compStorage;
pNew->pointsWritten = pOld->pointsWritten;
pNew->compact = pOld->compact;
pOld->snapRestoring = pNew->snapRestoring;
Comment thread source/libs/sync/src/syncMain.c Outdated
Comment on lines 588 to 592
#ifdef BUILD_NO_CALL
bool syncSnapshotSending(int64_t rid) {
bool syncSnapshotRecving(int64_t rid) {
SSyncNode* pSyncNode = syncNodeAcquire(rid);
if (pSyncNode == NULL) {
return false;
Comment on lines 402 to +404
SCompMonitor *pCompMonitor;
SSnapSendVnodeStat *pSnapStat; // NULL when no active snapshot send
TdThreadRwlock snapStatLock; // protects pSnapStat (readers: mnode query; writer: snapshot reader)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TDengine's sync module serializes snapshot sending: a vnode can only be sending a snapshot to one follower at a time (enforced by the SSyncSender state machine). Therefore, there is at most one active tsdbSnapReader/tsdbSnapRAWReader per STsdb at any given time, making tsdb->pSnapStat a safe single pointer. We agree this assumption should be documented with a comment in the code.

Comment on lines +526 to +528
(void)taosThreadRwlockWrlock(&tsdb->snapStatLock);
tsdb->pSnapStat = pStat;
(void)taosThreadRwlockUnlock(&tsdb->snapStatLock);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TDengine's sync module serializes snapshot sending: a vnode can only be sending a snapshot to one follower at a time (enforced by the SSyncSender state machine). Therefore, there is at most one active tsdbSnapReader/tsdbSnapRAWReader per STsdb at any given time, making tsdb->pSnapStat a safe single pointer. We agree this assumption should be documented with a comment in the code.

Comment on lines +574 to +581
// clear progress stat
(void)taosThreadRwlockWrlock(&tsdb->snapStatLock);
if (tsdb->pSnapStat != NULL) {
taosMemoryFree(tsdb->pSnapStat->pFileSetStats);
taosMemoryFree(tsdb->pSnapStat);
tsdb->pSnapStat = NULL;
}
(void)taosThreadRwlockUnlock(&tsdb->snapStatLock);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TDengine's sync module serializes snapshot sending: a vnode can only be sending a snapshot to one follower at a time (enforced by the SSyncSender state machine). Therefore, there is at most one active tsdbSnapReader/tsdbSnapRAWReader per STsdb at any given time, making tsdb->pSnapStat a safe single pointer. We agree this assumption should be documented with a comment in the code.

Comment on lines +382 to +386

// transfer_type: "raw" or "row"
const char *typeStr = (pFs->transferType == 14) ? "raw" : "row";
char varType[8 + VARSTR_HEADER_SIZE];
STR_TO_VARSTR(varType, typeStr);
Comment on lines +438 to +445
static const SSysDbTableSchema snapSendVnodesSchema[] = {
{.name = "vgroup_id", .bytes = 4, .type = TSDB_DATA_TYPE_INT, .sysInfo = false},
{.name = "dnode_id", .bytes = 4, .type = TSDB_DATA_TYPE_INT, .sysInfo = false},
{.name = "total_file_sets", .bytes = 4, .type = TSDB_DATA_TYPE_INT, .sysInfo = false},
{.name = "finished_file_sets", .bytes = 4, .type = TSDB_DATA_TYPE_INT, .sysInfo = false},
{.name = "start_time", .bytes = 8, .type = TSDB_DATA_TYPE_TIMESTAMP, .sysInfo = false},
{.name = "elapsed", .bytes = 16 + VARSTR_HEADER_SIZE, .type = TSDB_DATA_TYPE_VARCHAR, .sysInfo = false},
};
Comment on lines +289 to +295
// elapsed (VARCHAR "HH:MM:SS")
snapSendFmtElapsed(pInfo->startTime, elapsedBuf, sizeof(elapsedBuf));
char varElapsed[32 + VARSTR_HEADER_SIZE];
STR_TO_VARSTR(varElapsed, elapsedBuf);
pColInfo = taosArrayGet(pBlock->pDataBlock, cols++);
if (colDataSetVal(pColInfo, numOfRows, varElapsed, false) != 0) {
code = TSDB_CODE_OUT_OF_MEMORY; break;
Comment on lines +407 to +415
# ins_snap_send_filesets: table must exist and have the expected 10 columns.
tdSql.query('select * from information_schema.ins_snap_send_filesets')
tdSql.checkRows(0)
tdSql.query('describe information_schema.ins_snap_send_filesets')
col_names = [row[0] for row in tdSql.queryResult]
for col in ['vgroup_id', 'fid', 'file_count', 'finished_file_count',
'total_size', 'read_size', 'start_time', 'elapsed',
'start_index', 'end_index', 'transfer_type']:
tdSql.checkEqual(True, col in col_names)
Comment thread .claude/settings.local.json Outdated
Comment on lines +3 to +5
"allow": [
"Bash(gh pr *)"
]
xiao-77 added 2 commits May 13, 2026 13:57
- mndVgroup.c: fix snapRestoring copy direction (pOld→pNew, preserves
  ephemeral in-memory state across SDB updates)
- syncMain.c: remove BUILD_NO_CALL guard from syncSnapshotRecving so it
  links in normal builds (consistent with syncSnapshotSending)
- tsdbOpen.c: free pSnapStat + pFileSetStats in tsdbClose under wrlock
  before destroying the rwlock, preventing memory leak on vnode close
- mndSnapSend.h/c: fix comment typo tsCompactPullupInterval →
  tsSnapSendPullupInterval; add SNAP_TRANSFER_TYPE_RAW #define instead
  of magic number 14; add startTimeMs==0 guard in snapSendFmtElapsed
- vmHandle.c: abort snap-send-progress request with OOM on taosArrayPush
  failure instead of silently continuing with partial results
- systable.c: widen elapsed VARCHAR from 16 to 32 bytes in both
  ins_snap_send_vnodes and ins_snap_send_filesets schemas
- vnodeSnapshot.c: update doc comment to include TSDB_CODE_OUT_OF_MEMORY
  as a possible return value
- information_schema.py: fix comment '10 columns' → '11 columns'
- .gitignore: add .claude/ to prevent local AI tool settings from being
  committed; remove .claude/settings.local.json from git tracking
- pActiveVgIds put failure: skip vgroup with continue to prevent incorrect
  stale-entry eviction of its progress cache (data correctness fix)
- pDnodesToQuery put failure: log only — missed query retried next cycle
- pStaleVgIds push failure: log only — stale entry survives one extra cycle
Copilot AI review requested due to automatic review settings May 13, 2026 07:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.

Comment thread source/dnode/mnode/impl/src/mndSnapSend.c
Comment thread source/dnode/vnode/src/inc/tsdb.h
@guanshengliang guanshengliang merged commit a64d4a2 into 3.3.6 May 13, 2026
16 checks passed
@guanshengliang guanshengliang deleted the feature/snap-send-progress branch May 13, 2026 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants