NM-317: scale fixes#3979
Conversation
…to fixes/v1.5.1
…to fixes/v1.5.1
…to fixes/v1.5.1
…l/netmaker into fix/set-default-using-api
…into patch/move-to-migrate
Remove SetDNS call;
…tmaker into fix/empty-out-set-dns
fix(go): empty out set dns;
v1.5.1: update node check-in on latest node object
…into fix/set-default-using-api
Fix(v1.5.1): set host as default using api;
- Egress: snapshot node tag IDs under the per-node mutex (and fix shared mutex in the node cache) so shallow-copied nodes do not range Tags while other goroutines write. - Pro metrics: return a cloned Connectivity map from GetMetrics so status and MQTT updates cannot iterate and mutate the same map. - HTTP: always close/drain response bodies, remove defer-in-loop in GetPublicIP, add bounded client timeouts (CLI, EMQX, geo-IP, OAuth, license validation), and fix Azure IDP Verify() double-close.
|
Tenki Code Review - Complete Files Reviewed: 13 By Severity:
This PR adds HTTP client timeouts and response body cleanup throughout the codebase — a solid improvement — but introduces two concurrency bugs in the new mutex-protection helpers and an infinite loop in the CLI's refactored retry logic. Files Reviewed (13 files) |
There was a problem hiding this comment.
Expand to see previous review
Overview
This PR makes two broad improvements: (1) adds explicit Timeout values to every http.Client that previously had none, and (2) introduces per-node mutex protection (node.Mutex) to guard concurrent access to node.Tags maps. The HTTP timeout work is clean and well-executed. The metricsReadCopy helper in pro/logic/metrics.go and the io.Copy(io.Discard, ...) drain patterns are correct.
Bugs Found
cli/functions/http_client.go — Infinite loop on 401 with master key
The goto retry pattern was replaced with an infinite for loop and a continue. The retry guard is !retried && ctx.MasterKey == "". When ctx.MasterKey != "" and the server returns HTTP 401, the guard is false so retried is never set and continue is never called — but the subsequent res.StatusCode != http.StatusOK check is never reached either, because the 401 check sits above it with no else. The loop body falls through back to the top and repeats indefinitely.
logic/egress.go — Pre-lock len(n.Tags) is itself a data race
The new snapshotNodeTagIDs() function checks len(n.Tags) == 0 at the top (line 127) without holding n.Mutex. The function was introduced specifically to prevent concurrent map access, but the unprotected len() call is a data race if another goroutine is mutating n.Tags at the same time. Go's race detector will flag this.
logic/nodes.go — TOCTOU window in getNodesFromCache
getNodesFromCache() uses a two-phase lock: check needsMutex under RLock, release, then re-acquire a write Lock. In the gap between RUnlock (line 78) and Lock (line 80), a concurrent loadNodesIntoCache() can replace the nodesCacheMap variable entirely. The subsequent write-lock iteration then operates on the newly-loaded map instead of the one the caller expected, potentially returning an inconsistent snapshot.
Positive Changes
- All HTTP clients now have timeouts (
utils/geoinfo.go,mq/emqx_on_prem.go,pro/idp/azure/azure.go,pro/auth/github.go,pro/auth/azure-ad.go,pro/license.go,servercfg/serverconf.go). defer resp.Body.Close()inside a loop was correctly replaced with explicitClose()calls to prevent descriptor leaks.- Response bodies are now properly drained before closing in
mq/migrate.goandpro/license.go. metricsReadCopydefensively clones theConnectivitymap to prevent callers from sharing cached state with mutating goroutines.
… and node cache reads - Egress: take the per-node mutex before any use of Tags in snapshotNodeTagIDs; keep tag key snapshots for shared shallow-copied nodes. - Node cache: read the full node list under a single RLock in getNodesFromCache (remove two-phase upgrade; all store paths set Mutex before insert). - Pro metrics: return a cloned Connectivity map from GetMetrics to avoid concurrent map iteration and writes in status/metrics paths. - HTTP: close and/or drain response bodies, add client timeouts, fix GetPublicIP defer-in-loop, license retry body close, Azure IDP Verify double-defer, shared EMQX/geo clients, and CLI request 401 handling (master key / post-retry 401).
|
Tenki Code Review - Complete Files Reviewed: 13 By Severity:
This PR is a solid reliability improvement — adding HTTP client timeouts everywhere they were missing, fixing resource-body leaks, and introducing per-node mutex snapshotting to prevent concurrent-map iteration crashes. Two observations are noted: a shared-map reference that persists in getNodesFromCache, and an implicit loop-termination contract in the refactored CLI retry loop. Files Reviewed (13 files) |
There was a problem hiding this comment.
Overview
This PR systematically addresses three classes of issues across the codebase:
-
HTTP client timeouts — every
&http.Client{}orhttp.DefaultClientthat had no timeout has been replaced with a concrete timeout (20 s–120 s), preventing goroutine hangs on unresponsive external services (EMQX, Microsoft Graph, GitHub, geo-IP APIs, license server, CLI endpoints). -
Response body resource management —
res.Body.Close()is now paired correctly with draining (io.Copy(io.Discard, ...)) before close inmq/migrate.goandpro/license.go; the deferred-close-inside-a-loop anti-pattern inservercfg/serverconf.gois corrected. -
Concurrent-map safety —
snapshotNodeTagIDsinlogic/egress.goacquires each node's per-Mutexbefore copyingTagskeys into a stable slice, replacing unsafefor tagID := range targetNode.Tagsloops that could race with other goroutines mutating the map.getNodeFromCacheinlogic/nodes.gonow initialisesnode.Mutexunder the write-lock before returning, and all store/load helpers ensureMutex != nil.GetMetricsinpro/logic/metrics.goreturns a deep-clonedConnectivitymap so MQTT writers and metric-monitor goroutines cannot mutate the caller's view.
Observations
getNodesFromCache still returns shared Tags references (logic/nodes.go)
getNodesFromCache appends node values, but Go map fields are reference types — the returned Node.Tags map is the same object as the one in the cache. snapshotNodeTagIDs protects its own iteration via n.Mutex, but any other caller that receives a node from getNodesFromCache and writes to Tags would race with snapshotNodeTagIDs on the same underlying map. The PR's direct fix for AddEgressInfoToPeerByAccess / GetNodeEgressInfo is correct; the fuller fix is to clone Tags at the cache boundary, as is done for Connectivity in metricsReadCopy.
Infinite for {} loop in cli/functions/http_client.go
The refactored request() function replaces the old goto retry with an infinite for loop. All current exits are either return or log.Fatalf, so no current path loops forever. The termination contract is implicit — a future maintainer adding a new status-code branch without a return/fatal would inadvertently create a spin loop. A comment or structural change would make this contract visible.
| if len(resBodyBytes) > 0 { | ||
| if err := json.Unmarshal(resBodyBytes, body); err != nil { | ||
| log.Fatalf("Error unmarshalling JSON: %s %s", err, string(resBodyBytes)) | ||
| for { |
There was a problem hiding this comment.
🟡 Implicit termination contract in the infinite for loop in request() (bug)
The refactored request() function in cli/functions/http_client.go (lines 171-214) uses an unbounded for {} loop. All current exit paths are either return body (success) or log.Fatalf (errors and repeated 401). There is no risk of spinning today. However, the termination contract is entirely implicit: a future developer adding a new HTTP status-code case without a return/fatal would create a silent infinite retry loop. The previous goto retry was self-documenting about being a single retry; the new loop is not.
💡 Suggestion: Add a comment immediately before the for { statement such as: // Every code path inside this loop must call return or log.Fatalf; there is no loop condition because the retry is bounded by the retried flag. Alternatively, restructure as a for attempt := 0; attempt <= 1; attempt++ loop to make the retry bound explicit.
📋 Prompt for AI Agents
In cli/functions/http_client.go at line 171, add a comment immediately before the for { line to make the termination contract explicit. The comment should read: // Every path in this loop terminates via return (success) or log.Fatalf (error / repeated 401). The retried flag ensures at most one token-refresh retry. This prevents future maintainers from inadvertently adding a status-code branch that silently loops forever.
Describe your changes
Provide Issue ticket number if applicable/not in title
Provide testing steps
Checklist before requesting a review