Skip to content

NM-317: scale fixes#3979

Closed
abhishek9686 wants to merge 139 commits intodevelopfrom
NM-317
Closed

NM-317: scale fixes#3979
abhishek9686 wants to merge 139 commits intodevelopfrom
NM-317

Conversation

@abhishek9686
Copy link
Copy Markdown
Member

Describe your changes

Provide Issue ticket number if applicable/not in title

Provide testing steps

Checklist before requesting a review

  • My changes affect only 10 files or less.
  • I have performed a self-review of my code and tested it.
  • If it is a new feature, I have added thorough tests, my code is <= 1450 lines.
  • If it is a bugfix, my code is <= 200 lines.
  • My functions are <= 80 lines.
  • I have had my code reviewed by a peer.
  • My unit tests pass locally.
  • Netmaker is awesome.

VishalDalwadi and others added 30 commits March 18, 2026 21:27
VishalDalwadi and others added 20 commits April 14, 2026 16:48
v1.5.1: update node check-in on latest node object
Fix(v1.5.1): set host as default using api;
- Egress: snapshot node tag IDs under the per-node mutex (and fix shared
  mutex in the node cache) so shallow-copied nodes do not range Tags
  while other goroutines write.
- Pro metrics: return a cloned Connectivity map from GetMetrics so
  status and MQTT updates cannot iterate and mutate the same map.
- HTTP: always close/drain response bodies, remove defer-in-loop in
  GetPublicIP, add bounded client timeouts (CLI, EMQX, geo-IP, OAuth,
  license validation), and fix Azure IDP Verify() double-close.
@tenki-reviewer
Copy link
Copy Markdown
Contributor

tenki-reviewer Bot commented Apr 19, 2026

Tenki Code Review - Complete

Files Reviewed: 13
Findings: 3

By Severity:

  • 🔴 High: 1
  • 🟠 Medium: 2

This PR adds HTTP client timeouts and response body cleanup throughout the codebase — a solid improvement — but introduces two concurrency bugs in the new mutex-protection helpers and an infinite loop in the CLI's refactored retry logic.

Files Reviewed (13 files)
cli/functions/http_client.go
logic/egress.go
logic/nodes.go
mq/emqx_cloud.go
mq/emqx_on_prem.go
mq/migrate.go
pro/auth/azure-ad.go
pro/auth/github.go
pro/idp/azure/azure.go
pro/license.go
pro/logic/metrics.go
servercfg/serverconf.go
utils/geoinfo.go

Copy link
Copy Markdown
Contributor

@tenki-reviewer tenki-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand to see previous review

Overview

This PR makes two broad improvements: (1) adds explicit Timeout values to every http.Client that previously had none, and (2) introduces per-node mutex protection (node.Mutex) to guard concurrent access to node.Tags maps. The HTTP timeout work is clean and well-executed. The metricsReadCopy helper in pro/logic/metrics.go and the io.Copy(io.Discard, ...) drain patterns are correct.

Bugs Found

cli/functions/http_client.go — Infinite loop on 401 with master key

The goto retry pattern was replaced with an infinite for loop and a continue. The retry guard is !retried && ctx.MasterKey == "". When ctx.MasterKey != "" and the server returns HTTP 401, the guard is false so retried is never set and continue is never called — but the subsequent res.StatusCode != http.StatusOK check is never reached either, because the 401 check sits above it with no else. The loop body falls through back to the top and repeats indefinitely.

logic/egress.go — Pre-lock len(n.Tags) is itself a data race

The new snapshotNodeTagIDs() function checks len(n.Tags) == 0 at the top (line 127) without holding n.Mutex. The function was introduced specifically to prevent concurrent map access, but the unprotected len() call is a data race if another goroutine is mutating n.Tags at the same time. Go's race detector will flag this.

logic/nodes.go — TOCTOU window in getNodesFromCache

getNodesFromCache() uses a two-phase lock: check needsMutex under RLock, release, then re-acquire a write Lock. In the gap between RUnlock (line 78) and Lock (line 80), a concurrent loadNodesIntoCache() can replace the nodesCacheMap variable entirely. The subsequent write-lock iteration then operates on the newly-loaded map instead of the one the caller expected, potentially returning an inconsistent snapshot.

Positive Changes

  • All HTTP clients now have timeouts (utils/geoinfo.go, mq/emqx_on_prem.go, pro/idp/azure/azure.go, pro/auth/github.go, pro/auth/azure-ad.go, pro/license.go, servercfg/serverconf.go).
  • defer resp.Body.Close() inside a loop was correctly replaced with explicit Close() calls to prevent descriptor leaks.
  • Response bodies are now properly drained before closing in mq/migrate.go and pro/license.go.
  • metricsReadCopy defensively clones the Connectivity map to prevent callers from sharing cached state with mutating goroutines.

Comment thread cli/functions/http_client.go
Comment thread logic/egress.go
Comment thread logic/nodes.go
… and node cache reads

- Egress: take the per-node mutex before any use of Tags in snapshotNodeTagIDs;
  keep tag key snapshots for shared shallow-copied nodes.
- Node cache: read the full node list under a single RLock in getNodesFromCache
  (remove two-phase upgrade; all store paths set Mutex before insert).
- Pro metrics: return a cloned Connectivity map from GetMetrics to avoid concurrent
  map iteration and writes in status/metrics paths.
- HTTP: close and/or drain response bodies, add client timeouts, fix GetPublicIP
  defer-in-loop, license retry body close, Azure IDP Verify double-defer, shared
  EMQX/geo clients, and CLI request 401 handling (master key / post-retry 401).
@abhishek9686
Copy link
Copy Markdown
Member Author

@tenki-reviewer

@tenki-reviewer
Copy link
Copy Markdown
Contributor

tenki-reviewer Bot commented Apr 20, 2026

Tenki Code Review - Complete

Files Reviewed: 13
Findings: 1

By Severity:

  • 🟡 Low: 1

This PR is a solid reliability improvement — adding HTTP client timeouts everywhere they were missing, fixing resource-body leaks, and introducing per-node mutex snapshotting to prevent concurrent-map iteration crashes. Two observations are noted: a shared-map reference that persists in getNodesFromCache, and an implicit loop-termination contract in the refactored CLI retry loop.

Files Reviewed (13 files)
cli/functions/http_client.go
logic/egress.go
logic/nodes.go
mq/emqx_cloud.go
mq/emqx_on_prem.go
mq/migrate.go
pro/auth/azure-ad.go
pro/auth/github.go
pro/idp/azure/azure.go
pro/license.go
pro/logic/metrics.go
servercfg/serverconf.go
utils/geoinfo.go

Copy link
Copy Markdown
Contributor

@tenki-reviewer tenki-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR systematically addresses three classes of issues across the codebase:

  1. HTTP client timeouts — every &http.Client{} or http.DefaultClient that had no timeout has been replaced with a concrete timeout (20 s–120 s), preventing goroutine hangs on unresponsive external services (EMQX, Microsoft Graph, GitHub, geo-IP APIs, license server, CLI endpoints).

  2. Response body resource managementres.Body.Close() is now paired correctly with draining (io.Copy(io.Discard, ...)) before close in mq/migrate.go and pro/license.go; the deferred-close-inside-a-loop anti-pattern in servercfg/serverconf.go is corrected.

  3. Concurrent-map safetysnapshotNodeTagIDs in logic/egress.go acquires each node's per-Mutex before copying Tags keys into a stable slice, replacing unsafe for tagID := range targetNode.Tags loops that could race with other goroutines mutating the map. getNodeFromCache in logic/nodes.go now initialises node.Mutex under the write-lock before returning, and all store/load helpers ensure Mutex != nil. GetMetrics in pro/logic/metrics.go returns a deep-cloned Connectivity map so MQTT writers and metric-monitor goroutines cannot mutate the caller's view.

Observations

getNodesFromCache still returns shared Tags references (logic/nodes.go)

getNodesFromCache appends node values, but Go map fields are reference types — the returned Node.Tags map is the same object as the one in the cache. snapshotNodeTagIDs protects its own iteration via n.Mutex, but any other caller that receives a node from getNodesFromCache and writes to Tags would race with snapshotNodeTagIDs on the same underlying map. The PR's direct fix for AddEgressInfoToPeerByAccess / GetNodeEgressInfo is correct; the fuller fix is to clone Tags at the cache boundary, as is done for Connectivity in metricsReadCopy.

Infinite for {} loop in cli/functions/http_client.go

The refactored request() function replaces the old goto retry with an infinite for loop. All current exits are either return or log.Fatalf, so no current path loops forever. The termination contract is implicit — a future maintainer adding a new status-code branch without a return/fatal would inadvertently create a spin loop. A comment or structural change would make this contract visible.

if len(resBodyBytes) > 0 {
if err := json.Unmarshal(resBodyBytes, body); err != nil {
log.Fatalf("Error unmarshalling JSON: %s %s", err, string(resBodyBytes))
for {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Implicit termination contract in the infinite for loop in request() (bug)

The refactored request() function in cli/functions/http_client.go (lines 171-214) uses an unbounded for {} loop. All current exit paths are either return body (success) or log.Fatalf (errors and repeated 401). There is no risk of spinning today. However, the termination contract is entirely implicit: a future developer adding a new HTTP status-code case without a return/fatal would create a silent infinite retry loop. The previous goto retry was self-documenting about being a single retry; the new loop is not.

💡 Suggestion: Add a comment immediately before the for { statement such as: // Every code path inside this loop must call return or log.Fatalf; there is no loop condition because the retry is bounded by the retried flag. Alternatively, restructure as a for attempt := 0; attempt <= 1; attempt++ loop to make the retry bound explicit.

📋 Prompt for AI Agents

In cli/functions/http_client.go at line 171, add a comment immediately before the for { line to make the termination contract explicit. The comment should read: // Every path in this loop terminates via return (success) or log.Fatalf (error / repeated 401). The retried flag ensures at most one token-refresh retry. This prevents future maintainers from inadvertently adding a status-code branch that silently loops forever.

@abhishek9686 abhishek9686 changed the base branch from release-v1.5.1 to develop April 21, 2026 11:47
@abhishek9686 abhishek9686 marked this pull request as draft April 22, 2026 05:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants