Skip to content

Add cache-aware placement affinity#2867

Open
ValentaTomas wants to merge 39 commits into
mainfrom
valenta/api-placement-affinity
Open

Add cache-aware placement affinity#2867
ValentaTomas wants to merge 39 commits into
mainfrom
valenta/api-placement-affinity

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented May 30, 2026

Add disabled-by-default cache-aware placement affinity using recent successful build placements. Redis state is bounded by TTL/top-N limits and is best-effort.

Record recent successful placements and use them as a small, gated best-of-K bias so warm nodes are preferred without bypassing placement filters.
@cla-bot cla-bot Bot added the cla-signed label May 30, 2026
@cursor
Copy link
Copy Markdown

cursor Bot commented May 30, 2026

PR Summary

Medium Risk
Changes core sandbox scheduling and node selection; mis-tuned affinity or exhausted-node retry loops could skew load or add latency, though the feature is off by default and Redis updates are best-effort.

Overview
Adds disabled-by-default cache-aware sandbox placement: recent successful placements per build are stored in Redis (bounded TTL and top-N) and used as a score bonus in Best-of-K scheduling so repeat runs favor nodes that likely already have the image cached. Resume skips build affinity so it does not fight node pinning; after a successful create, affinity is recorded best-effort in the background.

Placement now treats resource-exhausted nodes separately from hard failures—exhausted nodes can be retried after a short backoff instead of counting only toward the fixed retry limit—and CreateSandbox can trigger on-demand node discovery when the cluster has no connected nodes (including a local orchestrator fallback).

Reviewed by Cursor Bugbot for commit b9916c4. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 30, 2026

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
2703 2 2701 7
View the full list of 2 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 57.71% (Passed 740 times, Failed 1010 times)

Stack Traces | 59.7s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:27: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (59.74s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 57.83% (Passed 730 times, Failed 1001 times)

Stack Traces | 201s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1260}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\n"}}
Executing command bash in sandbox icffsn677ic4q2ot80wdi (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory before tmpfs mount: 191 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Free memory before tmpfs mount: 793 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Memory to use in integrity test (60% of free, min 64MB): 475 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"475+0 records in\n475+0 records out\n498073600 bytes (498 MB, 475 MiB) copied, 1.85918 s, 268 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=475\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 1.84\n\tPercent of CPU this job got: 98%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:01.86\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2684\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 344\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 32\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 671 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox ii54zzuzbbebmn4txkqsd
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1276}}
Executing command bash in sandbox ic487cegdob9yno4xisvr (user: root)
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{data:{stdout:"f957f56d0c5dd75d69f891a81e2182c030dcf2df060e08b6e3d426f329c9f4bc\n"}}
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:80: Command [bash] completed successfully in sandbox ii54zzuzbbebmn4txkqsd
    sandbox_memory_integrity_test.go:80: Command [bash] output: event:{start:{pid:1280}}
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
Executing command bash in sandbox ii54zzuzbbebmn4txkqsd (user: root)
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:81
        	            				.../hostedtoolcache/go/1.26.3.../src/runtime/asm_amd64.s:1771
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox ii54zzuzbbebmn4txkqsd: unavailable: HTTP status 502 Bad Gateway
    sandbox_memory_integrity_test.go:110: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:78
        	            				.../tests/orchestrator/sandbox_memory_integrity_test.go:110
        	Error:      	Condition never satisfied
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (201.11s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Creating and executing a Redis pipeline inside the loop in record defeats the purpose of pipelining and results in multiple sequential round-trips. Similarly, performing sequential Redis queries inside the loop in scores on the critical path of sandbox creation introduces unnecessary latency; both should be batched using a single pipeline outside the loops to improve performance.

Comment thread packages/api/internal/orchestrator/placement_affinity.go Outdated
Comment thread packages/api/internal/orchestrator/placement_affinity.go Outdated
Use a single pipeline for affinity reads and writes to keep the placement path bounded when the flag is enabled.
Use error-aware Redis checks and apply the repository formatter to the new feature flag.
Move cache affinity rollout knobs into the existing flag so Redis TTL, scoring, limits, and timeouts can be adjusted without a deploy.
Keep resume origin-node affinity as the primary signal and skip per-snapshot build keys so the Redis placement cache stays bounded.
Avoid Redis affinity lookups when resume already has a ready origin node and skip disabled affinity dimensions.
Use team-scoped template and base-template affinity as a low-cardinality fallback for resumes when origin-node placement is unavailable.
Track team-level placement history as the weakest cache-affinity fallback alongside template and base-template signals.
Keep template/team dimensions available for experiments but disable them by default until base build metadata is available.
Remove proxy affinity dimensions so the first rollout only uses concrete build cache signals.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Affinity node repeatedly selected despite ResourceExhausted failures
    • Added the ResourceExhausted node to nodesExcluded to prevent affinity augmentation from re-selecting it on retry.

Create PR

Or push these changes by commenting:

@cursor push bbadc5591a
Preview (bbadc5591a)
diff --git a/packages/api/internal/orchestrator/placement/placement.go b/packages/api/internal/orchestrator/placement/placement.go
--- a/packages/api/internal/orchestrator/placement/placement.go
+++ b/packages/api/internal/orchestrator/placement/placement.go
@@ -102,6 +102,7 @@
 
 		switch statusCode {
 		case codes.ResourceExhausted:
+			nodesExcluded[failedNode.ID] = struct{}{}
 			failedNode.PlacementMetrics.Skip(sbxRequest.GetSandbox().GetSandboxId())
 			logger.L().Warn(ctx, "Node exhausted, trying another node", logger.WithSandboxID(sbxRequest.GetSandbox().GetSandboxId()), logger.WithNodeID(failedNode.ID))
 		default:

You can send follow-ups to the cloud agent here.

Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go
Exclude nodes that reject placement as exhausted so affinity retries can fall through to other candidates.
@ValentaTomas ValentaTomas marked this pull request as ready for review May 30, 2026 21:58
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: ResourceExhausted tight retry loop on last remaining node
    • Added explicit check to exit retry loop with 'no nodes available' error when the last remaining node returns ResourceExhausted, preventing infinite retries until context timeout.

Create PR

Or push these changes by commenting:

@cursor push 7334edb433
Preview (7334edb433)
diff --git a/packages/api/internal/orchestrator/placement/placement.go b/packages/api/internal/orchestrator/placement/placement.go
--- a/packages/api/internal/orchestrator/placement/placement.go
+++ b/packages/api/internal/orchestrator/placement/placement.go
@@ -104,6 +104,9 @@
 		case codes.ResourceExhausted:
 			if len(nodesExcluded)+1 < len(clusterNodes) {
 				nodesExcluded[failedNode.ID] = struct{}{}
+			} else {
+				// All nodes are exhausted; exit the retry loop.
+				return nil, errors.New("no nodes available")
 			}
 			failedNode.PlacementMetrics.Skip(sbxRequest.GetSandbox().GetSandboxId())
 			logger.L().Warn(ctx, "Node exhausted, trying another node", logger.WithSandboxID(sbxRequest.GetSandbox().GetSandboxId()), logger.WithNodeID(failedNode.ID))

You can send follow-ups to the cloud agent here.

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 38d8022e3b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/create_instance.go Outdated
dobrac
dobrac previously requested changes May 30, 2026
Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go
Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go
Comment thread packages/api/internal/orchestrator/create_instance.go
Comment thread packages/api/internal/orchestrator/placement_affinity.go Outdated
Comment thread packages/api/internal/orchestrator/create_instance.go
Comment thread packages/api/internal/orchestrator/placement_affinity.go
Comment thread packages/api/internal/orchestrator/placement_affinity.go
Comment thread packages/api/internal/orchestrator/placement_affinity.go Outdated
Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go
Comment thread packages/api/internal/orchestrator/client.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 75d9d736d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/client.go Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Affinity bonus not normalized to placement score scale
    • Normalized the affinity bonus by dividing it by totalCapacity to ensure consistent relative impact across nodes of different sizes.

Create PR

Or push these changes by commenting:

@cursor push 878be92bc5
Preview (878be92bc5)
diff --git a/packages/api/internal/orchestrator/placement/placement_best_of_K.go b/packages/api/internal/orchestrator/placement/placement_best_of_K.go
--- a/packages/api/internal/orchestrator/placement/placement_best_of_K.go
+++ b/packages/api/internal/orchestrator/placement/placement_best_of_K.go
@@ -63,7 +63,7 @@
 
 	score := (cpuRequested + float64(reserved) + config.Alpha*usageAvg) / totalCapacity
 	if len(affinityScores) > 0 {
-		score -= affinityScores[0][node.ID]
+		score -= affinityScores[0][node.ID] / totalCapacity
 	}
 
 	return score

You can send follow-ups to the cloud agent here.

Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Exhausted nodes never cleared when hard failure fills threshold
    • Added logic to check and clear nodesExhausted before returning 'no nodes available', ensuring capacity-exhausted nodes are retried even when a hard failure fills the threshold.

Create PR

Or push these changes by commenting:

@cursor push 61822754ac
Preview (61822754ac)
diff --git a/packages/api/internal/orchestrator/placement/placement.go b/packages/api/internal/orchestrator/placement/placement.go
--- a/packages/api/internal/orchestrator/placement/placement.go
+++ b/packages/api/internal/orchestrator/placement/placement.go
@@ -62,7 +62,12 @@
 				skip[id] = struct{}{}
 			}
 			if len(skip) >= len(clusterNodes) {
-				return nil, errors.New("no nodes available")
+				if len(nodesExhausted) > 0 {
+					clear(nodesExhausted)
+					attempt++
+				} else {
+					return nil, errors.New("no nodes available")
+				}
 			}
 
 			node, err = algorithm.chooseNode(ctx, clusterNodes, skip, nodemanager.SandboxResources{CPUs: sbxRequest.GetSandbox().GetVcpu(), MiBMemory: sbxRequest.GetSandbox().GetRamMb()}, buildMachineInfo, labelFilteringEnabled, requiredLabels, affinityScores...)

You can send follow-ups to the cloud agent here.

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a6080cc64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/cache.go Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Exhausted-pool retry never triggers with ineligible cluster nodes
    • Fixed by counting only eligible nodes (passing status, CPU, and label checks) instead of all cluster nodes when determining if the exhausted pool should be retried.

Create PR

Or push these changes by commenting:

@cursor push 6e8cf6cf45
Preview (6e8cf6cf45)
diff --git a/packages/api/internal/orchestrator/placement/placement.go b/packages/api/internal/orchestrator/placement/placement.go
--- a/packages/api/internal/orchestrator/placement/placement.go
+++ b/packages/api/internal/orchestrator/placement/placement.go
@@ -10,8 +10,11 @@
 	"google.golang.org/grpc/codes"
 	"google.golang.org/grpc/status"
 
+	"github.com/e2b-dev/infra/packages/api/internal/api"
 	"github.com/e2b-dev/infra/packages/api/internal/orchestrator/nodemanager"
 	"github.com/e2b-dev/infra/packages/api/internal/utils"
+	"github.com/e2b-dev/infra/packages/shared/pkg/consts"
+	"github.com/e2b-dev/infra/packages/shared/pkg/env"
 	"github.com/e2b-dev/infra/packages/shared/pkg/grpc/orchestrator"
 	"github.com/e2b-dev/infra/packages/shared/pkg/logger"
 	"github.com/e2b-dev/infra/packages/shared/pkg/machineinfo"
@@ -22,6 +25,41 @@
 
 var errSandboxCreateFailed = errors.New("failed to create a new sandbox, if the problem persists, contact us")
 
+// countEligibleNodes returns the count of nodes that pass basic static eligibility criteria.
+// This excludes nodes with wrong status, incompatible CPU, or mismatched labels—factors
+// that don't change during placement retries.
+func countEligibleNodes(clusterNodes []*nodemanager.Node, buildMachineInfo machineinfo.MachineInfo, labelFilteringEnabled bool, requiredLabels []string) int {
+	count := 0
+	for _, node := range clusterNodes {
+		if isNodeEligible(node, buildMachineInfo, labelFilteringEnabled, requiredLabels) {
+			count++
+		}
+	}
+	return count
+}
+
+// isNodeEligible checks static eligibility criteria that don't change during retries.
+// These include node status, CPU compatibility, and label matching.
+func isNodeEligible(node *nodemanager.Node, buildMachineInfo machineinfo.MachineInfo, filterByLabels bool, requiredLabels []string) bool {
+	// Local nodes are synthetic and bypass status/label checks
+	if env.IsLocal() && node.ClusterID == consts.LocalClusterID {
+		return true
+	}
+	// Node must be ready
+	if node.Status() != api.NodeStatusReady {
+		return false
+	}
+	// CPU must be compatible
+	if !isNodeCPUCompatible(node, buildMachineInfo) {
+		return false
+	}
+	// Labels must match if filtering is enabled
+	if filterByLabels && !isNodeLabelsCompatible(node, requiredLabels) {
+		return false
+	}
+	return true
+}
+
 // Algorithm defines the interface for sandbox placement strategies.
 // Implementations should choose an optimal node based on available resources
 // and current load distribution.
@@ -42,6 +80,7 @@
 		node = preferredNode
 	}
 
+	eligibleNodeCount := countEligibleNodes(clusterNodes, buildMachineInfo, labelFilteringEnabled, requiredLabels)
 	attempt := 0
 	for attempt < maxRetries {
 		select {
@@ -61,7 +100,7 @@
 			for id := range nodesExhausted {
 				skip[id] = struct{}{}
 			}
-			if len(skip) >= len(clusterNodes) {
+			if len(skip) >= eligibleNodeCount {
 				return nil, errors.New("no nodes available")
 			}
 
@@ -114,7 +153,7 @@
 			failedNode.PlacementMetrics.Skip(sbxRequest.GetSandbox().GetSandboxId())
 			// Once every node is excluded but some were only capacity-exhausted,
 			// retry the whole exhausted pool since capacity may free up.
-			if len(nodesExcluded)+len(nodesExhausted) >= len(clusterNodes) {
+			if len(nodesExcluded)+len(nodesExhausted) >= eligibleNodeCount {
 				clear(nodesExhausted)
 				attempt++
 			}

You can send follow-ups to the cloud agent here.

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Nil affinity map always triggers Score branch needlessly
    • Added nil check to avoid passing nil map as variadic argument, preventing unnecessary affinity scoring branch execution when feature is disabled.

Create PR

Or push these changes by commenting:

@cursor push 9d641d9600
Preview (9d641d9600)
diff --git a/packages/api/internal/orchestrator/create_instance.go b/packages/api/internal/orchestrator/create_instance.go
--- a/packages/api/internal/orchestrator/create_instance.go
+++ b/packages/api/internal/orchestrator/create_instance.go
@@ -337,7 +337,11 @@
 		affinityScores = o.placementAffinity.scores(ctx, placementCacheAffinityConfig, nodeClusterID, affinityBuildID)
 	}
 
-	node, err = placement.PlaceSandbox(ctx, o.placementAlgorithm, clusterNodes, node, sbxRequest, builds.ToMachineInfo(sbxData.Build), labelFilteringEnabled, team.SandboxSchedulingLabels, affinityScores)
+	if affinityScores != nil {
+		node, err = placement.PlaceSandbox(ctx, o.placementAlgorithm, clusterNodes, node, sbxRequest, builds.ToMachineInfo(sbxData.Build), labelFilteringEnabled, team.SandboxSchedulingLabels, affinityScores)
+	} else {
+		node, err = placement.PlaceSandbox(ctx, o.placementAlgorithm, clusterNodes, node, sbxRequest, builds.ToMachineInfo(sbxData.Build), labelFilteringEnabled, team.SandboxSchedulingLabels)
+	}
 	if err != nil {
 		return sandbox.Sandbox{}, &api.APIError{
 			Code:      http.StatusInternalServerError,

You can send follow-ups to the cloud agent here.

Comment thread packages/api/internal/orchestrator/create_instance.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fdb1b1e768

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/client.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53bfe748ae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/placement/placement.go
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9277fc3cd9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Exhausted retry counter unreachable due to attempt increment
    • Removed the attempt++ increment from the exhausted pool retry path so maxExhaustedRetries (100) can be reached before the outer loop terminates at maxRetries (3).

Create PR

Or push these changes by commenting:

@cursor push 9277fc3cd9

You can send follow-ups to the cloud agent here.

Comment thread packages/api/internal/orchestrator/placement/placement.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 202bab4374

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/create_instance.go Outdated
@ValentaTomas ValentaTomas requested a review from dobrac May 31, 2026 05:08
@ValentaTomas ValentaTomas dismissed dobrac’s stale review May 31, 2026 05:08

Rerequested review.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d362021fe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/placement/placement_best_of_K.go Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Affinity bonus not normalized by totalCapacity in Score
    • Normalized the affinity bonus by dividing it by totalCapacity to match the base score normalization, ensuring consistent placement behavior across nodes with different CPU counts.

Create PR

Or push these changes by commenting:

@cursor push b7048af342
Preview (b7048af342)
diff --git a/packages/api/internal/orchestrator/placement/placement_best_of_K.go b/packages/api/internal/orchestrator/placement/placement_best_of_K.go
--- a/packages/api/internal/orchestrator/placement/placement_best_of_K.go
+++ b/packages/api/internal/orchestrator/placement/placement_best_of_K.go
@@ -63,7 +63,7 @@
 
 	score := (cpuRequested + float64(reserved) + config.Alpha*usageAvg) / totalCapacity
 	if len(affinityScores) > 0 {
-		score -= affinityScores[0][node.ID]
+		score -= affinityScores[0][node.ID] / totalCapacity
 	}
 
 	return score

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit b9916c4. Configure here.

score := (cpuRequested + float64(reserved) + config.Alpha*usageAvg) / totalCapacity
if len(affinityScores) > 0 {
score -= affinityScores[0][node.ID]
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Affinity bonus not normalized by totalCapacity in Score

Medium Severity

The base placement score is divided by totalCapacity (R * cpuCount), but the affinity bonus subtraction on line 66 is applied as a raw absolute value without the same normalization. The PR discussion explicitly states this was fixed ("normalizing the affinity bonus in Score on the same totalCapacity denominator as the base score"), but the code subtracts affinityScores[0][node.ID] directly from the already-normalized score. This causes the affinity bonus to have a disproportionately larger relative effect on high-CPU nodes (where base scores are smaller) compared to low-CPU nodes.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b9916c4. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants