Implement state service on top of etcd. by EngHabu · Pull Request #6902 · flyteorg/flyte

EngHabu · 2026-02-08T16:36:43Z

Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>

main
- Flyte 2 WIP #6583
  - Fix vet issues #6901
    - Implement state service on top of etcd. #6902 👈
      - Initial executor <> Plugins integration #6903
        - Switch to pflags #6904

popojk · 2026-02-23T07:37:41Z

state/service/state_service.go

-
-	go s.repo.ActionRepo().WatchStateUpdates(ctx, updates, errs)
+	// Subscribe to updates
+	updateCh := s.k8sClient.Subscribe()


Should we subscribe first and then call the list API for the first batch of data? Otherwise, updates that arrive in the window between the list call and the subscription could be missed.
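To illustrate the ordering suggested above, here is a minimal sketch of subscribe-then-list against an in-memory store. The `Store`, `Update`, `SubscribeThenList`, and `IsStale` names and the `Version` field are assumptions made for this example and are not part of the PR. Subscribing first means an update can show up both in the snapshot and on the channel, so the sketch drops live updates that the snapshot already covers.

```go
package main

import (
	"fmt"
	"sync"
)

// Update is a hypothetical stand-in for the PR's update type.
type Update struct {
	ActionID string
	Version  int
}

// Store is a hypothetical in-memory state store that fans out to subscribers.
type Store struct {
	mu    sync.Mutex
	state map[string]Update
	subs  []chan Update
}

func NewStore() *Store { return &Store{state: map[string]Update{}} }

// Subscribe registers a buffered channel that receives every later Put.
func (s *Store) Subscribe(buf int) <-chan Update {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan Update, buf)
	s.subs = append(s.subs, ch)
	return ch
}

// List returns a snapshot of the current state.
func (s *Store) List() []Update {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]Update, 0, len(s.state))
	for _, u := range s.state {
		out = append(out, u)
	}
	return out
}

// Put stores the update and offers it to every subscriber without blocking.
func (s *Store) Put(u Update) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state[u.ActionID] = u
	for _, ch := range s.subs {
		select {
		case ch <- u:
		default: // subscriber buffer full; dropped in this sketch
		}
	}
}

// SubscribeThenList subscribes before listing, so no update can fall into
// the gap between the two calls. It also returns the versions seen in the
// snapshot so that duplicates on the channel can be filtered out.
func SubscribeThenList(s *Store, buf int) ([]Update, <-chan Update, map[string]int) {
	ch := s.Subscribe(buf)
	snapshot := s.List()
	seen := make(map[string]int, len(snapshot))
	for _, u := range snapshot {
		seen[u.ActionID] = u.Version
	}
	return snapshot, ch, seen
}

// IsStale reports whether a live update is already covered by the snapshot.
func IsStale(seen map[string]int, u Update) bool {
	v, ok := seen[u.ActionID]
	return ok && u.Version <= v
}

func main() {
	s := NewStore()
	s.Put(Update{ActionID: "a0", Version: 1})
	snapshot, ch, seen := SubscribeThenList(s, 8)
	s.Put(Update{ActionID: "a0", Version: 2})
	live := <-ch
	fmt.Println(len(snapshot), IsStale(seen, live))
}
```

The cost of this ordering is possible duplicates rather than possible gaps, which is consistent with an at-least-once contract.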

popojk · 2026-02-23T08:03:58Z

state/service/state_service.go

-	if err := s.repo.ActionRepo().NotifyStateUpdate(ctx, msg.ActionId); err != nil {
-		logger.Warnf(ctx, "Failed to send state update notification: %v", err)
-		// Continue anyway - the update was saved
-	}


I think we should instead send a notification to the run service here, so that the run service can update the DB and show the latest action status in the UI.

It would be better to create a new goroutine that subscribes to the state client (as we did in the state service Watch). Every time there is an update, we can then run UpdateActionState and NotifyStateUpdate.
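A rough sketch of that forwarding loop is shown below. The `ActionRepo` interface uses simplified, assumed signatures (plain strings rather than the proto types used in the PR), and `ActionUpdate`, `forwardUpdates`, `recordingRepo`, and `demo` are hypothetical names introduced only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"log"
)

// ActionUpdate is a hypothetical, simplified update event.
type ActionUpdate struct {
	ActionID string
	State    string
}

// ActionRepo mirrors the two repo calls named in the comment, with
// simplified (assumed) signatures.
type ActionRepo interface {
	UpdateActionState(ctx context.Context, actionID, state string) error
	NotifyStateUpdate(ctx context.Context, actionID string) error
}

// forwardUpdates persists each update and then notifies listeners.
// It returns when ctx is cancelled or the updates channel is closed.
func forwardUpdates(ctx context.Context, updates <-chan *ActionUpdate, repo ActionRepo) {
	for {
		select {
		case <-ctx.Done():
			return
		case u, ok := <-updates:
			if !ok {
				return
			}
			if err := repo.UpdateActionState(ctx, u.ActionID, u.State); err != nil {
				log.Printf("failed to persist state for %s: %v", u.ActionID, err)
				continue // nothing was saved, so do not notify
			}
			if err := repo.NotifyStateUpdate(ctx, u.ActionID); err != nil {
				log.Printf("failed to notify for %s: %v", u.ActionID, err)
			}
		}
	}
}

// recordingRepo is an in-memory fake used only for the demo below.
type recordingRepo struct{ calls []string }

func (r *recordingRepo) UpdateActionState(_ context.Context, id, state string) error {
	r.calls = append(r.calls, "update:"+id+"="+state)
	return nil
}

func (r *recordingRepo) NotifyStateUpdate(_ context.Context, id string) error {
	r.calls = append(r.calls, "notify:"+id)
	return nil
}

// demo feeds one update through forwardUpdates and returns the recorded calls.
func demo() []string {
	repo := &recordingRepo{}
	updates := make(chan *ActionUpdate, 1)
	updates <- &ActionUpdate{ActionID: "a0", State: "RUNNING"}
	close(updates)
	forwardUpdates(context.Background(), updates, repo)
	return repo.calls
}

func main() {
	fmt.Println(demo())
}
```

In a service, the loop would typically be started once with `go forwardUpdates(ctx, updates, repo)` and stopped by cancelling `ctx`.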

popojk · 2026-02-23T08:10:53Z

state/service/state_service.go

+	// Must be same run
 	if actionID.Run.Org != parentActionID.Run.Org ||
 		actionID.Run.Project != parentActionID.Run.Project ||
 		actionID.Run.Domain != parentActionID.Run.Domain ||
 		actionID.Run.Name != parentActionID.Run.Name {
 		return false
 	}

-	// For now, we'll include all actions in the run
-	// In production, you'd check the parent relationship
+	// For now, include all actions in the same run
+	// A more sophisticated implementation would check the parent-child relationship
 	return true


nit:

Suggested change

-	// Must be same run
-	if actionID.Run.Org != parentActionID.Run.Org ||
-		actionID.Run.Project != parentActionID.Run.Project ||
-		actionID.Run.Domain != parentActionID.Run.Domain ||
-		actionID.Run.Name != parentActionID.Run.Name {
-		return false
-	}
-	// For now, we'll include all actions in the run
-	// In production, you'd check the parent relationship
-	// For now, include all actions in the same run
-	// A more sophisticated implementation would check the parent-child relationship
-	return true
+	// Must be same run
+	// For now, include all actions in the same run
+	// A more sophisticated implementation would check the parent-child relationship
+	return actionID.Run.Org == parentActionID.Run.Org &&
+		actionID.Run.Project == parentActionID.Run.Project &&
+		actionID.Run.Domain == parentActionID.Run.Domain &&
+		actionID.Run.Name == parentActionID.Run.Name

popojk · 2026-02-23T08:18:31Z

state/service/state_service.go

-	// Update action state in database
-	if err := s.repo.ActionRepo().UpdateActionState(ctx, msg.ActionId, msg.State); err != nil {
+	// Update TaskAction state in Kubernetes
+	if err := s.k8sClient.PutState(ctx, msg.ActionId, msg.State); err != nil {


nit: an idea to consider for a separate PR. I suggest we store the updated status in a local cache after the k8s CR is updated. Get/Watch requests could then read from the cache first, which would avoid a possible bottleneck in the k8s API.

I think we can read directly from the informer cache and do not need to implement this ourselves. This can be improved in a future PR.

popojk · 2026-02-23T08:25:36Z

state/k8s/client.go

+	}
+
+	// Update state JSON
+	taskAction.Status.StateJSON = stateJSON


Should we check whether the new state equals the previous state, and skip the update if it does?
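A minimal sketch of such a guard follows. The `TaskActionStatus` struct here is a simplified stand-in, and treating `StateJSON` as a `string` is an assumption; `setStateIfChanged` is a hypothetical helper, not a function from the PR.

```go
package main

import "fmt"

// TaskActionStatus is a simplified stand-in for the CR status; the field
// type is assumed to be string for this sketch.
type TaskActionStatus struct {
	StateJSON string
}

// setStateIfChanged writes newState only when it differs from the stored
// value and reports whether a write (and thus an API update) is needed.
func setStateIfChanged(status *TaskActionStatus, newState string) bool {
	if status.StateJSON == newState {
		return false // identical payload: skip the update call
	}
	status.StateJSON = newState
	return true
}

func main() {
	status := &TaskActionStatus{StateJSON: `{"phase":"RUNNING"}`}
	fmt.Println(setStateIfChanged(status, `{"phase":"RUNNING"}`))
	fmt.Println(setStateIfChanged(status, `{"phase":"SUCCEEDED"}`))
}
```

Note that this is a byte-wise comparison: two JSON documents that are semantically equal but serialized with different key order or whitespace would still count as a change.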

popojk · 2026-02-23T08:51:54Z

state/k8s/client.go

+	defer c.mu.Unlock()
+
+	ch := make(chan *ActionUpdate, c.bufferSize)
+	c.subscribers[ch] = struct{}{}


The watch action API request is scoped by parent action ID. Should we maintain a {parent action ID: subscribers} map here? An update should only be sent to subscribers listening on the updated action's parent ID.

Makes sense to me!
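One way the parent-scoped map could look is sketched below. `Hub`, its methods, and the string-typed `ParentID` key are assumptions for illustration; the PR's client keeps a flat `subscribers` set, and the real action identifiers are proto messages rather than strings.

```go
package main

import (
	"fmt"
	"sync"
)

// ActionUpdate is a simplified update; ParentID as a string key is an
// assumption made for this sketch.
type ActionUpdate struct {
	ParentID string
	ActionID string
}

// Hub keeps one subscriber set per parent action ID.
type Hub struct {
	mu          sync.RWMutex
	subscribers map[string]map[chan *ActionUpdate]struct{}
	bufferSize  int
}

func NewHub(bufferSize int) *Hub {
	return &Hub{
		subscribers: make(map[string]map[chan *ActionUpdate]struct{}),
		bufferSize:  bufferSize,
	}
}

// Subscribe registers a channel that only receives updates for parentID.
func (h *Hub) Subscribe(parentID string) chan *ActionUpdate {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan *ActionUpdate, h.bufferSize)
	if h.subscribers[parentID] == nil {
		h.subscribers[parentID] = make(map[chan *ActionUpdate]struct{})
	}
	h.subscribers[parentID][ch] = struct{}{}
	return ch
}

// Unsubscribe removes the channel and closes it. Publish holds the read
// lock while sending, so it can never send on a closed channel.
func (h *Hub) Unsubscribe(parentID string, ch chan *ActionUpdate) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if set, ok := h.subscribers[parentID]; ok {
		if _, found := set[ch]; found {
			delete(set, ch)
			close(ch)
		}
		if len(set) == 0 {
			delete(h.subscribers, parentID)
		}
	}
}

// Publish offers the update to subscribers of its parent only and returns
// how many of them accepted it.
func (h *Hub) Publish(u *ActionUpdate) int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	delivered := 0
	for ch := range h.subscribers[u.ParentID] {
		select {
		case ch <- u:
			delivered++
		default: // subscriber buffer full; skipped in this sketch
		}
	}
	return delivered
}

func main() {
	hub := NewHub(4)
	p1 := hub.Subscribe("parent-1")
	p2 := hub.Subscribe("parent-2")
	n := hub.Publish(&ActionUpdate{ParentID: "parent-1", ActionID: "child-a"})
	fmt.Println(n, len(p1), len(p2))
}
```

Compared with broadcasting to every subscriber and filtering per stream, this keeps unrelated watchers' buffers from filling with updates they would discard anyway.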

machichima · 2026-02-24T06:14:36Z

manager/cmd/main.go

-		return fmt.Errorf("failed to initialize scheme: %w", err)
-	}
+	// Create a client.Client from the WithWatch client for services that don't need watch
+	var regularK8sClient client.Client = k8sClient


Suggested change

-	var regularK8sClient client.Client = k8sClient
+	var k8sClientWithoutWatch client.Client = k8sClient

nit: make it clear that this k8s client does not have watch support

machichima · 2026-02-24T07:25:59Z

state/k8s/client.go

+		select {
+		case ch <- update:
+		default:
+			// Channel full, skip (non-blocking)


If the channel is full, the update can be lost. Based on the proto, Watch should guarantee at-least-once delivery:

flyte/flyteidl2/workflow/state_service.proto

Lines 21 to 22 in 78afd14

// watch for updates to the state of actions. this api guarantees at-least-once delivery semantics.

rpc Watch(WatchRequest) returns (stream WatchResponse) {}

I think we can just log an error for now, add a TODO, and handle this in the future if needed.
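The interim approach described above (log an error and leave a TODO) might look roughly like the following. `trySend`, `Dropped`, and the package-level counter are hypothetical names for this sketch, and the standard `log` package stands in for the project's own logger.

```go
package main

import (
	"log"
	"sync/atomic"
)

// ActionUpdate is a simplified stand-in for the PR's update type.
type ActionUpdate struct{ ActionID string }

// droppedUpdates counts updates that could not be delivered.
var droppedUpdates int64

// trySend offers the update without blocking. When the subscriber's buffer
// is full, the update is dropped, counted, and logged.
func trySend(ch chan<- *ActionUpdate, update *ActionUpdate) bool {
	select {
	case ch <- update:
		return true
	default:
		// TODO: Watch promises at-least-once delivery, so dropping here
		// violates the contract. Revisit with blocking sends, resync, or
		// disconnecting slow subscribers.
		atomic.AddInt64(&droppedUpdates, 1)
		log.Printf("ERROR: subscriber channel full, dropping update for action %s", update.ActionID)
		return false
	}
}

// Dropped returns how many updates have been dropped so far.
func Dropped() int64 { return atomic.LoadInt64(&droppedUpdates) }

func main() {
	ch := make(chan *ActionUpdate, 1)
	trySend(ch, &ActionUpdate{ActionID: "a0"}) // fits in the buffer
	trySend(ch, &ActionUpdate{ActionID: "a1"}) // buffer full: logged and counted
	log.Printf("dropped so far: %d", Dropped())
}
```

Keeping a counter next to the log line makes it possible to tell later whether drops happen often enough to justify a stronger delivery mechanism.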

machichima · 2026-02-24T07:28:42Z

manager/cmd/main.go

 	logger.Infof(ctx, "Kubernetes client initialized for namespace: %s", cfg.Kubernetes.Namespace)

+	// Create state client (K8s-based, for watching TaskAction CRs)
+	stateK8sClient := statek8s.NewStateClient(k8sClient, cfg.Kubernetes.Namespace, 100)


Related to https://github.com/flyteorg/flyte/pull/6902/changes#r2845032983

What would be a good default value for the buffer size? And should we make it configurable?
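If the buffer size does become configurable, one simple shape is a config field that falls back to the current hard-coded value. The `StateConfig` struct, its `WatchBufferSize` field, and the helper below are assumptions for illustration; only the value 100 comes from the diff above.

```go
package main

import "fmt"

// defaultWatchBufferSize matches the value currently hard-coded in main.go.
const defaultWatchBufferSize = 100

// StateConfig is a hypothetical config section for the state client.
type StateConfig struct {
	// WatchBufferSize is the per-subscriber channel capacity.
	// Zero or negative means "use the default".
	WatchBufferSize int
}

// EffectiveWatchBufferSize returns the configured size or the default.
func (c StateConfig) EffectiveWatchBufferSize() int {
	if c.WatchBufferSize <= 0 {
		return defaultWatchBufferSize
	}
	return c.WatchBufferSize
}

func main() {
	fmt.Println(StateConfig{}.EffectiveWatchBufferSize())
	fmt.Println(StateConfig{WatchBufferSize: 512}.EffectiveWatchBufferSize())
}
```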

EngHabu added 4 commits January 24, 2026 09:08

Implement state service on top of etcd.

8bac01e

Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>

Merge branch 'v2' into enghabu/state-etcd

fcbbc23

Merge branch 'v2' into enghabu/state-etcd

781120c

Merge branch 'enghabu/vet' into enghabu/state-etcd

78afd14

This was referenced Feb 8, 2026

Fix vet issues #6901 (Open)
Flyte 2 WIP #6583 (Draft)
Initial executor <> Plugins integration #6903 (Draft)
Switch to pflags #6904 (Draft)

popojk reviewed Feb 23, 2026

machichima reviewed Feb 24, 2026

Merge branch 'enghabu/vet' into enghabu/state-etcd

936ce0d

Implement state service on top of etcd. #6902
EngHabu wants to merge 5 commits into enghabu/vet from enghabu/state-etcd

3 participants