Skip to content

Design doc: Replace docker exec gnoi_client with native gRPC calls#361

Open
sigabrtv1-ui wants to merge 9 commits into
sonic-net:masterfrom
sigabrtv1-ui:design/gnoi-native-grpc
Open

Design doc: Replace docker exec gnoi_client with native gRPC calls#361
sigabrtv1-ui wants to merge 9 commits into
sonic-net:masterfrom
sigabrtv1-ui:design/gnoi-native-grpc

Conversation

@sigabrtv1-ui
Copy link
Copy Markdown

Description

Design document proposing the replacement of docker exec gnmi gnoi_client subprocess calls in gnoi_shutdown_daemon with direct Python gRPC calls.

Key findings from analyzing the sonic-gnmi gnoi_client source:

  1. Error handling is a Go panic() — on any RPC failure, gnoi_client dumps a Go panic stack trace to stderr and exits non-zero. The daemon only checks the return code and logs a generic "command failed" with zero context. Diagnosing production failures requires SSH + manual docker exec + reading Go stack traces.

  2. RebootStatus parsing is broken — the daemon checks "reboot complete" in out_s.lower() but the actual output is JSON-marshaled protobuf ({"active":false,"status":{"status":"STATUS_SUCCESS"}}). The string "reboot complete" never appears, so the poll always times out regardless of DPU state.

  3. gnmi container dependency — if the gnmi container is unhealthy or restarting, DPU shutdown silently fails.

Proposed solution:

  • Vendor gNOI System proto stubs
  • Add a GnoiClient Python wrapper using grpcio
  • Refactor GnoiRebootHandler for direct gRPC with structured error handling
  • Fix the RebootStatus completion check (resp.active == False + resp.status.status)

Design doc: doc/gnoi-native-grpc-design.md
Tracking issue: #360

Proposes replacing subprocess-based gnoi_client invocations in
gnoi_shutdown_daemon with direct Python gRPC calls.

Documents the gnoi_client output format, a parsing bug in RebootStatus
polling, and the difficulty of diagnosing RPC failures through Go panic
stack traces.

Ref: sonic-net#360

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a design document describing how gnoi_shutdown_daemon should be refactored to replace docker exec gnmi gnoi_client ... subprocess calls with native Python gRPC calls to the OpenConfig gNOI System service, improving reliability and error visibility across the host/container boundary.

Changes:

  • Documented current failure modes of the gnoi_client subprocess approach (panic-based errors, fragile stdout parsing, gnmi container dependency).
  • Proposed a phased implementation plan: vendor protobuf stubs, add a GnoiClient wrapper, refactor GnoiRebootHandler, and update unit tests.
  • Defined the expected success criteria for RebootStatus (based on response fields rather than stdout substrings).

Comment thread doc/gnoi-native-grpc-design.md Outdated
Comment on lines +219 to +223
if not resp.active:
status_str = system_pb2.RebootStatus.Status.Name(resp.status.status)
logger.log_notice(f"{dpu_name}: RebootStatus complete: {status_str} - {resp.status.message}")
return resp.status.status == system_pb2.RebootStatus.Status.STATUS_SUCCESS
except grpc.RpcError as e:
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _poll_reboot_status example treats the completion status enum as system_pb2.RebootStatus.Status..., but earlier the doc notes types/types.proto is a dependency and the JSON example shows a nested status message. Please update the example to reference the actual enum type from the generated types_pb2 (and import it) so implementers don't copy an API that won't exist in the generated stubs.

Copilot uses AI. Check for mistakes.
Comment on lines +248 to +266
```python
@mock.patch('gnoi_shutdown_daemon.GnoiClient')
def test_send_reboot_command_success(self, MockClient):
mock_client = MockClient.return_value.__enter__.return_value
# reboot() returns None on success
mock_client.reboot.return_value = None

result = handler._send_reboot_command("DPU0", "10.0.0.1", "8080")
assert result is True
mock_client.reboot.assert_called_once()

@mock.patch('gnoi_shutdown_daemon.GnoiClient')
def test_send_reboot_command_failure(self, MockClient):
mock_client = MockClient.return_value.__enter__.return_value
mock_client.reboot.side_effect = grpc.RpcError()

result = handler._send_reboot_command("DPU0", "10.0.0.1", "8080")
assert result is False
```
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test snippet uses mock_client.reboot.side_effect = grpc.RpcError(), but the refactor example calls e.code()/e.details() on the caught exception. A plain grpc.RpcError() instance typically won't provide those methods; please adjust the example to raise/mock an exception object that implements the grpc.Call interface (or explicitly mock code()/details()) so the proposed tests match the proposed logging behavior.

Copilot uses AI. Check for mistakes.
Comment thread doc/gnoi-native-grpc-design.md Outdated
Comment on lines +83 to +93
**Files to add:**
```
host_modules/gnoi/
├── __init__.py
├── system_pb2.py # generated message classes
└── system_pb2_grpc.py # generated service stubs
```

The stubs are generated from:
- https://github.com/openconfig/gnoi/blob/main/system/system.proto
- https://github.com/openconfig/gnoi/blob/main/types/types.proto (dependency)
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stub file list doesn't include the generated module for the types/types.proto dependency. If you vendor stubs generated from both system.proto and types.proto, you will also get (and need to ship/import) something like types_pb2.py (and potentially types_pb2_grpc.py if any services exist), otherwise the generated system_pb2.py will have an unresolved import.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added types_pb2.py and types_pb2_grpc.py to the vendored stubs list.

hdwhdw added 2 commits March 21, 2026 04:05
…proto

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
…tion

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@hdwhdw hdwhdw self-requested a review March 21, 2026 04:07
- Add types_pb2.py to vendored stubs file list (types.proto dependency)
- Fix RebootStatus enum access pattern for protobuf Python codegen
- Fix test mock to implement grpc.Call interface (code()/details())

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sigabrtv1-ui
Copy link
Copy Markdown
Author

Addressed all three Copilot review comments in the latest push:

  1. types_pb2.py missing (line 95): Added types_pb2.py and types_pb2_grpc.py to the vendored stubs file list.
  2. RebootStatus enum access (line 226): Updated to system_pb2.RebootStatus.STATUS_SUCCESS which matches protobuf Python codegen output.
  3. Test mock missing code()/details() (line 272): Now uses mock.create_autospec(grpc.RpcError) with code() and details() properly mocked.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread doc/gnoi-native-grpc-design.md Outdated
/sonic/gnoi_client/gnoi_client.go:42
...
```
The daemon captures this in `err` (stderr) but never logs or inspects it — it just logs `"Reboot command failed"` with no context. Diagnosing production failures requires SSHing into the switch, manually running the docker exec command, and reading Go stack traces.
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The statement that the daemon “captures this in err (stderr)” is not accurate for the Reboot call: _send_reboot_command() invokes execute_command(..., suppress_stderr=True), so stderr (including the Go panic) is intentionally discarded. Consider rewording to reflect that stderr is suppressed/ignored today, which is why the daemon logs have no actionable context.

Suggested change
The daemon captures this in `err` (stderr) but never logs or inspects it — it just logs `"Reboot command failed"` with no context. Diagnosing production failures requires SSHing into the switch, manually running the docker exec command, and reading Go stack traces.
For the Reboot call, `_send_reboot_command()` invokes `execute_command(..., suppress_stderr=True)`, so this panic output on stderr is suppressed rather than logged or inspected — the daemon just logs `"Reboot command failed"` with no actionable context. Diagnosing production failures requires SSHing into the switch, manually running the docker exec command, and reading Go stack traces.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — rewording now reflects that _send_reboot_command() uses suppress_stderr=True, so the panic output is suppressed rather than captured.

Comment thread doc/gnoi-native-grpc-design.md Outdated
Comment on lines +85 to +89
host_modules/gnoi/
├── __init__.py
├── system_pb2.py # generated message classes
├── system_pb2_grpc.py # generated service stubs
├── types_pb2.py # generated types (dependency of system_pb2)
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design proposes adding host_modules/gnoi/ as a new Python package, but setup.py currently lists packages explicitly (['host_modules', 'utils']) rather than using find_packages(). Without updating packaging (e.g., include host_modules.gnoi or switch to package discovery), the vendored stubs/client won’t be installed and imports will fail in deployed environments. Please call this out in the design (and implementation plan).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — added a packaging note calling out that setup.py must include host_modules.gnoi in the packages list (and package_dir), since it uses explicit listing rather than find_packages().

- Correct stderr statement: _send_reboot_command uses suppress_stderr=True,
  so panic output is suppressed, not captured
- Add packaging note: setup.py must include host_modules.gnoi in packages
  list for vendored stubs to be installed

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread doc/gnoi-native-grpc-design.md Outdated
Comment on lines +83 to +104
**Files to add:**
```
host_modules/gnoi/
├── __init__.py
├── system_pb2.py # generated message classes
├── system_pb2_grpc.py # generated service stubs
├── types_pb2.py # generated types (dependency of system_pb2)
└── types_pb2_grpc.py # generated (empty, no services in types.proto)
```

The stubs are generated from:
- https://github.com/openconfig/gnoi/blob/main/system/system.proto
- https://github.com/openconfig/gnoi/blob/main/types/types.proto (dependency)

Generation command (for reference / CI reproducibility):
```bash
python -m grpc_tools.protoc \
-I./proto \
--python_out=host_modules/gnoi \
--grpc_python_out=host_modules/gnoi \
system/system.proto types/types.proto
```
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed stub file layout (system_pb2.py/types_pb2.py directly under host_modules/gnoi/) conflicts with the stated protoc invocation using system/system.proto and types/types.proto. With those input paths, protoc will normally emit modules under system/ and types/ subdirectories, and the wrapper import examples would need to match (or the generation command/proto paths need to be adjusted to produce flat modules). Please make the file list, generation command, and example imports consistent so an implementer can follow this doc without hitting import/package path issues.

Copilot uses AI. Check for mistakes.
Comment thread doc/gnoi-native-grpc-design.md Outdated

**RebootStatus RPC (`-rpc RebootStatus`):**
- On **success**: prints `"System RebootStatus\n"` header followed by JSON-marshaled `RebootStatusResponse`, e.g.:
```json
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RebootStatus output example is tagged as a json code block but includes a non-JSON header line (System RebootStatus). Consider changing the fence language to text (or separating the header from the JSON) so the example is syntactically accurate and easier to copy/paste for debugging.

Suggested change
```json
```text

Copilot uses AI. Check for mistakes.
Section 8 now links to gnoi_client, reboot.go, and system.proto
instead of inlining full source code.

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

…rplate

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown

/azp run

@sigabrtv1-ui sigabrtv1-ui force-pushed the design/gnoi-native-grpc branch from a9f8d44 to c6f1fe0 Compare March 23, 2026 16:36
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sigabrtv1-ui sigabrtv1-ui force-pushed the design/gnoi-native-grpc branch from c6f1fe0 to 0cfe63f Compare March 23, 2026 16:38
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sigabrtv1-ui sigabrtv1-ui force-pushed the design/gnoi-native-grpc branch from 0cfe63f to 606230e Compare March 23, 2026 16:40
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


| Risk | Mitigation |
|------|------------|
| grpcio/protobuf not in host environment | Already used by other SONiC components |
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The risk/mitigation table says grpcio/protobuf are “already used by other SONiC components”, but this repo’s setup.py doesn’t currently declare grpcio / protobuf as runtime dependencies. The design should explicitly call out that the implementation must add these dependencies (and any Debian packaging equivalents) or otherwise ensure they’re present on the host image.

Suggested change
| grpcio/protobuf not in host environment | Already used by other SONiC components |
| grpcio/protobuf not in host environment | Add `grpcio` and `protobuf` as runtime dependencies in this repo's `setup.py` and Debian packaging, or otherwise guarantee they are present on the host image |

Copilot uses AI. Check for mistakes.
Comment on lines +92 to +101
New directory structure:
```
host_modules/gnoi/
├── __init__.py
├── client.py # GnoiClient: reboot(), reboot_status(), context manager
├── system_pb2.py # vendored from openconfig/gnoi system.proto
├── system_pb2_grpc.py
├── types_pb2.py # dependency of system.proto
└── types_pb2_grpc.py
```
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design proposes adding a new subpackage under host_modules/gnoi/, but this doc doesn’t call out that setup.py uses an explicit packages=[...] list. Without explicitly adding host_modules.gnoi (or switching to package discovery), the new client/stubs won’t be installed in deployed environments and imports will fail. Please document this packaging requirement in the scope/implementation plan section.

Copilot uses AI. Check for mistakes.
@sigabrtv1-ui sigabrtv1-ui force-pushed the design/gnoi-native-grpc branch from 606230e to a4f3f04 Compare March 23, 2026 16:49
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@sigabrtv1-ui sigabrtv1-ui force-pushed the design/gnoi-native-grpc branch from a4f3f04 to eb0cf40 Compare March 23, 2026 16:58
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).


New directory structure:
```
host_modules/gnoi/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hdwhdw Do you plan to implement this in sonic-host-services ? if available it can be used from containers like pmon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants