Design doc: Replace docker exec gnoi_client with native gRPC calls#361
Design doc: Replace docker exec gnoi_client with native gRPC calls#361sigabrtv1-ui wants to merge 9 commits into
Conversation
Proposes replacing subprocess-based gnoi_client invocations in gnoi_shutdown_daemon with direct Python gRPC calls. Documents the gnoi_client output format, a parsing bug in RebootStatus polling, and the difficulty of diagnosing RPC failures through Go panic stack traces. Ref: sonic-net#360 Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
Adds a design document describing how gnoi_shutdown_daemon should be refactored to replace docker exec gnmi gnoi_client ... subprocess calls with native Python gRPC calls to the OpenConfig gNOI System service, improving reliability and error visibility across the host/container boundary.
Changes:
- Documented current failure modes of the
gnoi_clientsubprocess approach (panic-based errors, fragile stdout parsing, gnmi container dependency). - Proposed a phased implementation plan: vendor protobuf stubs, add a
GnoiClientwrapper, refactorGnoiRebootHandler, and update unit tests. - Defined the expected success criteria for
RebootStatus(based on response fields rather than stdout substrings).
| if not resp.active: | ||
| status_str = system_pb2.RebootStatus.Status.Name(resp.status.status) | ||
| logger.log_notice(f"{dpu_name}: RebootStatus complete: {status_str} - {resp.status.message}") | ||
| return resp.status.status == system_pb2.RebootStatus.Status.STATUS_SUCCESS | ||
| except grpc.RpcError as e: |
There was a problem hiding this comment.
The _poll_reboot_status example treats the completion status enum as system_pb2.RebootStatus.Status..., but earlier the doc notes types/types.proto is a dependency and the JSON example shows a nested status message. Please update the example to reference the actual enum type from the generated types_pb2 (and import it) so implementers don't copy an API that won't exist in the generated stubs.
| ```python | ||
| @mock.patch('gnoi_shutdown_daemon.GnoiClient') | ||
| def test_send_reboot_command_success(self, MockClient): | ||
| mock_client = MockClient.return_value.__enter__.return_value | ||
| # reboot() returns None on success | ||
| mock_client.reboot.return_value = None | ||
|
|
||
| result = handler._send_reboot_command("DPU0", "10.0.0.1", "8080") | ||
| assert result is True | ||
| mock_client.reboot.assert_called_once() | ||
|
|
||
| @mock.patch('gnoi_shutdown_daemon.GnoiClient') | ||
| def test_send_reboot_command_failure(self, MockClient): | ||
| mock_client = MockClient.return_value.__enter__.return_value | ||
| mock_client.reboot.side_effect = grpc.RpcError() | ||
|
|
||
| result = handler._send_reboot_command("DPU0", "10.0.0.1", "8080") | ||
| assert result is False | ||
| ``` |
There was a problem hiding this comment.
The test snippet uses mock_client.reboot.side_effect = grpc.RpcError(), but the refactor example calls e.code()/e.details() on the caught exception. A plain grpc.RpcError() instance typically won't provide those methods; please adjust the example to raise/mock an exception object that implements the grpc.Call interface (or explicitly mock code()/details()) so the proposed tests match the proposed logging behavior.
| **Files to add:** | ||
| ``` | ||
| host_modules/gnoi/ | ||
| ├── __init__.py | ||
| ├── system_pb2.py # generated message classes | ||
| └── system_pb2_grpc.py # generated service stubs | ||
| ``` | ||
|
|
||
| The stubs are generated from: | ||
| - https://github.com/openconfig/gnoi/blob/main/system/system.proto | ||
| - https://github.com/openconfig/gnoi/blob/main/types/types.proto (dependency) |
There was a problem hiding this comment.
The stub file list doesn't include the generated module for the types/types.proto dependency. If you vendor stubs generated from both system.proto and types.proto, you will also get (and need to ship/import) something like types_pb2.py (and potentially types_pb2_grpc.py if any services exist), otherwise the generated system_pb2.py will have an unresolved import.
There was a problem hiding this comment.
Fixed — added types_pb2.py and types_pb2_grpc.py to the vendored stubs list.
…proto Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
…tion Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
- Add types_pb2.py to vendored stubs file list (types.proto dependency) - Fix RebootStatus enum access pattern for protobuf Python codegen - Fix test mock to implement grpc.Call interface (code()/details()) Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Addressed all three Copilot review comments in the latest push:
|
| /sonic/gnoi_client/gnoi_client.go:42 | ||
| ... | ||
| ``` | ||
| The daemon captures this in `err` (stderr) but never logs or inspects it — it just logs `"Reboot command failed"` with no context. Diagnosing production failures requires SSHing into the switch, manually running the docker exec command, and reading Go stack traces. |
There was a problem hiding this comment.
The statement that the daemon “captures this in err (stderr)” is not accurate for the Reboot call: _send_reboot_command() invokes execute_command(..., suppress_stderr=True), so stderr (including the Go panic) is intentionally discarded. Consider rewording to reflect that stderr is suppressed/ignored today, which is why the daemon logs have no actionable context.
| The daemon captures this in `err` (stderr) but never logs or inspects it — it just logs `"Reboot command failed"` with no context. Diagnosing production failures requires SSHing into the switch, manually running the docker exec command, and reading Go stack traces. | |
| For the Reboot call, `_send_reboot_command()` invokes `execute_command(..., suppress_stderr=True)`, so this panic output on stderr is suppressed rather than logged or inspected — the daemon just logs `"Reboot command failed"` with no actionable context. Diagnosing production failures requires SSHing into the switch, manually running the docker exec command, and reading Go stack traces. |
There was a problem hiding this comment.
Fixed — rewording now reflects that _send_reboot_command() uses suppress_stderr=True, so the panic output is suppressed rather than captured.
| host_modules/gnoi/ | ||
| ├── __init__.py | ||
| ├── system_pb2.py # generated message classes | ||
| ├── system_pb2_grpc.py # generated service stubs | ||
| ├── types_pb2.py # generated types (dependency of system_pb2) |
There was a problem hiding this comment.
The design proposes adding host_modules/gnoi/ as a new Python package, but setup.py currently lists packages explicitly (['host_modules', 'utils']) rather than using find_packages(). Without updating packaging (e.g., include host_modules.gnoi or switch to package discovery), the vendored stubs/client won’t be installed and imports will fail in deployed environments. Please call this out in the design (and implementation plan).
There was a problem hiding this comment.
Good catch — added a packaging note calling out that setup.py must include host_modules.gnoi in the packages list (and package_dir), since it uses explicit listing rather than find_packages().
- Correct stderr statement: _send_reboot_command uses suppress_stderr=True, so panic output is suppressed, not captured - Add packaging note: setup.py must include host_modules.gnoi in packages list for vendored stubs to be installed Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| **Files to add:** | ||
| ``` | ||
| host_modules/gnoi/ | ||
| ├── __init__.py | ||
| ├── system_pb2.py # generated message classes | ||
| ├── system_pb2_grpc.py # generated service stubs | ||
| ├── types_pb2.py # generated types (dependency of system_pb2) | ||
| └── types_pb2_grpc.py # generated (empty, no services in types.proto) | ||
| ``` | ||
|
|
||
| The stubs are generated from: | ||
| - https://github.com/openconfig/gnoi/blob/main/system/system.proto | ||
| - https://github.com/openconfig/gnoi/blob/main/types/types.proto (dependency) | ||
|
|
||
| Generation command (for reference / CI reproducibility): | ||
| ```bash | ||
| python -m grpc_tools.protoc \ | ||
| -I./proto \ | ||
| --python_out=host_modules/gnoi \ | ||
| --grpc_python_out=host_modules/gnoi \ | ||
| system/system.proto types/types.proto | ||
| ``` |
There was a problem hiding this comment.
The proposed stub file layout (system_pb2.py/types_pb2.py directly under host_modules/gnoi/) conflicts with the stated protoc invocation using system/system.proto and types/types.proto. With those input paths, protoc will normally emit modules under system/ and types/ subdirectories, and the wrapper import examples would need to match (or the generation command/proto paths need to be adjusted to produce flat modules). Please make the file list, generation command, and example imports consistent so an implementer can follow this doc without hitting import/package path issues.
|
|
||
| **RebootStatus RPC (`-rpc RebootStatus`):** | ||
| - On **success**: prints `"System RebootStatus\n"` header followed by JSON-marshaled `RebootStatusResponse`, e.g.: | ||
| ```json |
There was a problem hiding this comment.
The RebootStatus output example is tagged as a json code block but includes a non-JSON header line (System RebootStatus). Consider changing the fence language to text (or separating the header from the JSON) so the example is syntactically accurate and easier to copy/paste for debugging.
| ```json | |
| ```text |
Section 8 now links to gnoi_client, reboot.go, and system.proto instead of inlining full source code. Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…rplate Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
a9f8d44 to
c6f1fe0
Compare
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
c6f1fe0 to
0cfe63f
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
0cfe63f to
606230e
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
|
||
| | Risk | Mitigation | | ||
| |------|------------| | ||
| | grpcio/protobuf not in host environment | Already used by other SONiC components | |
There was a problem hiding this comment.
The risk/mitigation table says grpcio/protobuf are “already used by other SONiC components”, but this repo’s setup.py doesn’t currently declare grpcio / protobuf as runtime dependencies. The design should explicitly call out that the implementation must add these dependencies (and any Debian packaging equivalents) or otherwise ensure they’re present on the host image.
| | grpcio/protobuf not in host environment | Already used by other SONiC components | | |
| | grpcio/protobuf not in host environment | Add `grpcio` and `protobuf` as runtime dependencies in this repo's `setup.py` and Debian packaging, or otherwise guarantee they are present on the host image | |
| New directory structure: | ||
| ``` | ||
| host_modules/gnoi/ | ||
| ├── __init__.py | ||
| ├── client.py # GnoiClient: reboot(), reboot_status(), context manager | ||
| ├── system_pb2.py # vendored from openconfig/gnoi system.proto | ||
| ├── system_pb2_grpc.py | ||
| ├── types_pb2.py # dependency of system.proto | ||
| └── types_pb2_grpc.py | ||
| ``` |
There was a problem hiding this comment.
The design proposes adding a new subpackage under host_modules/gnoi/, but this doc doesn’t call out that setup.py uses an explicit packages=[...] list. Without explicitly adding host_modules.gnoi (or switching to package discovery), the new client/stubs won’t be installed in deployed environments and imports will fail. Please document this packaging requirement in the scope/implementation plan section.
606230e to
a4f3f04
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
a4f3f04 to
eb0cf40
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
|
||
| New directory structure: | ||
| ``` | ||
| host_modules/gnoi/ |
There was a problem hiding this comment.
@hdwhdw Do you plan to implement this in sonic-host-services ? if available it can be used from containers like pmon
Description
Design document proposing the replacement of
docker exec gnmi gnoi_clientsubprocess calls ingnoi_shutdown_daemonwith direct Python gRPC calls.Key findings from analyzing the sonic-gnmi
gnoi_clientsource:Error handling is a Go
panic()— on any RPC failure,gnoi_clientdumps a Go panic stack trace to stderr and exits non-zero. The daemon only checks the return code and logs a generic "command failed" with zero context. Diagnosing production failures requires SSH + manual docker exec + reading Go stack traces.RebootStatus parsing is broken — the daemon checks
"reboot complete" in out_s.lower()but the actual output is JSON-marshaled protobuf ({"active":false,"status":{"status":"STATUS_SUCCESS"}}). The string "reboot complete" never appears, so the poll always times out regardless of DPU state.gnmi container dependency — if the gnmi container is unhealthy or restarting, DPU shutdown silently fails.
Proposed solution:
GnoiClientPython wrapper usinggrpcioGnoiRebootHandlerfor direct gRPC with structured error handlingresp.active == False+resp.status.status)Design doc:
doc/gnoi-native-grpc-design.mdTracking issue: #360