Commit ee9ef3f

Module graceful shutdown support (#255)
Provide support for SmartSwitch DPU module graceful shutdown.

Description:

Single source of truth for transitions
- All components now use sonic_platform_base.module_base.ModuleBase helpers:
  - set_module_state_transition(db, name, transition_type)
  - clear_module_state_transition(db, name)
  - get_module_state_transition(db, name) -> dict
  - is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
- Eliminates duplicated logic and race-prone direct Redis writes.

Correct table everywhere
- Standardized on CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
- HLD mismatch addressed in code (HLD fix tracked separately).

Ownership & lifecycle
- The initiator of an operation (startup/shutdown/reboot) sets:
  - state_transition_in_progress=True
  - transition_type=<op>
  - transition_start_time=<utc-iso8601>
- The platform (set_admin_state()) is responsible for clearing:
  - state_transition_in_progress=False
  - optionally transition_end_time=<epoch> (or similar end stamp).
- CLI pre-clears only when a prior transition is timed out.

Timeouts & policy
- Platform JSON path only: /usr/share/sonic/device/{plat}/platform.json; else constants.
- Typical production values used: startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
- Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and is implemented inside platform set_admin_state(), not in ModuleBase.

Boot behavior
- chassisd on start:
  - Clears stale flags once (centralized sweep).
  - Runs set_initial_dpu_admin_state(), which marks transitions via ModuleBase before calling platform set_admin_state().
  - Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.

gNOI shutdown daemon
- Listens on CHASSIS_MODULE_TABLE and triggers only when state_transition_in_progress=True and transition_type=shutdown.
- Never clears the flag (ownership stays with the platform).
- Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).

CLI (config chassis modules …)
- Uses ModuleBase APIs for all set/get/timeout checks.
- If a previous transition is stuck, is_module_state_transition_timed_out() → auto-clear then proceed.
- Sets transition at the start of startup/shutdown; platform clears on completion.
- Fabric card flow retained; edits are surgical.

Redis robustness
- Helpers handle both stacks (swsssdk/swsscommon); no hset(mapping=...) usage.
- Consistent HGETALL/HSET paths; resilient to connector differences.

Race reduction & consistency
- Centralized writes prevent multi-writer races.
- All transition writes include transition_start_time; clears may add an end stamp.
- Existing PCI/file-lock logic left intact; unrelated behavior unchanged.

Change scope
- Minimal, targeted diffs.
- No background tasks added, no broad refactors beyond transition handling.
- Behavior changes are limited to making transition semantics correct and uniform across repos.

Related changes
- HLD: sonic-net/SONiC#1991
- sonic-platform-common: sonic-net/sonic-platform-common#567
- sonic-utilities: sonic-net/sonic-utilities#4031
- sonic-platform-daemons: sonic-net/sonic-platform-daemons#667

How to verify it
- Issue the "config chassis modules shutdown DPUx" command.
- Verify that the DPU module shuts down gracefully by checking the logs in /var/log/syslog on both NPU and DPU.
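
The ownership rules in the description (initiator sets the flag, platform clears it, CLI pre-clears only a timed-out transition) can be illustrated with a small sketch. This is not the ModuleBase implementation: the in-memory dict stands in for the CHASSIS_MODULE_TABLE rows, all function names are hypothetical, and refusing to start while another transition is still in progress is an assumption, since the description does not say what the CLI does in that case.

```python
from datetime import datetime, timedelta, timezone


def set_transition(db, name, transition_type, now=None):
    """Initiator side: record that a transition has started."""
    now = now or datetime.now(timezone.utc)
    db.setdefault(name, {}).update({
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": now.isoformat(),
    })


def clear_transition(db, name):
    """Platform side: record that the transition has finished."""
    db.setdefault(name, {})["state_transition_in_progress"] = "False"


def is_timed_out(db, name, timeout_secs, now=None):
    """True if a transition is in progress and older than timeout_secs."""
    entry = db.get(name, {})
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        # Assumption: an in-progress flag without a start time is stale.
        return True
    now = now or datetime.now(timezone.utc)
    age = now - datetime.fromisoformat(start)
    return age > timedelta(seconds=timeout_secs)


def begin_transition(db, name, transition_type, timeout_secs, now=None):
    """CLI side: pre-clear only a timed-out transition, then start a new one.

    Returns False (assumed behavior) if another transition is still running.
    """
    entry = db.get(name, {})
    if entry.get("state_transition_in_progress") == "True":
        if not is_timed_out(db, name, timeout_secs, now=now):
            return False
        clear_transition(db, name)
    set_transition(db, name, transition_type, now=now)
    return True
```

Passing `now` explicitly keeps the timeout check deterministic in tests; on a real system the helpers would read and write Redis instead of a dict.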
1 parent f1de8e3 commit ee9ef3f

8 files changed: 1047 additions, 2 deletions

data/debian/rules

Lines changed: 1 addition & 0 deletions
@@ -20,5 +20,6 @@ override_dh_installsystemd:
 	dh_installsystemd --no-start --name=procdockerstatsd
 	dh_installsystemd --no-start --name=determine-reboot-cause
 	dh_installsystemd --no-start --name=process-reboot-cause
+	dh_installsystemd --no-start --name=gnoi-shutdown
 	dh_installsystemd $(HOST_SERVICE_OPTS) --name=sonic-hostservice
 

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+[Unit]
+Description=gNOI based DPU Graceful Shutdown Daemon
+Requires=database.service
+Wants=network-online.target
+After=network-online.target database.service
+
+[Service]
+Type=simple
+ExecStartPre=/usr/bin/python3 /usr/local/bin/check_platform.py
+ExecStartPre=/bin/bash /usr/local/bin/wait-for-sonic-core.sh
+ExecStart=/usr/bin/python3 /usr/local/bin/gnoi_shutdown_daemon.py
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
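
The unit above starts gnoi_shutdown_daemon.py, whose source is not shown in this excerpt. According to the commit description, the daemon acts only when a CHASSIS_MODULE_TABLE entry has state_transition_in_progress=True and transition_type=shutdown. A minimal sketch of that condition follows; the function name is hypothetical, and the case-insensitive comparison of the flag is an assumption about how the value is stored.

```python
def should_trigger_shutdown(entry):
    """Return True only for an entry describing an in-progress shutdown.

    `entry` is a plain dict of field names to string values, as a Redis
    hash read would return. Hypothetical helper, not the daemon's own code.
    """
    flag = str(entry.get("state_transition_in_progress", ""))
    in_progress = flag.lower() == "true"  # assumption: tolerate "True"/"true"
    return in_progress and entry.get("transition_type") == "shutdown"
```

The predicate only reads the entry, which matches the stated rule that the daemon never clears the flag.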

scripts/check_platform.py

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+#!/usr/bin/env python3
+"""
+Check if the current platform is a SmartSwitch NPU (not DPU).
+Exit 0 if SmartSwitch NPU, exit 1 otherwise.
+"""
+import sys
+
+def main():
+    try:
+        from sonic_py_common import device_info
+        from utilities_common.chassis import is_dpu
+
+        # Check if SmartSwitch NPU (not DPU)
+        if device_info.is_smartswitch() and not is_dpu():
+            sys.exit(0)
+        else:
+            sys.exit(1)
+    except (ImportError, AttributeError, RuntimeError) as e:
+        sys.stderr.write("check_platform.py failed: {}\n".format(str(e)))
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
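
Because check_platform.py imports SONiC-only modules, its exit-code decision cannot be exercised on a development machine. The sketch below restates the same decision as a pure function with the two checks injected as callables; the function name and this injection approach are illustrative assumptions and are not part of the commit.

```python
def platform_exit_code(is_smartswitch, is_dpu):
    """Return 0 on a SmartSwitch NPU, 1 otherwise (including on errors).

    Both arguments are zero-argument callables returning bool. They stand in
    for device_info.is_smartswitch and utilities_common.chassis.is_dpu.
    """
    try:
        return 0 if is_smartswitch() and not is_dpu() else 1
    except (ImportError, AttributeError, RuntimeError):
        # Same failure policy as the script: any lookup error means "not eligible".
        return 1
```

A non-zero result corresponds to the ExecStartPre check failing, in which case systemd does not run the ExecStart command.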
