Commit 52c8867

Module graceful shutdown support
Provide support for SmartSwitch DPU module graceful shutdown.

# Description:

* **Single source of truth for transitions**
  * All components now use `sonic_platform_base.module_base.ModuleBase` helpers:
    * `set_module_state_transition(db, name, transition_type)`
    * `clear_module_state_transition(db, name)`
    * `get_module_state_transition(db, name) -> dict`
    * `is_module_state_transition_timed_out(db, name, timeout_secs) -> bool`
  * Eliminates duplicated logic and race-prone direct Redis writes.
* **Correct table everywhere**
  * Standardized on **`CHASSIS_MODULE_TABLE`** (replaces `CHASSIS_MODULE_INFO_TABLE`).
  * HLD mismatch addressed in code (HLD fix tracked separately).
* **Ownership & lifecycle**
  * The **initiator** of an operation (`startup`/`shutdown`/`reboot`) sets:
    * `state_transition_in_progress=True`
    * `transition_type=<op>`
    * `transition_start_time=<utc-iso8601>`
  * The **platform** (`set_admin_state()`) is responsible for clearing:
    * `state_transition_in_progress=False`
    * optionally `transition_end_time=<epoch>` (or similar end stamp).
  * CLI pre-clears only when a prior transition is **timed out**.
* **Timeouts & policy**
  * Platform JSON path only: `/usr/share/sonic/device/{plat}/platform.json`; else **constants**.
  * Typical production values used:
    * `startup: 180s`, `shutdown: 180s` (≈ `graceful_wait 60s + power 120s`), `reboot: 120s`.
  * **Graceful wait** (e.g., waiting for “Graceful shutdown complete”) is a **platform policy** and implemented inside platform `set_admin_state()`, not in ModuleBase.
* **Boot behavior**
  * `chassisd` on start:
    1. **Clears stale flags once** (centralized sweep).
    2. Runs `set_initial_dpu_admin_state()` which **marks transitions** via ModuleBase before calling platform `set_admin_state()`.
    3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
* **gNOI shutdown daemon**
  * Listens on **`CHASSIS_MODULE_TABLE`** and triggers only when:
    * `state_transition_in_progress=True` **and** `transition_type=shutdown`.
  * Never clears the flag (ownership stays with the platform).
  * Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
* **CLI (`config chassis modules …`)**
  * Uses ModuleBase APIs for all set/get/timeout checks.
  * If a previous transition is stuck, `is_module_state_transition_timed_out()` → auto-clear then proceed.
  * Sets transition at the start of `startup`/`shutdown`; platform clears on completion.
  * Fabric card flow retained; edits are surgical.
* **Redis robustness**
  * Helpers handle both stacks (swsssdk/swsscommon); no `hset(mapping=...)` usage.
  * Consistent HGETALL/HSET paths; resilient to connector differences.
* **Race reduction & consistency**
  * Centralized writes prevent multi-writer races.
  * All transition writes include `transition_start_time`; clears may add an end stamp.
  * Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
* **Change scope**
  * Minimal, targeted diffs.
  * No background tasks added, no broad refactors beyond transition handling.
  * Behavior changes are limited to making transition semantics correct and uniform across repos.

HLD: sonic-net/SONiC#1991
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667

How to verify it

Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU
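
The ownership and timeout rules described above can be illustrated with a small, self-contained sketch. This is not the `ModuleBase` implementation: a plain dict stands in for Redis, the `TABLE|name` key layout and the string values `"True"`/`"False"` are assumptions, the optional `now` parameter exists only to make the example deterministic, and `should_trigger_gnoi_shutdown` is a hypothetical name for the daemon's trigger condition. The function names mirror the helpers listed in the description; the timeout constants are the "typical production values" quoted there.

```python
from datetime import datetime, timedelta, timezone

TABLE = "CHASSIS_MODULE_TABLE"
# Timeouts (seconds) quoted in the commit description.
TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}


def _key(name):
    # Assumed key layout; the real separator depends on the database.
    return "{}|{}".format(TABLE, name)


def set_module_state_transition(db, name, transition_type, now=None):
    """Initiator side: mark a transition as started."""
    now = now or datetime.now(timezone.utc)
    db.setdefault(_key(name), {}).update({
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": now.isoformat(),
    })


def clear_module_state_transition(db, name, now=None):
    """Platform side: clear the flag and record an end stamp."""
    now = now or datetime.now(timezone.utc)
    entry = db.setdefault(_key(name), {})
    entry["state_transition_in_progress"] = "False"
    entry["transition_end_time"] = now.isoformat()


def get_module_state_transition(db, name):
    """Return a copy of the module's transition fields (empty if none)."""
    return dict(db.get(_key(name), {}))


def is_module_state_transition_timed_out(db, name, timeout_secs, now=None):
    """True only if a transition is in progress and older than the timeout."""
    entry = get_module_state_transition(db, name)
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        # Assumption: an in-progress flag with no start stamp is stale.
        return True
    now = now or datetime.now(timezone.utc)
    elapsed = now - datetime.fromisoformat(start)
    return elapsed > timedelta(seconds=timeout_secs)


def should_trigger_gnoi_shutdown(entry):
    """Hypothetical predicate for the gNOI daemon's trigger condition."""
    return (entry.get("state_transition_in_progress") == "True"
            and entry.get("transition_type") == "shutdown")
```

In this sketch a CLI-style caller would check `is_module_state_transition_timed_out(db, "DPU0", TIMEOUTS["shutdown"])` before starting a new operation, call `clear_module_state_transition` only if that returns true, and then call `set_module_state_transition`. The daemon-style predicate only reads the entry and never clears it, matching the ownership rule that clearing belongs to the platform.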
1 parent 1633661 commit 52c8867

8 files changed: +1047 -2 lines changed

data/debian/rules

Lines changed: 1 addition & 0 deletions
@@ -20,5 +20,6 @@ override_dh_installsystemd:
 	dh_installsystemd --no-start --name=procdockerstatsd
 	dh_installsystemd --no-start --name=determine-reboot-cause
 	dh_installsystemd --no-start --name=process-reboot-cause
+	dh_installsystemd --no-start --name=gnoi-shutdown
 	dh_installsystemd $(HOST_SERVICE_OPTS) --name=sonic-hostservice

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+[Unit]
+Description=gNOI based DPU Graceful Shutdown Daemon
+Requires=database.service
+Wants=network-online.target
+After=network-online.target database.service
+
+[Service]
+Type=simple
+ExecStartPre=/usr/bin/python3 /usr/local/bin/check_platform.py
+ExecStartPre=/bin/bash /usr/local/bin/wait-for-sonic-core.sh
+ExecStart=/usr/bin/python3 /usr/local/bin/gnoi_shutdown_daemon.py
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target

scripts/check_platform.py

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+#!/usr/bin/env python3
+"""
+Check if the current platform is a SmartSwitch NPU (not DPU).
+Exit 0 if SmartSwitch NPU, exit 1 otherwise.
+"""
+import sys
+
+def main():
+    try:
+        from sonic_py_common import device_info
+        from utilities_common.chassis import is_dpu
+
+        # Check if SmartSwitch NPU (not DPU)
+        if device_info.is_smartswitch() and not is_dpu():
+            sys.exit(0)
+        else:
+            sys.exit(1)
+    except (ImportError, AttributeError, RuntimeError) as e:
+        sys.stderr.write("check_platform failed: {}\n".format(str(e)))
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
