fix(iluvatar): move memory query out of health check condition #13

Open
stezpy wants to merge 1 commit into gpustack:main from stezpy:fix/iluvatar_gpu_vram_fail


@stezpy stezpy commented Jun 3, 2026

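On version 2.1.2, a newly added worker node cannot correctly read Iluvatar GPU information.
The cause is the VRAM detection in the runtime's iluvatar.py: under the default conditions the VRAM query is skipped, so no GPU information is shown in the web UI.
This PR changes the GPU VRAM detection flow to fix that.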

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
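
For readers without the diff at hand, the change described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual code in iluvatar.py: `query_memory`, `detect_before`, `detect_after`, and the `health_check_enabled` flag are hypothetical names, and the memory values are made up.

```python
from dataclasses import dataclass

MIB = 1 << 20  # bytes per MiB


@dataclass
class MemInfo:
    total: int  # bytes
    used: int   # bytes


def query_memory(dev: int) -> MemInfo:
    # Hypothetical stand-in for pyixml.nvmlDeviceGetMemoryInfo(dev);
    # returns fixed, made-up values so the sketch runs without hardware.
    return MemInfo(total=32 * 1024 * MIB, used=2 * 1024 * MIB)


def detect_before(dev: int, health_check_enabled: bool) -> tuple[int, int]:
    # Assumed shape of the old flow: the memory query sits inside the
    # health-check branch, so it is skipped when the check is disabled.
    dev_mem = dev_mem_used = 0
    if health_check_enabled:
        info = query_memory(dev)
        dev_mem = info.total // MIB
        dev_mem_used = info.used // MIB
    return dev_mem, dev_mem_used


def detect_after(dev: int, health_check_enabled: bool) -> tuple[int, int]:
    # Assumed shape of the new flow: memory is always queried; only
    # health-specific work stays behind the condition.
    info = query_memory(dev)
    dev_mem = info.total // MIB
    dev_mem_used = info.used // MIB
    if health_check_enabled:
        pass  # health-only checks would remain here
    return dev_mem, dev_mem_used
```

With the health check disabled, `detect_before` reports zero memory, which would be consistent with the symptom described above, while `detect_after` still reports the queried values.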

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request modifies the Iluvatar GPU detector to retrieve device memory information regardless of whether the health check is enabled. The reviewer suggested replacing the use of contextlib.suppress with a try...except block that logs a debug exception if retrieving the memory info fails, preventing silent failures and improving troubleshooting.


Comment on lines +140 to +147
with contextlib.suppress(pyixml.NVMLError):
    dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
    dev_mem = byte_to_mebibyte(  # byte to MiB
        dev_mem_info.total,
    )
    dev_mem_used = byte_to_mebibyte(  # byte to MiB
        dev_mem_info.used,
    )

medium

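When retrieving the memory information fails, contextlib.suppress silently ignores the error. If the query fails, the Web UI shows 0 with no error log at all, which makes troubleshooting difficult.

I suggest using try...except pyixml.NVMLError and calling debug_log_exception in the except block to write a debug log. This matches how failures are handled elsewhere in the codebase (for example, when retrieving the topology distance between devices fails) and helps later maintenance and troubleshooting.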

Suggested change
- with contextlib.suppress(pyixml.NVMLError):
-     dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
-     dev_mem = byte_to_mebibyte(  # byte to MiB
-         dev_mem_info.total,
-     )
-     dev_mem_used = byte_to_mebibyte(  # byte to MiB
-         dev_mem_info.used,
-     )
+ try:
+     dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
+     dev_mem = byte_to_mebibyte(  # byte to MiB
+         dev_mem_info.total,
+     )
+     dev_mem_used = byte_to_mebibyte(  # byte to MiB
+         dev_mem_info.used,
+     )
+ except pyixml.NVMLError:
+     debug_log_exception(
+         logger,
+         "Failed to get memory info for device %d",
+         dev_index,
+     )
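
The behavioral difference the reviewer points at can be reproduced in isolation. The sketch below is a standalone illustration, not code from the repository: `FakeNVMLError` and `failing_query` are hypothetical stand-ins for pyixml.NVMLError and a failing memory query, and `logger.debug(..., exc_info=True)` is used in place of the repository's debug_log_exception helper, whose implementation is not shown in this thread.

```python
import contextlib
import logging

logger = logging.getLogger("iluvatar_sketch")


class FakeNVMLError(Exception):
    """Hypothetical stand-in for pyixml.NVMLError."""


def failing_query(dev_index: int) -> int:
    # Simulates a memory query that always fails.
    raise FakeNVMLError("memory info unavailable")


def read_with_suppress(dev_index: int) -> int:
    dev_mem = 0
    # The error is swallowed: the caller sees 0 and nothing is logged.
    with contextlib.suppress(FakeNVMLError):
        dev_mem = failing_query(dev_index)
    return dev_mem


def read_with_logging(dev_index: int) -> int:
    dev_mem = 0
    try:
        dev_mem = failing_query(dev_index)
    except FakeNVMLError:
        # Same fallback value, but the failure and its traceback are
        # recorded at DEBUG level.
        logger.debug(
            "Failed to get memory info for device %d",
            dev_index,
            exc_info=True,
        )
    return dev_mem
```

Both variants return the same fallback value of 0; only the second leaves a trace that can be found when debug logging is enabled.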

@stezpy (Author)

The original code did not handle this case that way either, so I would appreciate the maintainers' guidance during review. I am happy to make whatever changes are requested.
