Best Practice for six SD650-N V2 nodes with 400 W GPU running in a DW612 enclosure equipped with six 2400 W power supplies - Lenovo ThinkSystem SD650-N V2
Symptom
When six (6) SD650-N V2 nodes with 400 W GPUs run the same application simultaneously with default settings in a DW612 enclosure equipped with six (6) 2400 W power supplies, the nodes may experience intermittent system throttle events (from either the CPU or the GPU baseboard), accompanied by power supply throttle events on SMM2. Examples of both kinds of event follow.
# SMM2 Events
0807019f | Warning | 2021-05-01 17:18:01 | +0000 | PSU 5 Throttle: Power Supply sensor, transition to Non-Critical from OK was asserted
180701f1 | Warning | 2021-05-01 17:18:02 | +0000 | PSOC Throttle O: Chassis sensor, transition to Non-Critical from OK was asserted
0887019f | Normal | 2021-05-01 17:18:02 | +0000 | PSU 5 Throttle: Power Supply sensor, transition to Non-Critical from OK was deasserted
188701f1 | Normal | 2021-05-01 17:18:02 | +0000 | PSOC Throttle O: Chassis sensor, transition to Non-Critical from OK was deasserted
# XCC Events
I 07/15/2021 14:08:03.759 Sensor GPU Board has transitioned to a less severe state from critical.
I 07/15/2021 14:08:03.658 Sensor GPU Board has transitioned to normal state.
I 07/15/2021 14:07:49.074 The Processor processor 2 is no longer operating in a Degraded State.
I 07/15/2021 14:07:48.509 The Processor processor 1 is no longer operating in a Degraded State.
E 07/15/2021 14:07:43.741 Sensor GPU Board has transitioned to critical from a less severe state.
W 07/15/2021 14:07:43.604 The Processor processor 2 is operating in a Degraded State.
W 07/15/2021 14:07:41.615 The Processor processor 1 is operating in a Degraded State.
(where SMM = System Management Module, XCC = Lenovo XClarity Controller)
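To judge how long each power supply throttle lasted, the SMM2 events above can be paired into assert/deassert intervals. The following is a minimal sketch, not a Lenovo tool: the pipe-delimited layout is assumed from the sample events in this tip, and the names parse_event and throttle_durations are hypothetical.

```python
from datetime import datetime

# Sample copied from the SMM2 events in this tip, one event per line.
SAMPLE = """\
0807019f | Warning | 2021-05-01 17:18:01 | +0000 | PSU 5 Throttle: Power Supply sensor, transition to Non-Critical from OK was asserted
180701f1 | Warning | 2021-05-01 17:18:02 | +0000 | PSOC Throttle O: Chassis sensor, transition to Non-Critical from OK was asserted
0887019f | Normal | 2021-05-01 17:18:02 | +0000 | PSU 5 Throttle: Power Supply sensor, transition to Non-Critical from OK was deasserted
188701f1 | Normal | 2021-05-01 17:18:02 | +0000 | PSOC Throttle O: Chassis sensor, transition to Non-Critical from OK was deasserted
"""


def parse_event(line):
    """Split one event line into its fields (layout assumed from the sample)."""
    event_id, severity, stamp, _tz, message = [f.strip() for f in line.split("|", 4)]
    return {
        "id": event_id,
        "severity": severity,
        "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "sensor": message.split(":", 1)[0],
        # "was deasserted" does not end with "was asserted", so this is unambiguous.
        "asserted": message.endswith("was asserted"),
    }


def throttle_durations(text):
    """Return (sensor, seconds) for each assert/deassert pair of a throttle sensor."""
    open_events = {}
    durations = []
    for line in text.splitlines():
        if "Throttle" not in line:
            continue
        event = parse_event(line)
        if event["asserted"]:
            open_events[event["sensor"]] = event["time"]
        elif event["sensor"] in open_events:
            start = open_events.pop(event["sensor"])
            durations.append((event["sensor"], (event["time"] - start).total_seconds()))
    return durations


if __name__ == "__main__":
    for sensor, seconds in throttle_durations(SAMPLE):
        print(f"{sensor}: throttled for {seconds:.0f} s")
```

With the sample events, both throttles clear within about one second, which matches the short throttle and unthrottle pattern described under Additional Information.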
Affected Configurations
The system may be any of the following Lenovo servers:
- ThinkSystem SD650-N V2, Type 7D1N, any model
This tip is not software specific.
This tip is not option specific.
The system has the symptom described above.
Solution
This behavior can be corrected by applying both actions below.
- Apply the 3Q21 or later version of Lenovo Scalable Infrastructure (LeSI) Best Recipes including the SD650-N V2 XCC firmware.
- Use the NVIDIA nvidia-smi utility to lock the GPU clock at a fixed frequency so that the GPU runs at a steady power level with no boost. The best clock value depends on the workload and may need to be found by experiment.
Workaround
Use the NVIDIA nvidia-smi utility to lock the GPU clock at a fixed frequency so that the GPU runs at a steady power level with no boost. The optimal clock value depends on the workload and may need to be found by experiment.
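As a rough illustration of the clock-locking step, the sketch below builds the nvidia-smi command lines without executing them. The --lock-gpu-clocks, --reset-gpu-clocks, and -q -d SUPPORTED_CLOCKS options are standard nvidia-smi options; the helper names, the GPU indices 0 to 3, and the 1095 MHz value are assumptions for illustration only. Choose a real value from the clocks that the GPU reports as supported.

```python
import shlex

# Lists the clock values the installed GPUs accept; pick the locked value from this list.
LIST_SUPPORTED_CLOCKS_CMD = ["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS"]


def lock_gpu_clock_cmd(gpu_index, mhz):
    """Command that pins one GPU's clock by setting min and max to the same value."""
    return ["nvidia-smi", "-i", str(gpu_index), f"--lock-gpu-clocks={mhz},{mhz}"]


def reset_gpu_clock_cmd(gpu_index):
    """Command that returns one GPU to its default clock behavior."""
    return ["nvidia-smi", "-i", str(gpu_index), "--reset-gpu-clocks"]


def plan(gpu_indices, mhz):
    """One lock command per GPU on the node."""
    return [lock_gpu_clock_cmd(i, mhz) for i in gpu_indices]


if __name__ == "__main__":
    # Hypothetical values: GPU indices 0-3 and 1095 MHz are placeholders, not recommendations.
    for cmd in [LIST_SUPPORTED_CLOCKS_CMD] + plan(range(4), 1095):
        print(shlex.join(cmd))
```

The script only prints the commands so they can be reviewed first. Changing GPU clocks typically requires administrator privileges, and the same setting would need to be applied on every node in the enclosure.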
Additional Information
In specific conditions, SMM2 reports one or two power supply throttle and unthrottle events within one or two seconds when the GPUs on nodes in the same enclosure run at peak load at the same time. When this condition occurs, the throttle and unthrottle events of the CPUs or GPU baseboard are usually accompanied by power supply events.
The GPU continuously monitors its temperature and power while running. It adjusts the clocks upward when possible for higher performance and downward when needed to stay within power and thermal limits.
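The clock, power, and temperature values that the GPU reacts to can be sampled with the nvidia-smi query interface. The sketch below shows one possible query command and a parser for its CSV output. The query fields are standard nvidia-smi fields, while the function name and the sample output values are made up for illustration and are not measurements from an SD650-N V2.

```python
import csv
import io

# Query SM clock (MHz), power draw (W), and temperature (C) for every GPU.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,clocks.sm,power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]

# Hypothetical output of QUERY_CMD, used only to exercise the parser.
SAMPLE_OUTPUT = "0, 1410, 389.52, 61\n1, 1095, 301.10, 55\n"


def parse_query_output(text):
    """Turn the CSV output of QUERY_CMD into one dict per GPU."""
    rows = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, sm_mhz, watts, temp_c = row
        rows.append({
            "index": int(index),
            "sm_mhz": int(sm_mhz),
            "power_w": float(watts),
            "temp_c": int(temp_c),
        })
    return rows


if __name__ == "__main__":
    for gpu in parse_query_output(SAMPLE_OUTPUT):
        print(gpu)
```

Sampling these values while the application runs shows whether the clock is boosting or holding steady. The parser assumes every field is numeric and does not handle fields that the driver reports as unavailable.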
Workload patterns differ between GPU applications. Some applications need the GPU to run at boost clock for short periods of time, while others need the GPU to run at a constant level with no boost.
For applications that need the GPU to run constantly at a high level without intermittent peaks, it is common practice to use nvidia-smi to set a fixed GPU clock frequency while the application is running, which prevents jitter and improves application performance. GPU clock settings can also be used on production systems if needed.
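One way to organize the experiments mentioned in this tip is to run the application at several locked clock values, count the throttle events logged during each run, and keep the highest clock that produced none. The sketch below shows only the selection step. The function name and the trial numbers are hypothetical and are not measured results.

```python
def highest_stable_clock(results):
    """Pick the highest tested clock (MHz) that showed zero throttle events.

    `results` maps each tested clock in MHz to the number of throttle
    events observed during that run. Returns None if every run throttled.
    """
    stable = [mhz for mhz, events in results.items() if events == 0]
    return max(stable) if stable else None


# Hypothetical trial data: clock in MHz -> throttle events seen in SMM2/XCC logs.
TRIAL_RESULTS = {1410: 3, 1305: 1, 1200: 0, 1095: 0}

if __name__ == "__main__":
    print(highest_stable_clock(TRIAL_RESULTS))
```

A real evaluation would also compare application run time at each setting, since the goal is steady performance as well as the absence of throttle events.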