Best Practice for six SD650-N V2 nodes with 400 W GPU running in a DW612 enclosure equipped with six 2400 W power supplies - Lenovo ThinkSystem SD650-N V2

Symptom

When six (6) SD650-N V2 nodes with 400 W GPUs run the same application simultaneously with default settings in a DW612 enclosure equipped with six (6) 2400 W power supplies, the nodes may experience intermittent system throttle events (from either the CPU or the GPU baseboard), accompanied by power supply throttle events on SMM2. Examples of both kinds of events follow.

# SMM2 Events

0807019f | Warning | 2021-05-01 17:18:01 | +0000 | PSU 5 Throttle: Power Supply sensor, transition to Non-Critical from OK was asserted
180701f1 | Warning | 2021-05-01 17:18:02 | +0000 | PSOC Throttle O: Chassis sensor, transition to Non-Critical from OK was asserted
0887019f | Normal | 2021-05-01 17:18:02 | +0000 | PSU 5 Throttle: Power Supply sensor, transition to Non-Critical from OK was deasserted
188701f1 | Normal | 2021-05-01 17:18:02 | +0000 | PSOC Throttle O: Chassis sensor, transition to Non-Critical from OK was deasserted

# XCC Events

I 07/15/2021 14:08:03.759 Sensor GPU Board has transitioned to a less severe state from critical.
I 07/15/2021 14:08:03.658 Sensor GPU Board has transitioned to normal state.
I 07/15/2021 14:07:49.074 The Processor processor 2 is no longer operating in a Degraded State.
I 07/15/2021 14:07:48.509 The Processor processor 1 is no longer operating in a Degraded State.
E 07/15/2021 14:07:43.741 Sensor GPU Board has transitioned to critical from a less severe state.
W 07/15/2021 14:07:43.604 The Processor processor 2 is operating in a Degraded State.
W 07/15/2021 14:07:41.615 The Processor processor 1 is operating in a Degraded State.

(where SMM = System Management Module, XCC = Lenovo XClarity Controller)

Affected Configurations

The system may be any of the following Lenovo servers:

  • ThinkSystem SD650-N V2, Type 7D1N, any model

This tip is not software specific.

This tip is not option specific.

The system has the symptom described above.

Solution

This behavior can be corrected by applying both actions below.

  1. Apply the 3Q21 or later version of Lenovo Scalable Infrastructure (LeSI) Best Recipes including the SD650-N V2 XCC firmware.

    Lenovo Scalable Infrastructure Best Recipes

  2. Use the NVIDIA utility nvidia-smi to lock the GPU clock at a fixed frequency so that the GPU runs at a steady power level without boost. The best frequency depends on the workload and may need to be found by experiment.

Workaround

Use the NVIDIA utility nvidia-smi to lock the GPU clock at a fixed frequency so that the GPU runs at a steady power level without boost. The best frequency depends on the workload and may need to be found by experiment.
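
As a minimal sketch of this step, the Python helpers below assemble the nvidia-smi command lines that lock the GPU clock and later reset it. The helper names and the 1200 MHz value are illustrative assumptions, not Lenovo-recommended settings; the helpers only print the command unless dry_run is turned off. The --lock-gpu-clocks and --reset-gpu-clocks options are available on GPUs and drivers that support clock locking and normally require administrator privileges.

```python
import subprocess


def lock_clock_cmd(mhz, gpu_index=None):
    """Build the nvidia-smi command that pins the GPU clock to one frequency."""
    if mhz <= 0:
        raise ValueError("clock frequency must be positive")
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]  # restrict the change to one GPU
    # Passing the same value as minimum and maximum holds the clock steady.
    cmd.append(f"--lock-gpu-clocks={mhz},{mhz}")
    return cmd


def reset_clock_cmd(gpu_index=None):
    """Build the nvidia-smi command that returns the GPU clock to default behavior."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]
    cmd.append("--reset-gpu-clocks")
    return cmd


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
```

For example, run(lock_clock_cmd(1200)) prints nvidia-smi --lock-gpu-clocks=1200,1200 without touching the GPU. Candidate frequencies for a given GPU can be listed with nvidia-smi -q -d SUPPORTED_CLOCKS before experimenting.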

Additional Information

Under specific conditions, when the GPUs on nodes in the same enclosure run at peak load at the same time, SMM2 reports one or two power supply throttle and unthrottle events within one or two seconds. When this condition occurs, the throttle and unthrottle events of the CPUs or the GPU baseboard are usually accompanied by power supply events.
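
To check how short-lived these power supply events are, a small script can pair each asserted event with its deasserted counterpart. The sketch below assumes the pipe-delimited SMM2 event layout shown in the Symptom section; the function names are hypothetical, and other SMM2 firmware levels may format events differently.

```python
from datetime import datetime

# The four SMM2 events quoted in the Symptom section.
SAMPLE = """\
0807019f | Warning | 2021-05-01 17:18:01 | +0000 | PSU 5 Throttle: Power Supply sensor, transition to Non-Critical from OK was asserted
180701f1 | Warning | 2021-05-01 17:18:02 | +0000 | PSOC Throttle O: Chassis sensor, transition to Non-Critical from OK was asserted
0887019f | Normal | 2021-05-01 17:18:02 | +0000 | PSU 5 Throttle: Power Supply sensor, transition to Non-Critical from OK was deasserted
188701f1 | Normal | 2021-05-01 17:18:02 | +0000 | PSOC Throttle O: Chassis sensor, transition to Non-Critical from OK was deasserted
"""


def parse_line(line):
    """Return (time, sensor, asserted) for a throttle event line, else None."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 5:
        return None
    _event_id, _severity, stamp, _tz, message = fields
    if "Throttle" not in message:
        return None
    sensor = message.split(":", 1)[0]
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    # "deasserted" also ends in "asserted", so test for the longer word.
    asserted = not message.endswith("deasserted")
    return when, sensor, asserted


def throttle_durations(text):
    """Return [(sensor, seconds)] for each assert/deassert pair, in log order."""
    open_events = {}
    durations = []
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, sensor, asserted = parsed
        if asserted:
            open_events[sensor] = when
        elif sensor in open_events:
            start = open_events.pop(sensor)
            durations.append((sensor, (when - start).total_seconds()))
    return durations
```

Applied to the sample events, throttle_durations(SAMPLE) returns [("PSU 5 Throttle", 1.0), ("PSOC Throttle O", 0.0)], which matches the one-to-two-second window described above.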

While running, the GPU constantly monitors its temperature and power. It raises the clocks when possible for higher performance and lowers them when needed to stay within power and thermal limits.
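
This behavior can be observed from the host by sampling the clock, power, and temperature that nvidia-smi reports. The sketch below parses the CSV output of a --query-gpu call; the sample output values are invented for illustration, the function names are hypothetical, and the 400 W threshold is only an assumption based on the GPU rating named in this tip.

```python
import csv
import io

# Command whose output parse_samples() expects (not executed here).
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,clocks.sm,power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]

# Illustrative output only; these numbers are made up for the example.
SAMPLE_OUTPUT = """\
0, 1410, 398.52, 65
1, 1275, 401.10, 67
"""


def parse_samples(text):
    """Turn CSV rows of index, SM clock, power, temperature into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue
        try:
            rows.append({
                "gpu": int(rec[0]),
                "sm_mhz": int(rec[1]),
                "power_w": float(rec[2]),
                "temp_c": int(rec[3]),
            })
        except ValueError:
            # Skip rows with non-numeric fields such as "[N/A]".
            continue
    return rows


def over_power(rows, limit_w):
    """Return the indexes of GPUs drawing more than limit_w watts."""
    return [r["gpu"] for r in rows if r["power_w"] > limit_w]
```

With the sample above, over_power(parse_samples(SAMPLE_OUTPUT), 400.0) returns [1]. On a real node, the text would come from running QUERY_CMD, for example with subprocess.run(QUERY_CMD, capture_output=True, text=True).stdout.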

Workload patterns vary from one kind of GPU application to another. Some applications need the GPU to run at boost clock for short periods, while others need the GPU to run at a constant level without boost.

For applications that need the GPU to run constantly at a high level without intermittent peaks, it is common practice to use nvidia-smi to set a fixed GPU clock frequency while the application is running, which prevents jitter and improves application performance. GPU clock settings can also be used on production systems if needed.

Document ID: HT512757
Original Publish Date: 08/23/2021
Last Modified Date: 07/07/2022