
Lenovo's best practices in response to the Intel® uncorrectable memory error handling on Gen 1, Gen 2 or "H" SKUs of Gen 3 Xeon® Scalable processors

Lenovo's Best Practices for Intel® Uncorrectable Memory Error Handling in Xeon® Scalable Processors

Description

Lenovo has been #1 in reliability for 7 years, and wants to inform its customers of a reduction, inherent in all industry systems using certain generations of Intel® processors, in the error-checking and correcting capabilities available to OEM system vendors. A combination of DDR memory errors and architectural changes in the corrective memory error handling logic of Gen 1 Xeon® Scalable processors (codenamed "Skylake"), Gen 2 Xeon® Scalable processors (codenamed "Cascade Lake"), and Gen 3 Xeon® Scalable processors (codenamed "Cooper Lake") may result in a higher rate of runtime uncorrectable memory errors (UCEs) compared to previous generations of hardware. This is due to changes implemented in Single Device Data Correction (SDDC), a fundamental Intel RAS (Reliability, Availability, Serviceability) feature available on all platforms. As a result of these architectural changes and memory DIMM errors, the set of errors that can be corrected differs between the previous generation of processors and the Xeon® Scalable processor family generation. For more information from Intel®, please see How do I Improve Memory Handling with 1st, 2nd, or 3rd Generation Intel® Xeon® Scalable Processors. This article focuses on key strategies for mitigating DDR uncorrectable errors, which can result in application termination or server crashes.

The issue can be identified by observing uncorrectable memory error or machine check error events reported by a Lenovo ThinkSystem or ThinkAgile product:

XCC event log:

FQXSFMA0002M : An uncorrectable memory error has been detected on DIMM [arg1] at address [arg2]. [arg3]
FQXSFPU0062F : System uncorrectable error happened in Processor [arg1] Core [arg2] MC bank [arg3] with MC Status [arg4], MC Address [arg5], and MC Misc [arg6].
FQXSFPU0027N : System uncorrected recoverable error has occurred on Processor [arg1] Core [arg2] MC bank [arg3] with MC Status [arg4], MC Address [arg5], and MC Misc [arg6].

(where XCC = Lenovo XClarity Controller)
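The MC Status value reported in the FQXSFPU events encodes the standard Intel Machine Check Architecture status fields. As an illustrative sketch (not a Lenovo tool; bit positions follow the IA32_MCi_STATUS layout documented in the Intel Software Developer's Manual, Vol. 3), the raw value can be decoded to distinguish an uncorrected-but-recoverable error (UC=1, PCC=0) from a fatal one:

```python
# Minimal decoder for the raw MC Status value reported in FQXSFPU events.
# Bit positions follow the IA32_MCi_STATUS layout in the Intel SDM, Vol. 3.

FLAGS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # error overflow (an earlier error was overwritten)
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MC Misc register holds additional information
    "ADDRV": 58,  # MC Address register holds a valid address
    "PCC": 57,    # processor context corrupted (not recoverable)
    "S": 56,      # signaled via machine check exception
    "AR": 55,     # recovery action required (SRAR vs. SRAO)
}

def decode_mc_status(status: int) -> dict:
    """Return the flag bits and MCA error code from a raw MC Status value."""
    out = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    out["MCACOD"] = status & 0xFFFF  # architectural error code, bits 15:0
    return out

# Example: a valid, enabled, uncorrected memory error with a valid address
# and PCC=0, i.e. a candidate for OS-level recovery.
status = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 58) | 0x009F
d = decode_mc_status(status)
print(d["UC"], d["PCC"], hex(d["MCACOD"]))
```

A status with UC=1 and PCC=0 is the recoverable case discussed in the MCA Recovery section later in this article; UC=1 with PCC=1 indicates corrupted processor context.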


Applicable Systems

The system may be any of the following Lenovo servers:

Best Practices

ThinkSystem firmware supports RAS features offered by Intel® Xeon® Scalable processors which can greatly reduce the frequency of DDR uncorrectable errors. Therefore, system administrators and operators should take advantage of the RAS features supported by Gen 1/Gen 2/Gen 3 Intel® Xeon® Scalable processors and plan for routine on-target memory tests available within LXPM (Lenovo XClarity Provisioning Manager). The best practices outlined in this article should also apply to future CPU generations supporting memory beyond the DDR4 generation offered with Gen 3 Xeon® Scalable processors (codenamed "Cooper Lake").

Maintain Code Currency

Update production ThinkSystem servers to the firmware stack released in the first quarter of 2021 or later, which ensures that all known Intel and Lenovo firmware fixes have been applied. This can be done by navigating to the Lenovo Support Portal at https://support.lenovo.com and selecting the appropriate Product Group, type of System, Product name, Product machine type, and Operating system.

Plan for On-Target Memory Screening

Plan to run the LXPM Advanced Memory Test at least every 6 months and prior to new system deployment or system maintenance; see HT511056 - LXPM Advanced Memory Test reduces DIMM errors. The following steps should be used when considering this option.

  1. Keep system firmware (UEFI & BMC/XCC) up to date: for best results, ensure that the target system is running the latest firmware, or at minimum the firmware stack released after the first quarter of 2021.
    • Check System Information during POST or select System Summary to check the system's firmware information.
  2. When using the Command Line Interface (CLI) method, refer to the commands below:

    To enable the AMT, run:

    OneCli.exe config set Memory.MemoryTest Enable --imm xcc_user_id:xcc_password@xcc_external_ip
    OneCli.exe config set Memory.AdvMemTestOptions 0xF0000 --override --imm xcc_user_id:xcc_password@xcc_external_ip
    

    To disable the AMT, run:

    OneCli.exe config set Memory.MemoryTest Automatic --imm xcc_user_id:xcc_password@xcc_external_ip
    OneCli.exe config set Memory.AdvMemTestOptions 0 --override --imm xcc_user_id:xcc_password@xcc_external_ip
    
  3. When using the Graphical User Interface (GUI) method, power on the server and press F1 to enter the ThinkSystem UEFI setup menu, XClarity Provisioning Manager.
  4. Select the Diagnostics option from the left-side menu.
  5. Select Run Diagnostics from the Diagnostics screen.
  6. Select Memory Test from the Dashboard.
  7. Select Advanced Memory Test from the Memory Test menu.
  8. After the Advanced Memory Test (AMT) is selected, the system will reboot, and the memory test will run during UEFI POST. This test is very similar to a manufacturing-level test and cannot be disabled until one full test cycle has completed. Rebooting the system in the middle of a test run will restart the memory test from the beginning unless the CMOS battery is removed. The system will then return to the Diagnostics page and, in Graphical System Setup, provide an interface to save system logs.
     
  9. The time required for the test to complete varies from system to system. After the test completes, the system returns to the Memory Test page in LXPM with a prompt to insert a USB drive to save the log file. Insert a USB drive and click Retry to continue.
     
  10. If a user would like to bypass the option to save the test log, then F1 System Setup needs to be configured to run in Text Mode.

    Redfish attribute payload to enable the AMT (to disable it, set "Memory_MemoryTest" to "Automatic" and "Memory_AdvMemTestOptions" to 0):

    {
        "Attributes": {
            "Memory_MemoryTest": "Enabled",
            "Memory_AdvMemTestOptions": 983040
        }
    }

    Note: For more details, please refer to Advanced Memory Test on Xeon-Based ThinkSystem Servers.
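As a sketch, the Redfish attribute payload shown above can be generated programmatically; note that 983040 is simply the decimal form of the 0xF0000 mask used in the OneCli commands. The helper name below is illustrative, not part of any Lenovo tooling:

```python
import json

# Build the Redfish BIOS-settings payload for the Advanced Memory Test.
# Memory_AdvMemTestOptions is the decimal form of the 0xF0000 mask used
# in the OneCli example (983040 == 0xF0000).
def amt_payload(enable: bool) -> str:
    attrs = {
        "Memory_MemoryTest": "Enabled" if enable else "Automatic",
        "Memory_AdvMemTestOptions": 0xF0000 if enable else 0,
    }
    return json.dumps({"Attributes": attrs})

print(amt_payload(True))
```

The resulting JSON would typically be PATCHed to the XCC's Redfish BIOS pending-settings resource (commonly /redfish/v1/Systems/1/Bios/Settings per the DMTF Redfish schema) followed by a reboot; verify the exact resource path on your system.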

Enable Machine Check Architecture (MCA) Recovery and Local Machine Check Exception (LMCE) Recovery

MCA Recovery allows the OS to decide whether an error can be recovered without taking the system down. For more detailed information on MCA Recovery, please see the Additional Information section.

The following steps should be used when considering this option.

  1. When using the CLI method, select “AdvancedRAS.MachineCheckRecovery=Enable”. This feature is enabled by default in UEFI setup.
  2. When using the GUI method:
    1. Power on the server.
    2. Press F1 to enter System Setup, LXPM.
    3. From the left navigation menu, select System Settings, then Recovery and RAS.
    4. Select Advanced RAS.
    5. Enable Machine Check Recovery.

Note: MCA Recovery and Local Machine Check Exception (LMCE) Recovery depend on Operating System support, so consult your OS provider for MCA and LMCE capability, as each Operating System vendor adopts RAS features on its own release cycle. Lenovo platform firmware enables LMCE-based recovery by default, but this setting is not exposed in UEFI Setup. The benefits of LMCE over a broadcast MCE are discussed in the following paper: Handling Local Machine Check Exceptions In Linux.

Windows: For a detailed description of how Windows uses RAS features consult the Windows Hardware Error Architecture (WHEA) design guide. Refer to section “Additional Information” for the list of Supported RAS features by Operating System.

VMware: Machine Check recovery is supported by the kernel in ESXi 5 release and higher. Refer to section “Additional Information” for the list of Supported RAS features by Operating System.

In addition, the user should take advantage of Local Machine Check Exception (LMCE) based recovery, which is enabled by default in ESXi 7.0; see Lenovo ThinkSystem Servers with Intel® Optane™ DC Persistent Memory Module Support.

For the Lenovo ThinkSystem SR850P and SR850, due to a known hardware limitation, the “useLMCE” kernel boot flag must be enabled to support local machine check error recovery with ESXi 6.7 U2 and higher versions.

  • To enable local MCE recovery on an ESXi 6.7 U2 system:
    In the ESXi console, run these two esxcli commands to set the kernel boot option useLMCE to TRUE, then reboot the system for the change to take effect.
     esxcli system settings kernel set -s useLMCE -v TRUE
     /sbin/reboot
    
    After the reboot, verify that the setting took effect by running this command:
     esxcli system settings kernel list -o useLMCE
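As a convenience, the verification output can also be checked programmatically. The column layout assumed below (Name/Type/Configured/Runtime/…) reflects typical esxcli table output and may differ between ESXi builds, so treat this as a hedged sketch:

```python
# Sketch of an automated check on the output of
#   esxcli system settings kernel list -o useLMCE
# The column order (Name, Type, Configured, Runtime, ...) is an assumption
# based on typical esxcli table output; adjust to match your ESXi build.
def lmce_enabled(esxcli_output: str) -> bool:
    for line in esxcli_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "useLMCE":
            # fields: Name, Type, Configured, Runtime, ...
            return fields[2].upper() == "TRUE"
    return False

sample = """Name     Type  Configured  Runtime  Default
-------  ----  ----------  -------  -------
useLMCE  Bool  TRUE        TRUE     FALSE"""
print(lmce_enabled(sample))
```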

Linux: Refer to the “Additional Information” section for the list of supported RAS features by Operating System. A kernel support matrix for MCA recovery by major Linux vendors is available in the source below.

Source: Engineering Practice to Reduce Server Crash Rate from DDR Uncorrectable Errors (UCE) in Hyperscale Cloud Data Center, see Engineering Practice to Reduce Server Crash

Keep Patrol Scrub Enabled

To avoid an accumulation of soft errors, which can turn into uncorrectable errors (UCEs), the Intel chipset has a built-in memory scrubbing engine. It reads data from each DDR memory location, corrects any bit errors with an error-correcting code (ECC), then writes the corrected data back to the same location. Patrol scrubbing is set to a 24-hour interval, during which each address is checked.

  • When using the CLI method, select “Memory.PatrolScrub=Enable”. This feature is enabled by default in UEFI setup.
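To see why a 24-hour patrol interval is inexpensive, here is a back-of-envelope sketch of the background read bandwidth it implies (illustrative arithmetic only, not a vendor formula):

```python
# Back-of-envelope: read bandwidth the patrol scrubber needs in order to
# touch every address of a given memory size once per 24-hour interval.
def scrub_rate_mib_per_s(capacity_gib: float, interval_hours: float = 24.0) -> float:
    bytes_total = capacity_gib * 1024**3
    seconds = interval_hours * 3600
    return bytes_total / seconds / 1024**2

# e.g. scrubbing a 1536 GiB configuration in 24 h needs only ~18 MiB/s of
# background reads, a negligible fraction of DDR4 bandwidth.
print(round(scrub_rate_mib_per_s(1536), 1))
```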

Disable Cold Boot Fast

Force memory training upon each reboot by disabling Cold Boot Fast; note that this will increase system boot time during POST. The purpose of Cold Boot Fast is to skip memory training if no configuration change has been detected for the past 90 days, which improves system boot time. Disabling Cold Boot Fast allows the memory interface to be retrained, compensating for any significant changes in environmental conditions.

  • When using the CLI method, select “Memory.ColdBootFast=Disable”.
  • This feature is enabled by default in UEFI setup.

Take advantage of Post Package Repair

This is an industry-led feature defined by JEDEC that enables boot-time Post Package Repair (PPR) to replace a row within a DRAM that is determined to be faulty. The intent of the feature is to reduce DIMM replacements in the field due to the presence of bad cells. During runtime, a DIMM experiencing correctable faults can be scheduled to have a PPR performed on a subsequent boot cycle. The DRAM experiencing the fault will have the faulty row internally replaced by a spare row within the same DRAM. This PPR corrective fusing process is permanent.

For example, if your system asserted a runtime PFA (Predictive Failure Alert), then upon the next reboot cycle UEFI will attempt a repair. This is indicated by a “Self-Heal” message in the event log; after completion, the PFA will be de-asserted.

  • This feature is enabled by default in UEFI setup.

Set System Operating Mode to Maximum Performance

In some situations, disabling power management policies in system UEFI and the vSphere client has resolved intermittent 'Uncorrectable Bus Errors', system reboots, and memory errors.

  • When using the CLI method, select “OperatingModes.ChooseOperatingMode=Maximum Performance”. To set it, run:
    OneCli.exe config set OperatingModes.ChooseOperatingMode "Maximum Performance" --imm xcc_user_id:xcc_password@xcc_external_ip

For reference, see System tuning for VMware on x86 Servers and ThinkSystem, and Recommended UEFI settings - Lenovo ThinkAgile HX systems.

Enable Address Range Mirroring / Partial Memory Mirroring

Address Range Mirroring is a RAS feature available on the Intel Xeon Scalable Family platforms which allows granular control over how much memory is allocated for redundancy. The following steps should be used when considering this option. For more detailed information on Address Range Mirroring, please see the Additional Information section.

  1. When using the CLI method, select “Memory.MirrorMode=Partial” and “Memory.Mirrorbelow4GB=Enable”.
  2. When Address Range Mirroring is enabled, memory content is duplicated on the remote DIMM in the partition. This means that not all system memory will be available to the Operating System. For example, with partial mirroring enabled, UEFI dedicates a fixed 36 GB of memory to the mirror per physical processor.
  3. Follow the steps below to enable the Partial Mirror Mode for memory redundancy:
    1. Power on the server.
    2. Press the F1 key to enter LXPM.
    3. Select UEFI Setup on the left navigation menu.
    4. Select System Settings.
    5. Select Memory in the center pane.
    6. Scroll down to the bottom and select Mirror Configuration.
    7. Set Mirror Mode to Partial and enable Mirror below 4 GB to ensure memory mirroring includes low address ranges.

      Note: Mirror below 4 GB is shared with MM Config Base, for which the default setting is 3 GB. In this example, we enabled Mirror below 4 GB.

    8. Save the configuration and exit the UEFI setup menu.
  4. The memory mirror information is shown on the system boot screen. The usable memory capacity is reduced according to the configuration set in UEFI. In this example, independent memory mode would report 1536 GB, while Address Range Mirroring mode reduces it to a usable capacity of 1461 GB = 1536 (Total) - 36 (CPU1) - 36 (CPU2) - 3 (MM Config).
  5. Note: After Partial Memory Mirroring is set in UEFI, one can use “esxcli hardware memory get” to verify that Reliable Memory is in use and is over 0 Bytes.
    Refer to the example below:
    Before turning on address range partial memory mirroring:
    [root@h2:~] esxcli hardware memory get
       Physical Memory: 549657530368 Bytes
       Reliable Memory: 0 Bytes
       NUMA Node Count: 2
    After turning on address range partial memory mirroring:
    [root@h2:~] esxcli hardware memory get
       Physical Memory: 480938061824 Bytes
       Reliable Memory: 68619579392 Bytes
       NUMA Node Count: 2
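The usable-capacity arithmetic from the example above (a fixed 36 GB mirror per processor plus the 3 GB MM Config reservation) can be sketched as:

```python
# Usable-capacity arithmetic for partial (address range) mirroring, per the
# example in this article: a fixed 36 GB mirror per physical processor plus
# the MM Config Base reservation (3 GB by default).
def usable_gb(total_gb: int, sockets: int,
              mirror_per_cpu_gb: int = 36, mm_config_gb: int = 3) -> int:
    return total_gb - sockets * mirror_per_cpu_gb - mm_config_gb

# 1536 GB total on a two-socket system -> 1536 - 36 - 36 - 3
print(usable_gb(1536, sockets=2))
```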

Additional Information

Supported RAS features by Operating System*

The tables below show when Operating System vendors first adopted individual RAS features, which can be used to improve system stability and resiliency against hardware errors.

* The tables below cover the major Operating System vendors.

Supported RAS Features on Windows Server      | WS2016 | WS2019 | WS2022 | All Future Versions
MCA2.0 Recovery-Execution path                |   X    |   X    |   X    |         X
MCA2.0 Recovery-Non-Execution path            |   X    |   X    |   X    |         X
Local Machine (LMCE) based Recovery-Execution |        |   X    |   X    |         X
Address Range/Partial Mirroring               |        |        |   X    |         X

Supported RAS Features on VMware              | 5 GA | 5.5 | 6 GA | 6.5-6.7 (all) | 7.0 (all) | All Future Versions
MCA2.0 Recovery-Execution path                |  X   |  X  |  X   |       X       |     X     |         X
MCA2.0 Recovery-Non-Execution path            |  X   |  X  |  X   |       X       |     X     |         X
Local Machine (LMCE) based Recovery-Execution |      |     |      |       X       |     X     |         X
Address Range/Partial Mirroring               |      |  X  |  X   |       X       |     X     |         X

Supported RAS Features on RHEL                | 7.2 | 7.3 | 7.4 (all) | 8.x (all) | 9.x (all) | All Future Versions
MCA2.0 Recovery-Execution path                |  X  |  X  |     X     |     X     |     X     |         X
MCA2.0 Recovery-Non-Execution path            |  X  |  X  |     X     |     X     |     X     |         X
Local Machine (LMCE) based Recovery-Execution |     |  X  |     X     |     X     |     X     |         X
Address Range/Partial Mirroring               |     |     |     X     |     X     |     X     |         X

Supported RAS Features on SUSE                | 11.04 | 12 GA | 12 SP3 | 12 SP4 (all) | 15 (all) | All Future Versions
MCA2.0 Recovery-Execution path                |   X   |   X   |   X    |      X       |    X     |         X
MCA2.0 Recovery-Non-Execution path            |   X   |   X   |   X    |      X       |    X     |         X
Local Machine (LMCE) based Recovery-Execution |       |       |   X    |      X       |    X     |         X
Address Range/Partial Mirroring               |       |       |        |      X       |    X     |         X

Supported RAS Features on Ubuntu              | 14.04 | 16.04 | 18.04 (all) | 20.04 (all) | 21.04 (all) | All Future Versions
MCA2.0 Recovery-Execution path                |   X   |   X   |      X      |      X      |      X      |         X
MCA2.0 Recovery-Non-Execution path            |   X   |   X   |      X      |      X      |      X      |         X
Local Machine (LMCE) based Recovery-Execution |       |   X   |      X      |      X      |      X      |         X
Address Range/Partial Mirroring               |       |   X   |      X      |      X      |      X      |         X

MCA Recovery

The Intel Xeon Scalable Family processors support recovery from some memory errors based on the Machine Check Architecture (MCA) Recovery mechanism. This requires the OS to declare a memory page “poisoned”, kill the processes associated with that page, and avoid using the page in the future. The MCA mechanism is used to detect, signal, and record machine fault information. Some of these faults are correctable, whereas others are uncorrectable. The MCA mechanism is intended to assist CPU designers and CPU debuggers in diagnosing, isolating, and understanding processor failures. It is also intended to help system administrators detect transient and age-related failures suffered during long-term operation of the server. The MCA Recovery feature is part of the fault-tolerant capabilities of servers based on the Intel Xeon Scalable Family processors, such as the ThinkSystem portfolio of servers. These capabilities allow systems to continue to operate when an uncorrected error is detected; without them, the system would crash and might require hardware replacement or a system reboot.

MCA Recovery allows the OS to decide whether the error can be recovered without taking the system down, if the following preconditions are met:

  • Memory UCE is non-Fatal Error
  • Memory Failure address isn’t in kernel space
  • The impacted application can be killed by the host OS.
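These preconditions amount to a simple predicate; the helper below is an illustrative sketch, not an OS interface:

```python
# Illustrative predicate combining the three preconditions listed above for
# OS-level MCA recovery of a memory uncorrectable error (UCE).
def os_can_recover(fatal: bool, in_kernel_space: bool, app_killable: bool) -> bool:
    """True only when the UCE is non-fatal, outside kernel space, and the
    impacted application can be killed by the host OS."""
    return (not fatal) and (not in_kernel_space) and app_killable

print(os_can_recover(fatal=False, in_kernel_space=False, app_killable=True))
```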

Figure below shows the system error handling flow with a Linux operating system.

[Figure: system error handling flow with a Linux operating system]

Source: see URL LP0778 - Demonstrating the Memory RAS Features of Lenovo ThinkSystem Servers
Software Recoverable Action Required (SRAR): there are two types of such errors, those detected by the Data Cache Unit (DCU) and those detected by the Instruction Fetch Unit (IFU); this is also known as the MCA recovery execution path.
Software Recoverable Action Optional (SRAO): there are two types of such errors, those detected by the memory patrol scrubber and those detected by Last Level Cache (LLC) explicit writeback transactions; this is also known as the MCA recovery non-execution path.

When an SRAR or SRAO error occurs, MCA recovery is triggered. If the kernel can perform a successful recovery by terminating the application or Virtual Machine that consumed the memory uncorrectable error, the system should stay online, provided no additional Uncorrectable Errors are detected.
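The SRAR/SRAO definitions reduce to a small mapping from detection source to recovery class (illustrative only; the key names are informal labels, not Intel identifiers):

```python
# Mapping of detection source to MCA recovery class, per the SRAR/SRAO
# definitions in this article. Key names are informal labels chosen for
# this sketch, not Intel register identifiers.
RECOVERY_CLASS = {
    "DCU": "SRAR",                   # data cache unit -> execution path
    "IFU": "SRAR",                   # instruction fetch unit -> execution path
    "patrol_scrub": "SRAO",          # non-execution path
    "llc_explicit_writeback": "SRAO",  # non-execution path
}

print(RECOVERY_CLASS["IFU"], RECOVERY_CLASS["patrol_scrub"])
```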


Source: Engineering Practice to Reduce Server Crash Rate from DDR Uncorrectable Errors (UCE) in Hyperscale Cloud Data Center, see URL: Intel® Engineering Practice to Reduce Server Crash Rate

Address Range Mirroring / Partial Memory Mirroring

Address Range Mirroring is a new memory RAS feature on the Intel Xeon Scalable Family platform that allows greater granularity in selecting how much memory is dedicated for redundancy. Memory mirroring implementations (full mirror mode or address range mode) are designed to allow the mirroring of critical memory regions to increase the stability of physical memory. The mirrored memory is transparent to the OS and applications. The illustration below shows Address Range Mirroring in practice, where the green address range and orange address range are mirrored.

[Figure: Address Range Mirroring with two mirrored address ranges]

The Intel Xeon Silver SKUs and above support up to two mirror ranges per socket, one mirror range per integrated Memory Controller (iMC). The range is defined by the value programmed in the Target Address Decoder 0 (TAD0) register for the server. TAD0 defines the size of the primary and secondary mirror ranges. The secondary mirror range is reserved for redundancy and is not reported in the total memory size. To enable Address Range Mirroring, a Control and Status Register (CSR) bit enables TAD0 use for mirroring.

Address Range Mirroring offers the following benefits:

  • Provides further granularity to memory mirroring by allowing the firmware or OS to determine a range of memory addresses to be mirrored, leaving the rest of the memory in the socket in non-mirror mode.
  • Reduces the amount of memory reserved for redundancy.
  • Improves high availability, avoiding uncorrectable errors in the kernel memory of the Operating System by allocating all kernel memory from the mirrored memory.

Address Range Mirroring has the following OS and firmware requirements:

  • The system boot mode must be set to 'UEFI Boot'.
  • Requires OS support to fully utilize Address Range Mirroring.
  • The OS must be aware of the mirrored region.
  • Dependency on system firmware to configure the Address Range Mirroring:
    • Using UEFI Setup to enable Address Range Mirroring with a fixed mirror size. ThinkSystem servers shipped with Gen 1, Gen 2, and Gen 3 Intel Xeon processors support mirror mode configuration through the UEFI Setup page, as outlined earlier.
    • Using OS setup commands such as “efibootmgr” and the “kernelcore=mirror” kernel parameter to configure Address Range Mirroring with a different mirror size through the firmware-OS interface. ThinkSystem servers shipped with Gen 1, Gen 2, and Gen 3 Intel Xeon processors have basic support, and there is a plan for full support in a future generation of platforms, which will allow the OS to request a percentage of memory to be mirrored based on its unique needs.
Document ID:HT512486
Original Publish Date:06/07/2021
Last Modified Date:04/01/2025