Abstract

If multiple drives in a Distributed RAID array fail concurrently, and at least one of those drives is replaced, multiple node warmstarts may be seen due to APAR HU01792.

This issue only occurs on systems running 7.8.1.5, 8.1.1.1 or 8.1.2.0 software.

Content

Recovery from a failed drive in a Distributed RAID array consists of two phases: rebuild (where data is automatically rewritten to rebuild areas on other drives in the array), and copyback (where that data is copied to a new drive, after the failed drive is physically replaced).

If more drives have failed than there are rebuild areas in the array, the copyback operates in a degraded mode. In affected software versions, this degraded copyback will fail, leading to multiple node warmstarts and temporary loss of access to data.

Fix

Systems that run an affected software version and use Distributed RAID should be upgraded to 7.8.1.6, 8.1.1.2 or 8.1.2.1 to prevent this issue.

Workaround

Until the system is upgraded, take care when replacing failed drives in a Distributed RAID array.

The GUI shows a "Rebuild Areas total" value for each array.

  • If the number of failed drives in the array is less than the rebuild areas total, the drive can be replaced as normal.
  • If the number of failed drives in the array is equal to or greater than the rebuild areas total, upgrade the software to a fixed version urgently, before replacing the drive. When the upgrade has completed, the drive can be replaced without risk of triggering this issue.
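
The decision rule above can be expressed as a short check. The following Python sketch is illustrative only: the function name, argument names, and return strings are hypothetical and are not part of any product tool, and the failed-drive count and "Rebuild Areas total" value are assumed to have been read manually from the GUI. Software versions not listed as affected in this document are treated as not exposed to this issue.

```python
# Affected software versions, as listed in this document.
AFFECTED_VERSIONS = {"7.8.1.5", "8.1.1.1", "8.1.2.0"}


def replacement_advice(software_version: str,
                       failed_drives: int,
                       rebuild_areas_total: int) -> str:
    """Return "replace" or "upgrade-first" for one Distributed RAID array.

    Hypothetical helper: the inputs are values read by hand from the GUI,
    not the output of any real command.
    """
    if failed_drives < 0 or rebuild_areas_total < 0:
        raise ValueError("counts must be non-negative")

    # Versions outside the affected list are not exposed to this issue.
    if software_version not in AFFECTED_VERSIONS:
        return "replace"

    # Fewer failed drives than rebuild areas: replace the drive as normal.
    if failed_drives < rebuild_areas_total:
        return "replace"

    # Equal or more failed drives than rebuild areas: upgrade before replacing.
    return "upgrade-first"


if __name__ == "__main__":
    # Example: an affected system with 2 failed drives and 1 rebuild area.
    print(replacement_advice("8.1.1.1", failed_drives=2, rebuild_areas_total=1))
```

For example, an array on 8.1.1.1 with two failed drives and a rebuild areas total of 1 falls under the second bullet, so the software should be upgraded before either drive is replaced.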