Troubleshooting: Disk(s) taken offline

Symptoms

One or more disks were taken offline by ASM.

Solution

  1. Bring the disk(s) online as soon as possible to restore full redundancy (unless the problem recurs after the disks are brought online). Run the commands as user grid:
    • to online disks attached to a particular node: flashgrid-node online
    • to online a particular disk: asmcmd online -G DGNAME -D 'HOSTNAME$DISKNAME'
    • to online all disks in a disk group: asmcmd online -G DGNAME -a
  2. Wait for the disk group resync to complete and confirm that the disk stays online by running flashgrid-dg
  3. Determine the cause of the problem to prevent it from recurring
    • Check /opt/flashgrid-diags/logs/flashgrid-node-monitor-all.log on the node that the disk belongs to for clues about why the disk went offline.
    • Use the table below to determine the likely cause of the disk(s) going offline.
    • Collect cluster diags (flashgrid-diags --all), upload them to FlashGrid support, and contact FlashGrid support with a description of the problem.
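
The three online commands in step 1 differ only in scope (node, single disk, whole disk group). The sketch below is a dry-run helper that prints the matching command rather than executing it, so it can be reviewed before being run as user grid. The function name online_cmd and the sample names DATA and RAC1$XVDBA are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Dry-run helper: prints the online command for a given scope instead of running it.
# online_cmd is a hypothetical name; DGNAME and 'HOSTNAME$DISKNAME' are supplied by the caller.
online_cmd() {
    case "$1" in
        node)  echo "flashgrid-node online" ;;
        disk)  echo "asmcmd online -G $2 -D '$3'" ;;
        group) echo "asmcmd online -G $2 -a" ;;
        *)     echo "usage: online_cmd node | disk DGNAME 'HOSTNAME\$DISKNAME' | group DGNAME" >&2
               return 1 ;;
    esac
}

# Example: print the commands for a disk group called DATA (hypothetical names).
online_cmd node
online_cmd disk DATA 'RAC1$XVDBA'
online_cmd group DATA
```

Single quotes around 'HOSTNAME$DISKNAME' matter: without them the shell would try to expand $DISKNAME as a variable.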

Determining the cause of the problem

Several failure types may cause ASM disk(s) to be taken offline. The table below, together with the log files, can help determine the exact cause of the problem.

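Symptoms                                   | Possible causes
-------------------------------------------+--------------------------------------------------------------------
A single disk is offline                   | Disk problem (may be transient)
                                           | Stability problem on the node where the disk is attached
                                           | Network disruption (less likely)
-------------------------------------------+--------------------------------------------------------------------
Multiple disks belonging to the same node  | Stability problem on the node where the offline disks are attached
are offline, but some disks on the same    | Network disruption (less likely)
node are online                            |
-------------------------------------------+--------------------------------------------------------------------
All disks belonging to the same node are   | Stability problem on the node where the offline disks are attached
offline                                    | Network disruption
-------------------------------------------+--------------------------------------------------------------------
Multiple disks belonging to 2 or more      | Stability problem on one of the nodes
different nodes are offline                | Network disruption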

Notes:

  • A single disk going offline may be caused by a transient error that will not repeat. However, if the same disk is taken offline more than once, the disk may be damaged and might need replacement.
  • A stability problem on a node may be caused by heavy swapping, out-of-memory conditions, excessive CPU load, or kernel problems that do not result in an immediate node crash but severely impact performance.
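
As a rough first check for the node stability causes listed above, the following sketch uses only standard Linux interfaces (/proc/meminfo, /proc/loadavg, dmesg). It is not part of the FlashGrid tooling, and the function name swap_used_pct is hypothetical. It reports swap usage, load average, and any kernel out-of-memory messages; interpreting the numbers is left to the operator.

```shell
#!/bin/sh
# Percentage of swap in use, given SwapTotal and SwapFree in kB.
# swap_used_pct is a hypothetical helper name, not a FlashGrid command.
swap_used_pct() {
    st=${1:-0}
    sf=${2:-0}
    if [ "$st" -eq 0 ]; then
        echo 0          # no swap configured
        return
    fi
    echo $(( (st - sf) * 100 / st ))
}

# Read current values from /proc (Linux only).
if [ -r /proc/meminfo ] && [ -r /proc/loadavg ]; then
    swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
    swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
    echo "swap used: $(swap_used_pct "$swap_total" "$swap_free")%"
    echo "load average (1/5/15 min): $(cut -d' ' -f1-3 /proc/loadavg)"
fi

# Kernel out-of-memory messages, if any (reading dmesg may require root).
dmesg 2>/dev/null | grep -i 'out of memory' || echo "no OOM messages found in dmesg"
```

High swap usage or a load average well above the CPU count at the time the disks went offline would be consistent with a node stability problem rather than a disk or network fault.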