SkyCluster Maintenance Tasks in Azure


Rebooting a node

To reboot a node in a running cluster:

  1. Make sure no other node is offline or re-syncing. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  2. If the node is a database node, stop all local database instances running on the node.
  3. Reboot the node using the flashgrid-node command. It will gracefully take the corresponding failure group offline.

    # flashgrid-node reboot
  4. After the node boots up, wait until re-synchronization completes for all disk groups before rebooting or powering off any other node.
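The health check in steps 1 and 4 can be scripted. The following is a minimal sketch that polls for resync activity; it assumes the flashgrid-cluster summary prints a Resync column with Yes/No values, so verify the exact output format on your cluster before relying on it:

```shell
# Return success (0) if the given status text shows resync in progress.
# The pattern assumes lines like "... Resync = Yes"; adjust as needed.
resync_in_progress() {
  printf '%s\n' "$1" | grep -qE 'Resync[[:space:]=:]+Yes'
}

# Poll the cluster status until no disk group is re-syncing.
wait_for_resync() {
  while resync_in_progress "$(flashgrid-cluster)"; do
    echo "Resync still running; waiting 30s..."
    sleep 30
  done
  echo "All disk groups synchronized."
}
```

Run wait_for_resync after the node is back up, and only then move on to the next node.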

Powering off a node

To power off a node in a running cluster:

  1. Make sure no other node is offline or re-syncing. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  2. If the node is a database node, stop all local database instances running on the node.
  3. Stop FlashGrid services on the node. This gracefully takes the corresponding failure group offline, stops Oracle CRS, and stops the FlashGrid services.

    # flashgrid-node stop
  4. Stop the node VM using the Azure console.
  5. After powering up the node, wait until re-synchronization completes for all disk groups before rebooting or powering off any other node.
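The sequence above can also be sketched from the command line. Steps 1-3 run on the node itself; step 4 uses the Azure CLI, where az vm deallocate is the equivalent of stopping the VM in the portal. The resource group and VM names below are placeholders:

```shell
# Steps 1-3, run on the node being powered off
# (on a database node, stop local database instances first):
flashgrid-cluster            # verify: zero offline disks, Resync = No
sudo flashgrid-node stop     # takes the failure group offline gracefully

# Step 4, from any machine with a logged-in Azure CLI
# (placeholders: resource group "rac-cluster", VM "rac2"):
az vm deallocate --resource-group rac-cluster --name rac2
```

Deallocating, rather than merely stopping, also releases compute charges for the VM.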

Shutting down an entire cluster

To shut down an entire cluster:

  1. Stop all running databases.
  2. Stop Oracle cluster services on all nodes.

    # crsctl stop cluster -all
  3. Stop all cluster node VMs using the Azure console.
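As a sketch, the whole shutdown can be driven from one node plus an Azure CLI session. The database name, resource group, and VM names below are placeholders; srvctl stop database stops all instances of a clustered database at once:

```shell
# 1. Stop all databases registered with Oracle Clusterware
#    (repeat for each database; "orcl" is a placeholder):
srvctl stop database -d orcl

# 2. Stop Oracle cluster services on all nodes (as root, on one node):
crsctl stop cluster -all

# 3. Deallocate every node VM, including the quorum node
#    (placeholders: resource group "rac-cluster", VMs rac1/rac2/racq):
for vm in rac1 rac2 racq; do
  az vm deallocate --resource-group rac-cluster --name "$vm"
done
```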

Adding disks for use in ASM

When adding new disks, make sure that each disk group contains disks of the same size and that each node contributes the same number of disks.

To add new disks in a running cluster:

  1. Create and attach new disks to the database node VMs. Attach the disks using LUN numbers 1 through 40; these LUNs are shared automatically by FlashGrid Storage Fabric.

    Note: Read-only caching must be enabled for all new disks. Read-Write and None modes are not supported and may create reliability problems.

  2. Confirm the FlashGrid names of the new drives, e.g. rac2.lun3:

    $ flashgrid-cluster drives
  3. Add the new disks to an existing disk group (or create a new disk group). Example:

    $ flashgrid-dg add-disks -G MYDG -d /dev/flashgrid/rac[12].lun[3-5]
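As an illustration, the disks in step 1 can be created and attached with the Azure CLI instead of the portal. az vm disk attach can create a new disk and set the required ReadOnly host caching in one call; the resource group, disk names, size, SKU, and LUN below are placeholders to adapt:

```shell
# Create and attach one new 512 GiB Premium SSD per database node,
# on LUN 3, with the mandatory read-only host caching:
for vm in rac1 rac2; do
  az vm disk attach \
    --resource-group rac-cluster \
    --vm-name "$vm" \
    --name "${vm}-data-lun3" \
    --new \
    --size-gb 512 \
    --sku Premium_LRS \
    --lun 3 \
    --caching ReadOnly
done
```

Keeping the disk name tied to the VM and LUN, as above, makes later removal less error-prone.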

Removing disks

To remove disks from a running cluster:

  1. Determine FlashGrid names of the drives to be removed, e.g. rac2.lun3:

    $ flashgrid-cluster drives
  2. If the drives are members of an ASM disk group, drop them from the disk group. Example:

     SQL> alter diskgroup MYDG
            drop disk RAC1$LUN3
            drop disk RAC2$LUN3
            rebalance wait;
  3. Prepare the disks for removal. Example:

    [fg@rac1 ~] $ sudo flashgrid-node stop-target /dev/flashgrid/rac1.lun3
    [fg@rac2 ~] $ sudo flashgrid-node stop-target /dev/flashgrid/rac2.lun3
  4. Detach the disks from the VMs.
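Step 4 can likewise be done with the Azure CLI. The resource group, VM, and disk names below are placeholders and must match the disks you stopped in step 3:

```shell
# Detach the removed data disks from their VMs:
az vm disk detach --resource-group rac-cluster --vm-name rac1 --name rac1-data-lun3
az vm disk detach --resource-group rac-cluster --vm-name rac2 --name rac2-data-lun3
```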

Re-adding a lost disk

ASM will drop a disk from a disk group if the disk stays offline for longer than the disk repair time. If the disk was taken offline because of an intermittent problem, for example a network issue, you can re-add it to the disk group. The force option must be used when re-adding the disk because it already contains ASM metadata.

Example of re-adding a regular disk:

   # flashgrid-dg add-disks -G MYDG -d /dev/flashgrid/rac2.lun3 -f

Example of re-adding a quorum disk:

   # flashgrid-dg add-disks -G MYDG -q /dev/flashgrid/racq.lun2 -f
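Before planned maintenance it can help to check, and if necessary extend, the ASM disk repair timer so that an offline disk is not dropped mid-procedure. A sketch using sqlplus, run as the Grid Infrastructure owner; MYDG is a placeholder disk group, and the default repair time varies by Oracle version:

```shell
sqlplus -s / as sysasm <<'EOF'
-- Show the current repair time per disk group:
select dg.name, a.value
from v$asm_attribute a
join v$asm_diskgroup dg on a.group_number = dg.group_number
where a.name = 'disk_repair_time';

-- Extend the timer for a longer maintenance window:
alter diskgroup MYDG set attribute 'disk_repair_time' = '8.5h';
EOF
```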

Updating FlashGrid software

The following procedure applies to minor updates, that is, updates where the first two numbers of the version stay the same, for example from 17.05.31 to 17.05.50. An update from 17.05 to 17.10 is considered major and may require a different procedure; contact FlashGrid support for assistance with major version updates.

To update the flashgrid-sf and/or flashgrid-clan RPMs on a running cluster, repeat the following steps on each node, one node at a time:

  1. Make sure there are no other nodes that are in offline or re-syncing state. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  2. If the node is a database node,

    a. Stop all local database instances running on the node.

    b. Stop Oracle CRS on the node:

    # crsctl stop crs
  3. Stop the FlashGrid Storage Fabric services on the node:

    # flashgrid-node stop
  4. Stop the FlashGrid Cloud Area Network service on the node:

    # systemctl stop flashgrid-clan
  5. Update the flashgrid-sf, flashgrid-clan, and flashgrid-diags RPMs on the node using the yum or rpm tool.
  6. Start the FlashGrid Cloud Area Network service on the node:

    # systemctl start flashgrid-clan
  7. Start the FlashGrid Storage Fabric service on the node:

    # flashgrid-node start
  8. Restart the FlashGrid Monitoring service on the node:

    # systemctl restart flashgrid-node-monitor
  9. If the node is a database node, start Oracle services on the node:

    # systemctl start oracle-ohasd
    # crsctl start crs -wait
  10. Wait until all disks are back online and resyncing operations complete on all disk groups before updating the next node. All disk groups must have zero offline disks and Resync = No.

    # flashgrid-cluster
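For a database node, steps 2 through 9 condense into the following per-node sequence. This is a sketch, run as root after stopping the local database instances, and it assumes yum can reach your FlashGrid package repository:

```shell
# Per-node rolling update: stop the stack top-down, update, start bottom-up.
set -euo pipefail

crsctl stop crs                  # stop Oracle CRS (DB instances already stopped)
flashgrid-node stop              # take the failure group offline gracefully
systemctl stop flashgrid-clan    # stop the Cloud Area Network service

yum update -y flashgrid-sf flashgrid-clan flashgrid-diags

systemctl start flashgrid-clan
flashgrid-node start
systemctl restart flashgrid-node-monitor

systemctl start oracle-ohasd     # bring Oracle services back up
crsctl start crs -wait
```

Remember to confirm zero offline disks and Resync = No in flashgrid-cluster before repeating this on the next node.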

To update the flashgrid-diags RPM on a node:

  1. Update the flashgrid-diags RPM using the yum or rpm tool.

  2. Restart the flashgrid-node-monitor service (this can be done without stopping any Oracle services):

    # systemctl restart flashgrid-node-monitor

Updating Linux kernel

To update the Linux kernel on a running cluster, repeat the following steps on each node, one node at a time:

  1. Install the new kernel and the corresponding kernel-devel package:

    # yum update kernel kernel-devel
  2. Install the LIS (Linux Integration Services) kmod package for the new kernel.
  3. Follow the steps in "Rebooting a node" above.