SkyCluster Maintenance Tasks in Azure


Rebooting one node

Note: Do not use this procedure if you need to restart the entire cluster. Instead, see the Restarting the entire cluster procedure below.

To reboot one node in a running cluster

  1. Make sure there are no other nodes that are in offline or re-syncing state. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  2. If the node is a database node, stop all local database instances running on the node.
  3. Reboot the node using the flashgrid-node command. This gracefully takes the corresponding failure group offline.

    # flashgrid-node reboot
  4. After the node boots up, wait until re-synchronization completes for all disk groups before rebooting or powering off any other node.
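
For step 2, a minimal example of stopping a local instance with srvctl, assuming a database named orcl with instance orcl1 running on this node (placeholder names; run as the Oracle software owner):

    $ srvctl stop instance -d orcl -i orcl1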

Restarting the entire cluster

In some cases it may be desirable to restart all nodes of the cluster simultaneously instead of rebooting them one by one.

Note: Do not reboot all nodes simultaneously using the reboot or flashgrid-node reboot commands. Doing so may cause CRS to fail to start if one node goes down while CRS is already starting on another node.

To restart the entire cluster

  1. Stop all running databases.

  2. Stop Oracle cluster services on all nodes.

    # crsctl stop cluster -all

  3. Stop all cluster node VMs using Azure console.

  4. Start all cluster node VMs using Azure console (an Azure CLI example follows this list).
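
As an alternative to the Azure console in steps 3 and 4, the node VMs can be stopped and started with the Azure CLI. A minimal sketch, assuming a resource group named myrg and node VMs named rac1, rac2, and racq (replace with your actual names):

    $ for vm in rac1 rac2 racq; do az vm deallocate --resource-group myrg --name $vm; done

After all VMs are deallocated, start them again:

    $ for vm in rac1 rac2 racq; do az vm start --resource-group myrg --name $vm; done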

Powering off one node

To power off one node in a running cluster

  1. Make sure there are no other nodes that are in offline or re-syncing state. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  2. If the node is a database node, stop all local database instances running on the node.
  3. Stop FlashGrid services on the node. This gracefully takes the corresponding failure group offline, stops Oracle CRS, and stops the FlashGrid services.

    # flashgrid-node stop
  4. Stop the node VM using Azure console.
  5. After powering up the node, wait until re-synchronization completes for all disk groups before rebooting or powering off any other node.
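
For step 4, a minimal Azure CLI alternative to the console, assuming a resource group named myrg and a node VM named rac1 (placeholder names):

    $ az vm deallocate --resource-group myrg --name rac1

The same VM can later be powered back up with az vm start --resource-group myrg --name rac1.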

Shutting down the entire cluster

To shut the entire cluster down

  1. Stop all running databases.
  2. Stop Oracle cluster services on all nodes.

    # crsctl stop cluster -all
  3. Stop all cluster node VMs using Azure console.
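
For step 1, a minimal example of stopping a database with srvctl, assuming a database named orcl (a placeholder name; run as the Oracle software owner on any node):

    $ srvctl stop database -d orcl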

Resizing database node VMs

Resizing database node VMs may be needed for performance or cost reasons. Resizing can be done one node at a time without causing database downtime.

To resize database node VMs in a running cluster, repeat the following steps on each database node, one node at a time

  1. Update the SGA and PGA sizing parameters for the databases according to the new VM memory size (see the sketch after this list).

  2. Skip this step unless the vm.nr_hugepages parameter in /etc/sysctl.conf was configured manually. If it was, update it according to the new VM memory size. Note that starting with Storage Fabric 19.02, HugePages are configured automatically by default and no manual change is required.

  3. Make sure there are no other nodes that are in offline or re-syncing state. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  4. Stop all local database instances running on the node.

  5. Stop Oracle CRS on the node:

    # crsctl stop crs
  6. Stop the FlashGrid Storage Fabric services on the node:

    # flashgrid-node stop
  7. Stop the VM using Azure console.

  8. Resize the VM using Azure console.

  9. Start the VM using Azure console.

  10. Wait until all disks are back online and resyncing operations complete on all disk groups. All disk groups must have zero offline disks and Resync = No.

    # flashgrid-cluster
  11. Start all database instances on the node.

  12. Proceed to the next node.
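
The following sketch illustrates steps 1 and 7 through 9 for one node. The database name orcl, memory values, resource group myrg, node name rac1, and target size Standard_E16s_v3 are placeholders to adapt to your environment, not recommendations:

     SQL> alter system set sga_target=48G scope=spfile sid='*';
     SQL> alter system set pga_aggregate_target=16G scope=spfile sid='*';

    $ az vm deallocate --resource-group myrg --name rac1
    $ az vm resize --resource-group myrg --name rac1 --size Standard_E16s_v3
    $ az vm start --resource-group myrg --name rac1

With scope=spfile the new memory parameters take effect when the instances are started again in step 11.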

Adding disks for use in ASM

When adding new disks, make sure that each disk group has disks of the same size and that the number of disks per node is the same.

To add new disks in a running cluster

  1. Create and attach new disks to the database node VMs. Attach the disks using LUN numbers 1 through 49 - these LUNs will be shared automatically by FlashGrid Storage Fabric. (An Azure CLI sketch follows this list.)

    Note: Read-only caching must be enabled for all new disks. Read-Write and None modes are not supported and may create reliability problems.

  2. Confirm the FlashGrid names of the new drives, e.g. rac2.lun3:

    $ flashgrid-cluster drives

    If the new drives are not listed, check that the corresponding devices (e.g. /dev/lun3) are visible in the OS. If they are visible in the OS, run flashgrid-node reload-config as root and check the output of flashgrid-cluster drives again. If they are not visible in the OS, double-check that you attached them with the correct LUN numbers.

  3. Add the new disks to an existing disk group (or create a new disk group).

    Example A (adding 2 disks rac1.lun3 and rac2.lun3):

    $ flashgrid-dg add-disks -G MYDG -d /dev/flashgrid/rac1.lun3 /dev/flashgrid/rac2.lun3

    Example B (using wildcards for adding 6 disks lun3/lun4/lun5 on rac1/rac2 nodes):

    $ flashgrid-dg add-disks -G MYDG -d /dev/flashgrid/rac[12].lun[3-5]
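
For step 1, a hedged Azure CLI sketch of creating and attaching one new disk per node, assuming a resource group named myrg, node VMs rac1 and rac2, 512 GiB Premium SSD disks, and LUN 3 (all placeholder values):

    $ az disk create --resource-group myrg --name rac1-lun3 --size-gb 512 --sku Premium_LRS
    $ az vm disk attach --resource-group myrg --vm-name rac1 --name rac1-lun3 --lun 3 --caching ReadOnly
    $ az disk create --resource-group myrg --name rac2-lun3 --size-gb 512 --sku Premium_LRS
    $ az vm disk attach --resource-group myrg --vm-name rac2 --name rac2-lun3 --lun 3 --caching ReadOnly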

Removing disks

To remove disks from a running cluster

  1. Determine FlashGrid names of the drives to be removed, e.g. rac2.lun3:

    $ flashgrid-cluster drives
  2. If the drives are members of an ASM disk group, drop them from the disk group first. Example:
     SQL> alter diskgroup MYDG
     drop disk RAC1$LUN3
     drop disk RAC2$LUN3
     rebalance wait;
  3. Prepare the disks for removal. Example:

    [fg@rac1 ~] $ sudo flashgrid-node stop-target /dev/flashgrid/rac1.lun3
    [fg@rac2 ~] $ sudo flashgrid-node stop-target /dev/flashgrid/rac2.lun3
  4. Detach the disks from the VMs.
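
A hedged Azure CLI alternative for step 4, assuming a resource group named myrg and data disks named rac1-lun3 and rac2-lun3 attached to VMs rac1 and rac2 (placeholder names):

    $ az vm disk detach --resource-group myrg --vm-name rac1 --name rac1-lun3
    $ az vm disk detach --resource-group myrg --vm-name rac2 --name rac2-lun3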

Re-adding a lost disk

ASM will drop a disk from a disk group if the disk stays offline for longer than the disk repair time. If the disk was taken offline because of an intermittent problem, for example a network problem, then you can re-add such a disk to the disk group. The force option must be used when re-adding the disk because it already contains ASM metadata.

Example of re-adding a regular disk:

   # flashgrid-dg add-disks -G MYDG -d /dev/flashgrid/rac2.lun3 -f

Example of re-adding a quorum disk:

   # flashgrid-dg add-disks -G MYDG -q /dev/flashgrid/racq.lun2 -f
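
How long ASM keeps an offline disk before dropping it is controlled by the disk_repair_time disk group attribute. A minimal sqlplus sketch for checking it and, if needed, extending it before planned maintenance (the MYDG name and 8.5h value are examples only):

     SQL> select group_number, value from v$asm_attribute where name = 'disk_repair_time';
     SQL> alter diskgroup MYDG set attribute 'disk_repair_time' = '8.5h';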

Updating FlashGrid software and Linux kernel using SkyCluster Node Update package

The SkyCluster Node Update package is a single self-extracting bash script that updates the following components:

  • FlashGrid Storage Fabric RPM
  • FlashGrid Diagnostics RPM
  • FlashGrid Cloud Area Network RPM
  • Linux kernel

Using this package makes it easier to update to the latest validated set of software components and helps avoid accidental installation of incompatible software versions.

Note: Review the corresponding release notes and check with FlashGrid support before performing any major version update. A major version is identified by the first two numbers; the third number represents a revision (hotfix). For example, an update from version 19.02.x to 19.05.x is a major update, while an update from 19.05.100 to 19.05.200 is a hotfix revision.

Note: Simultaneously updating FlashGrid software and applying Grid Infrastructure patches in rolling fashion is not recommended. FlashGrid services should not be stopped while the GI cluster is in rolling patching mode.

To update software using the SkyCluster Node Update package on a running cluster, repeat the following steps on each node, one node at a time

  1. Make sure there are no other nodes that are in offline or re-syncing state. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  2. If the node is a database node,

    a. Stop all local database instances running on the node.

    b. Stop Oracle CRS on the node:

    # crsctl stop crs
  3. Stop the FlashGrid Diagnostics monitoring service:

    # systemctl stop flashgrid-node-monitor
  4. Stop the FlashGrid Storage Fabric services on the node:

    # flashgrid-node stop
  5. Stop the FlashGrid Cloud Area Network service on the node:

    # systemctl stop flashgrid-clan
  6. Run the update script as root. Example:

    # bash skycluster_node_update-19.5.17.85011.sh
  7. Reboot the node:

    # reboot
  8. Wait until the node boots up, all disks are back online, and resyncing operations complete on all disk groups. All disk groups must have zero offline disks and Resync = No before it is safe to update the next node.

    # flashgrid-cluster
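
As an optional sanity check after the reboot (not part of the official procedure), the running kernel and installed FlashGrid RPM versions can be verified with standard tools:

    # uname -r
    # rpm -qa | grep -i flashgrid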

Updating FlashGrid software RPMs

Note: In most cases, using the SkyCluster Node Update package is the recommended way to update FlashGrid software and the OS kernel.

Note: Review the corresponding release notes and check with FlashGrid support before performing any major version update. A major version is identified by the first two numbers; the third number represents a revision (hotfix). For example, an update from version 19.02.x to 19.05.x is a major update, while an update from 19.05.100 to 19.05.200 is a hotfix revision.

Note: Simultaneously updating FlashGrid software and applying Grid Infrastructure patches in rolling fashion is not recommended. FlashGrid services should not be stopped while the GI cluster is in rolling patching mode.

To update the flashgrid-sf, flashgrid-diags, and flashgrid-clan RPMs on a running cluster, repeat the following steps on each node, one node at a time

  1. Make sure there are no other nodes that are in offline or re-syncing state. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  2. If the node is a database node,

    a. Stop all local database instances running on the node.

    b. Stop Oracle CRS on the node:

    # crsctl stop crs
  3. Stop the FlashGrid Storage Fabric services on the node:

    # flashgrid-node stop
  4. Stop the FlashGrid Cloud Area Network service on the node:

    # systemctl stop flashgrid-clan
  5. Update the flashgrid-sf, flashgrid-clan, and flashgrid-diags RPMs on the node using the yum or rpm tool (see the example after this list).
  6. Start the FlashGrid Cloud Area Network service on the node:

    # systemctl start flashgrid-clan
  7. Start the FlashGrid Storage Fabric service on the node:

    # flashgrid-node start
  8. Restart the FlashGrid Diagnostics monitoring service on the node:

    # systemctl restart flashgrid-node-monitor
  9. If the node is a database node then start Oracle services on the node:
    # systemctl start oracle-ohasd
    # crsctl start crs -wait
  10. Wait until all disks are back online and resyncing operations complete on all disk groups before updating the next node. All disk groups must have zero offline disks and Resync = No.

    # flashgrid-cluster
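
A hedged example for step 5, assuming the updated packages are available in a configured yum repository (alternatively, rpm -Uvh can be used with locally downloaded RPM files):

    # yum update flashgrid-sf flashgrid-diags flashgrid-clan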

To update only flashgrid-diags RPM on a node

  1. Update the flashgrid-diags RPM using the yum or rpm tool.

  2. Restart the flashgrid-node-monitor service (this can be done without stopping any Oracle services):
    # systemctl restart flashgrid-node-monitor

Updating OS

Note: Simultaneously updating OS and applying Grid Infrastructure patches in rolling fashion is not recommended. Nodes should not be rebooted while the GI cluster is in rolling patching mode.

Note: Running yum update without first stopping Oracle and FlashGrid services may result in the services restarting non-gracefully during the update.

To update the OS on a running cluster, repeat the following steps on each node, one node at a time

  1. Make sure there are no other nodes that are in offline or re-syncing state. All disk groups must have zero offline disks and Resync = No:

    # flashgrid-cluster
  2. If the node is a database node,

    a. Stop all local database instances running on the node.

    b. Stop Oracle CRS on the node:

    # crsctl stop crs
  3. Stop FlashGrid Storage Fabric services on the node:

    # flashgrid-node stop
  4. Install OS updates:

    # yum update
  5. Reboot the node:

    # reboot
  6. Wait until the node boots up, all disks are back online, and resyncing operations complete on all disk groups. All disk groups must have zero offline disks and Resync = No before it is safe to update the next node.

    # flashgrid-cluster

Applying Grid Infrastructure and Database patches

For applying single patches or Release Updates / Patch Set Updates to Grid Infrastructure or Database homes, follow the standard procedures documented by Oracle.

Note: While GI cluster is in rolling patching mode, do not reboot any nodes and do not stop FlashGrid services. Updating OS or FlashGrid software simultaneously with applying Grid Infrastructure patches in rolling fashion is not recommended.

Note: Before applying the latest Release Update from Oracle, we recommend requesting confirmation from FlashGrid support. FlashGrid validates every Release Update to minimize the risk of compatibility or reliability issues. Validation is typically completed within 3 weeks after the Release Update becomes publicly available.