Failed Disk Replacement with Navigator Encrypt

Hardware fails, and hard disks especially. Your Hadoop cluster will operate with reduced capacity until a failed disk is replaced, and full disk encryption adds to the replacement trouble. Here is how to do it without bringing down the entire machine (assuming, of course, that your disk is hot-swappable).

Assumptions:

  • Cloudera Hadoop and/or Cloudera Kafka environment.
  • Cloudera Manager is in use.
  • Cloudera Navigator Encrypt is in use.
  • Physical hardware that allows a data disk to be hot swapped without powering down the entire machine. Without hot-swap support, you can largely skip steps 2 and 4 and perform the swap with the host powered down.
  • We are replacing a data disk and not an OS disk.

Steps:

The following are steps to replace a failed disk that is encrypted by Cloudera Navigator Encrypt. If any of the settings below are missing from your Cloudera Manager (CM), consider upgrading CM to a newer version.

  1. Determine the failed disk. The example used here is a disk mounted at /data/0.
  2. Configure data directories to remove the disk you are swapping out:
    1. HDFS
      1. Go to the HDFS service.
      2. Click the Instances tab.
      3. Click the affected DataNode.
      4. Click the Configuration tab.
      5. Select Category > Main.
      6. Change the value of the DataNode Data Directory property to remove the directories that are mount points for the disk you are removing.

        Warning: Change the value of this property only for the specific DataNode instance where you are planning to hot swap the disk. Do not edit the role group value for this property. Doing so will cause data loss.

      7. Click Save Changes to commit the changes.
      8. Refresh the affected DataNode. Select Actions > Refresh DataNode configuration.
    2. YARN
      1. Go to the YARN service.
      2. Click the Instances tab.
      3. Click the affected NodeManager.
      4. Click the Configuration tab.
      5. Select Category > Main.
      6. Change the value of the NodeManager Local Directories property to remove the directories that are mount points for the disk you are removing.

        Warning: Change the value of this property only for the specific NodeManager instance where you are planning to hot swap the disk. Do not edit the role group value for this property. Doing so will cause data loss.

      7. Change the value of the NodeManager Container Log Directories property to remove the directories that are mount points for the disk you are removing.

        Warning: Change the value of this property only for the specific NodeManager instance where you are planning to hot swap the disk. Do not edit the role group value for this property. Doing so will cause data loss.

      8. Click Save Changes to commit the changes.
      9. Refresh the affected NodeManager. Select Actions > Refresh NodeManager.
    3. Impala
      1. Go to the Impala service.
      2. Click the Instances tab.
      3. Click the affected Impala Daemon.
      4. Click the Configuration tab.
      5. Select Category > Main.
      6. Change the value of the Impala Daemon Scratch Directories property to remove the directories that are mount points for the disk you are removing.

        Warning: Change the value of this property only for the specific Impala Daemon instance where you are planning to hot swap the disk. Do not edit the role group value for this property. Doing so will cause data loss.

      7. Click Save Changes to commit the changes.
      8. Refresh the affected Impala Daemon. Select Actions > Refresh the Impala Daemon.
    4. Kafka
      1. Go to the Kafka service.
      2. Click the Instances tab.
      3. Click the affected Kafka Broker.
      4. Click the Configuration tab.
      5. Select Category > Main.
      6. Change the value of the Log Directories property to remove the directories that are mount points for the disk you are removing.

        Warning: Change the value of this property only for the specific Kafka Broker instance where you are planning to hot swap the disk. Do not edit the role group value for this property. Doing so will cause data loss.

      7. Click Save Changes to commit the changes.
      8. Refresh the affected Kafka Broker. Select Actions > Refresh Kafka Broker.
  3. Remove the old disk and add the replacement disk.
    1. List out the disks in the system, taking note of the name of the failed disk. (lsblk; lsscsi)
    2. Determine the failed disk. The example used here is /data/0, which is a symlink into the encrypted mount point /navencrypt/0. (readlink -f /data/0)
    3. Determine the Navigator Encrypt DISKID of the failed source device. (grep /navencrypt/0 /etc/navencrypt/ztab)
    4. Clean up the Navigator Encrypt entries. (navencrypt-prepare --undo ${DISKID} || navencrypt-prepare --undo-force ${DISKID})
      1. You may also need to close the device mapping and wipe the old header: (cryptsetup luksClose /dev/mapper/0; dd if=/dev/zero of=${DISK}1 ibs=1M count=1)
    5. Remove failed disk.
    6. Add replacement disk.
    7. Perform any HBA configuration (e.g., Dell PERC or HP Smart Array RAID0 machinations).
    8. Determine the name of the new disk. The example used here is /dev/sdo. (lsblk; lsscsi)
    9. Partition the replacement disk. (parted -s ${DISK} mklabel gpt mkpart primary xfs 1 100%)
    10. Have Navigator Encrypt configure the disk for encryption and write out a new filesystem. (navencrypt-prepare -t xfs -o noatime --use-uuid ${DISK}1 /navencrypt/0)
    11. Recreate the symlink target directory that was installed by navencrypt-move. (mkdir -p $(readlink -f /data/0))
  4. Configure data directories to restore the disk you have swapped in:
    1. HDFS
      1. Change the value of the DataNode Data Directory property to add back the directory that is the mount point for the disk you added.
      2. Click Save Changes to commit the changes.
      3. Refresh the affected DataNode. Select Actions > Refresh DataNode configuration.
      4. Run the HDFS fsck utility to validate the health of HDFS.
    2. YARN
      1. Change the value of the NodeManager Local Directories and NodeManager Container Log Directories properties to add back the directory that is the mount point for the disk you added.
      2. Click Save Changes to commit the changes.
      3. Refresh the affected NodeManager. Select Actions > Refresh NodeManager.
    3. Impala
      1. Change the value of the Impala Daemon Scratch Directories property to add back the directory that is the mount point for the disk you added.
      2. Click Save Changes to commit the changes.
      3. Refresh the affected Impala Daemon. Select Actions > Refresh the Impala Daemon.
    4. Kafka
      1. Change the value of the Log Directories property to add back the directory that is the mount point for the disk you added.
      2. Click Save Changes to commit the changes.
      3. Refresh the affected Kafka Broker. Select Actions > Refresh Kafka Broker.
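
The command-line portion of step 3 can be sketched as the script below. The device name, mount point, and DISKID field position are hypothetical placeholders for what lsblk, lsscsi, and /etc/navencrypt/ztab actually report on your host, and the script only prints each command through a dry-run wrapper so you can review it before anything destructive runs.

```shell
#!/bin/sh
# Sketch of the disk-swap commands from step 3.
# DISK and MOUNT are example values -- substitute what
# lsblk/lsscsi and /etc/navencrypt/ztab show on your host.
DISK=/dev/sdo
MOUNT=/navencrypt/0
# The DISKID field position in ztab may differ; inspect your file.
DISKID=$(grep "$MOUNT" /etc/navencrypt/ztab 2>/dev/null | awk '{print $2}')

# Dry-run wrapper: print each command instead of executing it.
run() {
    echo "+ $*"
}

# Clean up the old Navigator Encrypt entry; fall back to
# --undo-force by hand if the plain undo fails.
run navencrypt-prepare --undo "$DISKID"

# Partition the replacement disk with a single primary partition.
run parted -s "$DISK" mklabel gpt mkpart primary xfs 1 100%

# Encrypt the new partition and mount it at the old location.
run navencrypt-prepare -t xfs -o noatime --use-uuid "${DISK}1" "$MOUNT"

# Recreate the directory that the /data/0 symlink points at.
run mkdir -p "$(readlink -f /data/0 2>/dev/null)"
```

Each command is printed with a leading "+"; once the output matches what you expect, drop the echo in run (or execute the commands by hand) to perform the actual swap.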

Reference Links:

https://www.cloudera.com/documentation/enterprise/latest/topics/admin_dn_swap.html

https://www.cloudera.com/documentation/enterprise/latest/topics/navigator_encrypt_prepare.html#concept_device_uuids

About Michael Arnold
This is where I write about all of my unix hacking experiences so that you may be able to learn from my troubles.
