Slides and Video from My Talk at PDC 2019

I got a chance to speak to Hadoop folks at this year's Pune Data Conference, held in Pune, India.

My talk was titled "Admins: Smoke Test Your Hadoop Cluster!" This is the abstract:

Software smoke testing is a preliminary level of testing. It makes certain that all of the primary components of a system are functioning correctly. For example, when installing a new secured Hadoop cluster, running a series of quick tests to make sure that things like HDFS and MapReduce are operational can save a lot of headache before enabling Kerberos. Smoke tests can also save you time and embarrassment by making sure that things work before you turn the cluster over to your customer.

In this talk, Michael Arnold will explain the utility of testing Hadoop components after cluster builds and software upgrades. Michael will present code examples that you can use to confirm functionality of Spark, Kudu, HBase, Kafka, MapReduce, etc. on your cluster.
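As a flavor of the idea, here is a minimal pass/fail smoke-test wrapper of my own sketching (the commented `hdfs`/`yarn` checks are illustrative, not the exact commands from the talk, and the example-jar path is deliberately left incomplete):

```shell
# Tiny harness: run a command, swallow its output, report PASS or FAIL.
smoke() {
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
  fi
}

smoke "local shell" true   # prints: PASS: local shell
# On a real cluster you would add checks such as:
# smoke "HDFS write"    hdfs dfs -put /etc/hosts /tmp/smoke.$$
# smoke "HDFS read"     hdfs dfs -cat /tmp/smoke.$$
# smoke "MapReduce job" yarn jar .../hadoop-mapreduce-examples.jar pi 2 10
```

Because each check is just a command with an exit code, the same wrapper works for HBase shell probes, Kafka console producers, or anything else you can run non-interactively.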

This is the link to the slide presentation and video.

Things to Come From the Cloudera/Hortonworks Merger

Now that the two Hadoop distribution giants have merged, it is time to call out what will happen to their overlapping software offerings. The following are my predictions:

Ambari is out – replaced by Cloudera Manager.
This is a no-brainer for anyone who has used the two tools. People can rant and rave about open source and freedom all they want, but Cloudera Manager is light-years ahead of Ambari in terms of functionality and features. Ambari can only deploy a single cluster; CM can deploy multiple clusters. And the two features I personally use the most in my job as a consultant are nowhere to be found in Ambari: the Host/Role layout and the non-default Configuration view.

Tez is out – replaced by Spark.
Cloudera has already declared that Spark has replaced MapReduce. There is little reason for Tez to remain as a Hive execution engine when Spark does the same things and can also be used for general computation outside of Hive.

Hive LLAP is out – replaced by Impala.
Similar to Tez, there is no reason to keep interactive query performance tools for Hive around when Impala was designed to do just that. Remember: Hive is for batch and Impala is for exploration.

What do you think? Leave your thoughts in the comments.

Hadoop Cluster Sizes

A few years ago, I presented Hadoop Operations: Starting Out Small / So Your Cluster Isn’t Yahoo-sized (yet) at a conference. It included a definition of Hadoop cluster sizes. I am posting those words here to ease future references to that definition.

Question: What is a tiny/small/medium/large [Hadoop] cluster?


  • Tiny: 1-9 nodes
  • Small: 10-99 nodes
  • Medium: 100-999 nodes
  • Large: 1000+ nodes
  • Yahoo-sized: 4000 nodes
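The buckets above map cleanly to a tiny shell helper (my own sketch, not from the talk):

```shell
# Classify a node count into the size buckets defined above.
cluster_size() {
  n=$1
  if   [ "$n" -ge 1000 ]; then echo "Large"   # 1000+ nodes (Yahoo-sized at ~4000)
  elif [ "$n" -ge 100  ]; then echo "Medium"  # 100-999 nodes
  elif [ "$n" -ge 10   ]; then echo "Small"   # 10-99 nodes
  else                         echo "Tiny"    # 1-9 nodes
  fi
}

cluster_size 42   # prints: Small
```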

Failed Disk Replacement with Navigator Encrypt

Hardware fails, especially hard disks. Your Hadoop cluster will operate with less capacity until that failed disk is replaced, and full disk encryption adds to the replacement trouble. Here is how to replace the disk without bringing down the entire machine (assuming, of course, that your disk is hot-swappable). This process assumes the following:


  • Cloudera Hadoop and/or Cloudera Kafka environment.
  • Cloudera Manager is in use.
  • Cloudera Navigator Encrypt is in use.
  • Physical hardware that allows a data disk to be hot swapped without powering down the entire machine. Otherwise you can pretty much skip steps 2 and 4.
  • We are replacing a data disk and not an OS disk.


puppet cloudera module 3.0.0

This is a major release of my Puppet module to deploy Cloudera Manager. The major change is that razorsedge/cloudera now supports the latest releases of its dependent modules. razorsedge/cloudera had been lagging behind due to the need to support Puppet Enterprise 3.0.1 installations, and only recently did those installations finally upgrade.

Notable changes are:

Let me know if you have any feedback!

puppet cloudera module 2.0.2

This is a minor bugfix release of my Puppet module to deploy Cloudera Manager. When I released the module, I had assumed that the testing I did against the C5 beta2 would be 100% valid for C5 GA. It turns out that Cloudera shipped a newer version of the Oracle 7 JDK, and a symlink that the module creates on RedHat and Suse (/usr/java/default) was pointing at the wrong location. Upgrading to razorsedge/cloudera 2.0.2 will fix the issue.
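The underlying fix is just repointing a stale symlink. A sketch of the idea (demonstrated in a temp directory so it runs anywhere; the JDK version numbers are made up, and on a real host the link is /usr/java/default):

```shell
# Simulate a stale JDK symlink and repoint it at a newer install.
tmp=$(mktemp -d)
mkdir -p "$tmp/jdk1.7.0_45" "$tmp/jdk1.7.0_55"

# The stale link left behind after the newer JDK package arrived:
ln -s "$tmp/jdk1.7.0_45" "$tmp/default"

# Repoint it: -f replaces the link, -n avoids descending into the old target.
ln -sfn "$tmp/jdk1.7.0_55" "$tmp/default"

readlink "$tmp/default"   # prints the path of the newer JDK
```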

Lesson learned: Test, test, and test some more.

Thanks to yuzi-co for reporting the problem.

Let me know if you have any feedback!

puppet cloudera module 2.0.1

This is a major release of my Puppet module to deploy Cloudera Manager. The major change is that razorsedge/cloudera now supports Cloudera’s latest release, Cloudera Enterprise 5, which adds support for Cloudera Manager 5 and Cloudera’s Distribution of Apache Hadoop (CDH) 5. Additionally, this module and its deployment via Puppet Enterprise 3.2 have been tested and validated by Cloudera to work with Cloudera Enterprise 5.

This module is certified on Cloudera 5.

Other changes are:

  • All interaction with the cloudera module can now be done through the main ::cloudera class, including installation of the CM server. This means you can simply toggle the options in ::cloudera to have full functionality of the module.
  • Official operating system support for Debian 7.
  • Installation of Oracle JDK 7.
  • Recommended tuning of the vm.swappiness kernel parameter.
  • Installation of native LZO libraries when the parameter install_lzo => true is selected, even when installing via parcels.
  • Conversion of the README to the Puppet Labs recommended README.markdown formatting. This has dramatically improved the presentation of the things one needs to know about the module in order to quickly become productive.
  • Taking advantage of the new module metadata to add compatibility information to the module page on the Puppet Forge.

If you have not seen the previous changes in version 1.0.1, here is a recap:

  • Allow for use of an external Java module. Not everyone will want to stick with the older version of the Oracle JDK that Cloudera ships in their software repositories. If you have a module that provides the Oracle JDK and sets $JAVA_HOME in the environment, then just set install_java => false in Class['cloudera'] and make sure the JDK is installed before calling Class['cloudera'].
  • Integrated installation of the Oracle Java Cryptography Extension (JCE) unlimited strength jurisdiction policy files. Set the parameter install_jce => true in Class['cloudera'].
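Putting those options together, a declaration might look like the following sketch (the install_java, install_jce, and install_lzo parameter names come from this post; treat the combination shown as illustrative, not a recommended configuration):

```puppet
# Sketch only: toggle module behavior through the main ::cloudera class.
class { 'cloudera':
  install_java => false,  # a separate module provides the Oracle JDK and $JAVA_HOME
  install_jce  => true,   # lay down the JCE unlimited strength policy files
  install_lzo  => true,   # install native LZO libraries, even with parcels
}
```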

Deprecation Warnings

  • The class parameters and variables yumserver and yumpath have been renamed to reposerver and repopath, respectively. This makes the names more generic, as they apply to APT and Zypprepo as well as YUM package repositories.
  • The use_gplextras parameter has been renamed to install_lzo.

Note that this module does not support upgrading from CDH4 to CDH5 packages, including Impala, Search, and GPL Extras.

Let me know if you have any feedback!