Things to Come From the Cloudera/Hortonworks Merger

Now that the two Hadoop distribution giants have merged, it is time to call out what will happen to their overlapping software offerings. The following are my predictions:

Ambari is out – replaced by Cloudera Manager.
This is a no-brainer for anyone who has used the two tools. People can rant and rave about open source and freedom all they want, but Cloudera Manager is light-years ahead of Ambari in functionality. For one thing, Ambari can only deploy a single cluster, while CM can deploy multiple clusters. And the two features I personally use the most in my job as a consultant are nowhere to be found in Ambari: the Host/Role layout and the non-default Configuration view.

Tez is out – replaced by Spark.
Cloudera has already declared that Spark has replaced MapReduce. There is little reason for Tez to remain as a Hive execution engine when Spark does the same things and can also be used for general computation outside of Hive.

Hive LLAP is out – replaced by Impala.
Similar to Tez, there is no reason to keep interactive query performance tools for Hive around when Impala was designed to do just that. Remember: Hive is for batch and Impala is for exploration.

What do you think? Leave your thoughts in the comments.

Hadoop Cluster Sizes

A few years ago, I presented "Hadoop Operations: Starting Out Small / So Your Cluster Isn’t Yahoo-sized (yet)" at a conference. It included a definition of Hadoop cluster sizes. I am posting that definition here so it is easier to reference in the future.

Question: What is a tiny/small/medium/large [Hadoop] cluster?

Answer:

  • Tiny: 1-9 nodes
  • Small: 10-99 nodes
  • Medium: 100-999 nodes
  • Large: 1000+ nodes
  • Yahoo-sized: 4000 nodes
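For anyone who wants to apply these buckets in a script (say, when tagging clusters in an inventory), here is a minimal sketch. The function name cluster_size is hypothetical, and it makes one assumption the list above does not state: any cluster of 4000 or more nodes is treated as "Yahoo-sized" rather than "Large", since the two ranges overlap.

```python
def cluster_size(nodes: int) -> str:
    """Map a node count to the size labels defined above.

    Assumption (not in the original definition): 4000 or more nodes is
    reported as "Yahoo-sized", so "Large" effectively covers 1000-3999.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if nodes >= 4000:
        return "Yahoo-sized"
    if nodes >= 1000:
        return "Large"
    if nodes >= 100:
        return "Medium"
    if nodes >= 10:
        return "Small"
    return "Tiny"


if __name__ == "__main__":
    # Print the label at each bucket boundary.
    for n in (1, 9, 10, 99, 100, 999, 1000, 4000):
        print(n, cluster_size(n))
```

If you would rather keep "Yahoo-sized" as a nickname instead of its own bucket, drop the first size check and everything from 1000 nodes up comes back as "Large".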