Becoming a Rock Star Hadoop Administrator, Part 1

Hadoop is just storage and computing, so administering a Hadoop cluster should be a breeze, right? Well… not necessarily. When we're talking about Hadoop, we're talking about a fast-moving open source project that covers many disciplines and requires deep understanding of Linux, Java, and other ecosystem projects with funny names like ZooKeeper, Flume, and Sqoop. Fear not: in these posts, we hope to help you on your journey to becoming a rock star Hadoop administrator.

The Hadoop Administrator Challenge

From an end user perspective, Hadoop is deceptively simple – your developers can go from nothing to a functioning program in just a couple dozen keystrokes. With interfaces like Hive and Pig (and now Impala), end users can do useful work even without knowledge of the inner workings of HDFS or MapReduce.
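
To make that concrete, a single HiveQL statement like the one below is enough to launch a full MapReduce job across the cluster, with Hive handling the HDFS reads and job submission behind the scenes (the web_logs table here is purely illustrative):

    -- one line of HiveQL; Hive generates and submits the MapReduce job for you
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page;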

Administrators, however, are not so lucky.

In order to properly set up and maintain a cluster at optimum performance, Hadoop admins need a decent understanding of (deep breath): Linux; the HDFS and MapReduce daemons (with two flavors of MapReduce and two ways to set up HDFS); Java (especially JVM configuration and tuning); and a handful of Hadoop ecosystem projects.

On top of all that, admins need tools to configure and manage dozens, hundreds, or thousands of independent servers from both a hardware and software perspective.

Perhaps the biggest challenge of such a system is the fast-moving open source nature of the core software itself. New versions of the software quickly make the “old way” of doing things obsolete and inefficient.

The Basics of Hadoop Hardware

As an administrator, your primary responsibility is configuring and maintaining your cluster for optimal performance. From a hardware perspective, this means choosing the best machines you can afford (dual bonded NICs, RAIDed drives, redundant power supplies, carrier-class everything) for your “master” machines (which host your NameNode and JobTracker daemons), and affordable machines (JBOD disks, single NIC, non-redundant power supplies, commodity-class everything) for your slave nodes (which run your DataNode and TaskTracker daemons, in addition to Map and Reduce JVMs). For both, you probably want machines with some growth capacity left in them (allowing you to upgrade the hardware later on and extend their lifetimes). A good rule of thumb is that master machines should cost you no more than $10,000-$12,000 per machine, and slave nodes should be roughly half of that.

After you've got your hardware running, you'll need to configure various parameters for your HDFS and MapReduce daemons, including things like JVM heap size, the number of map and reduce slots, and the amount of disk space allocated to HDFS and to intermediate data.
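
As a rough sketch of where those knobs live, here are the MRv1 and HDFS properties involved; the values below are placeholder assumptions that depend entirely on your hardware and workload, not recommendations:

    <!-- mapred-site.xml (MRv1): task slots, per-task JVM heap, and intermediate data dirs -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>        <!-- map slots per TaskTracker -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>        <!-- reduce slots per TaskTracker -->
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1g</value>   <!-- heap for each map/reduce child JVM -->
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/data/1/mapred/local,/data/2/mapred/local</value>  <!-- intermediate (shuffle) data -->
    </property>

    <!-- hdfs-site.xml: DataNode storage directories and space held back from HDFS -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
    </property>
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>10737418240</value>  <!-- ~10 GB per volume reserved for non-HDFS use -->
    </property>

The daemons' own heaps (NameNode, DataNode, TaskTracker) are set separately, for example via HADOOP_HEAPSIZE in hadoop-env.sh.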

Once you’re up and running, chances are you’re quickly going to need to both tweak existing hardware and software, as well as add new slave nodes (and possibly master nodes). Following best practices for each will save you tons of time and save your company tens or hundreds of thousands of dollars.
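
As a minimal sketch of the "add a slave node" side of that, assuming a package-based CDH4 install (covered in the next section); hostnames, paths, and service names are illustrative and will differ for a tarball install:

    # on the new slave node, once /etc/hadoop/conf is in place
    sudo service hadoop-hdfs-datanode start
    sudo service hadoop-0.20-mapreduce-tasktracker start

    # on the NameNode: if you maintain a dfs.hosts include file, add the new
    # host to it, then ask the NameNode to re-read the list
    sudo -u hdfs hdfs dfsadmin -refreshNodes

    # spread existing blocks onto the new node; 10 = allowed % deviation
    # from the cluster's average disk utilization
    sudo -u hdfs hdfs balancer -threshold 10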

Best Practices: CDH4 and Cloudera Manager

First and foremost, when deploying your cluster, it really makes no sense to use anything but what nearly everyone in the industry is using: Cloudera’s Distribution Including Apache Hadoop, version 4 (CDH4).

With CDH4, Cloudera has packaged and bundled not only base Hadoop, but also a couple hundred patches to Hadoop as well as over a dozen of the commonly used ecosystem projects such as Hive, Pig, Oozie, Flume, and Sqoop (all pre-configured to work in concert together straight out of the gate).

With CDH4, you get Hadoop up and running in just a few minutes and with a high confidence factor that it’s all set up correctly for most use cases. Just as you can download Linux and build your own distribution, you can download base Hadoop, and build it from scratch (assuming you have a couple weeks and Hadoop experts in house who can choose and install the proper patches and ecosystem projects). For 99.9% of the organizations out there, CDH4 just makes the most sense.
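
For reference, a minimal package-based install on a RHEL/CentOS 6 box might look like the sketch below; the repository RPM URL and package names are from the CDH4 documentation of the time, so verify them against Cloudera's current install guide:

    # add Cloudera's CDH4 yum repository (RHEL/CentOS 6 shown; adjust for your OS)
    sudo rpm -ivh http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm

    # master node: NameNode plus MRv1 JobTracker
    sudo yum install -y hadoop-hdfs-namenode hadoop-0.20-mapreduce-jobtracker

    # slave nodes: DataNode plus TaskTracker
    sudo yum install -y hadoop-hdfs-datanode hadoop-0.20-mapreduce-tasktracker

    # format HDFS once (on the NameNode only), then start the daemons
    sudo -u hdfs hdfs namenode -format
    sudo service hadoop-hdfs-namenode start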

It’s worth mentioning two things here: CDH4 is 100% open source and in no way locks you into Cloudera as a provider, and Cloudera is not the only packager of Hadoop. Hortonworks, a Yahoo-backed Cloudera competitor, also packages Hadoop in its own distribution, which again is 100% open source.

When you closely compare the Cloudera and Hortonworks offerings, you find that Cloudera is more widely deployed, has contributed more influential projects (Impala, Flume, Sqoop, etc.) to the Hadoop ecosystem, and employs more core members of the Hadoop community, including Doug Cutting, Tom White, Eric Sammer, Lars George, and Jeff Hammerbacher. I think competition is great and would love to be able to say that Cloudera and Hortonworks are equal, but for now, Cloudera has the history, the deployment base, and, most importantly, the core members of the community.

In addition to giving you a very easy way to get Hadoop up and running, Cloudera also provides an excellent GUI-based maintenance tool called Cloudera Manager. Cloudera Manager takes over the arduous tasks of configuring and managing hundreds or thousands of independent servers in your cluster. Since Cloudera lifted the previous “50 nodes for free” restriction, it can now be used completely free of charge for clusters of any size.
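
Getting Cloudera Manager itself running is a similarly short exercise. The sketch below assumes the CM4 installer binary and its default admin port (7180); check Cloudera's download page for the current URL:

    # on the host that will run Cloudera Manager
    wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
    chmod u+x cloudera-manager-installer.bin
    sudo ./cloudera-manager-installer.bin

    # then browse to http://<cm-host>:7180 and let the wizard push CDH4
    # and the daemon configurations out to the rest of the cluster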

Cloudera Manager can also easily be disabled later should you choose to use another service or homegrown solution to manage your cluster (although it's hard to see why you would).

Related Posts:
Become a Rock Star Hadoop Developer
Using Hadoop Like a Boss
Beyond the Buzz: Big Data and Apache Hadoop

Related Courses:
Cloudera Essentials for Apache Hadoop
Cloudera Administrator Training for Apache Hadoop
Cloudera Training for Apache Hive and Pig

Join the Conversation

  1. Mark Johnson

    “Cloudera Manager can also easily be disabled later”

    I believe Cloudera Manager is required to start and stop services in CDH4 as well as modify configurations.

  2. rICh morrow

    Hi Mark,

    Cloudera Manager definitely makes it *easier* to start/stop services & manage configs, but the “can be disabled” comment was clarifying that you don’t have to use it — One can install CDH4 from tarballs or packages and then just do manual config mods & start/stop of daemons.
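
    For example, with the CDH4 packages installed (a minimal sketch; service names differ for a tarball install):

        # start/stop daemons by hand with the packages' init scripts
        sudo service hadoop-hdfs-namenode start
        sudo service hadoop-0.20-mapreduce-jobtracker start
        sudo service hadoop-hdfs-datanode stop

        # edit configs directly, no CM involved
        vi /etc/hadoop/conf/hdfs-site.xml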

    One could also use Cloudera Manager to just install, then turn off the CM webserver & go back to doing everything in tedious, manual fashion. Global Knowledge is sponsoring a free Webinar on Cloudera Manager next month (Sep 26th), would love to have you join.

    http://www.globalknowledge.com/training/coursewebsem.asp?pageid=9&courseid=20221&catid=248&country=United+States

    Happy Hacking,
    -r

  3. Abbas Ali Zaidi

    Hi,

    I have worked as a Systems Integrator in the IPTV domain for the past two years. The work of a Systems Integrator in the IPTV domain includes installation, configuration, and administration of all the servers running applications pertaining to IPTV.
    The class of servers we use are Sun SPARC and x86 machines running Solaris and Linux respectively.

    Recently, I was exposed to Hadoop due to a POC (Proof of Concept) requirement for an anticipated project raised by the sales team of our organization. The POC involved me in the installation and configuration of a demo cluster of 20 nodes. I was totally alien to Hadoop at that time, and I was intrigued by the power of Hadoop. I started learning Hadoop in my free time and researched the internet about getting certified in Hadoop Administration.

    Now, here I am searching for more wisdom on learning Hadoop and guidance on becoming a certified professional in Hadoop Administration.

    Please reply with some good suggestions. Suggestions may include books, documentation, video lectures, or anything else.

  4. rICh morrow

    Hi Abbas,

    I hate to sound like a salesman, but if your goal is to become a professional Hadoop admin, you should probably register for a professional, instructor-led Hadoop Admin class like http://www.globalknowledge.com/training/certification_listing.asp?pageid=12&certid=1138 . If the costs or timing don't work for you, there are also a ton of good video courses like https://www.udemy.com/courses/search/?q=hadoop and (shameless plug, it's my own): http://bit.ly/get-hadoop

    I've also compiled a whole set of links around the Hadoop ecosystem; these can help you dive deeper into certain topics: bit.ly/hadoop-links

    Happy Hacking,
    -r