Hadoop: topology, rack awareness and NameNode caching

When attempting to add 10 more nodes to our cluster last week I made several mistakes which brought to light some “features” of hadoop’s rack aware topology.

Lesson #1-> Before adding a new node make sure you can resolve the host by name and ip(A & PTR) using the rack aware script.

The first mistake I made was only adding the A records and not the PTR records. The rack aware script appears to mostly look up the node by ip address. Our topology is in the syntax of /row$/rack$ and in my script if it can’t resolve the host it will return with a default of /default-rack.

Lesson #2 -> The namenode permantly caches the results of the rack aware script, if you make a mistake or are moving nodes around you will have to restart the NameNode daemon.

I have a theory(just a theory) that if the syntax of the default rack matched the other nodes ie. /row1/rack0 then it would run fine.

Lesson #3-> A  datanode with a poor topology mapping can break map reduce.

Okay so that this point I had one new node in the cluster who’s topology resolved to /default-rack and it was only running datanode. HDFS ran fine, it was getting blocks posted to it hourly but then map reduce jobs started getting hung. What, how can a poorly configured datanode break map reduce? Sure enough jobs running with blocks on the new node wear hanging, jobs from older data(I hadn’t run balancer yet) ran fine.

The fix was to add the A and PTR records for the new nodes and restart the NameNode.

Thank you to goturtlesgo from the #hadooop IRC room for helping confirm these “features”.

Jira ticket about NameNode permantly caching rack_awareness script look-ups.





Hadoop, too much local cache is a bad thing

Lessons learned here.

1. Too much local cache is a bad thing.
2. Why does it take so long after a tasktracker restart to receive tasks?

I run  a very stable 36 node hadoop cluster for our hive data warehouse.
Recently however we started having TaskTrackers being blacklisted.
After eliminating the usual suspects of hardware problems, networking, jobs gone wild, memory problems etc…..
We started digging into the logs.
One line consistently showed up:

WARN fs.LocalDirAllocator$AllocatorPerContext (LocalDirAllocator.java:createPath(256)) –org.apache.hadoop.util.DiskChecker$DiskErrorException: can not create directory: /disk2/hadoop/mapred/local/taskTracker/archive/hadoopm101.sacpa.videoegg.com/export/hadoop/temp/hadoop-hadoop/mapred/system/job_201003012328_167821/libjars

After confirming that the disk was fine we started looking into the file system.
Again we eliminated the usual suspects of disk space, inodes, iowait.
It wasn’t until we started traversing the file system that found our first clue.
An ls -l hung. Red flag for a sysadmin. The reason that we hit the limit in the number of sub directories for ext3.

[root@hadoop2108.sacpa hadoop]# ls /disk1/hadoop/mapred/local/taskTracker/archive/hadoopm101.sacpa/export/hadoop/temp/hadoop-hadoop/mapred/system | wc -l

Even though the local cache has thousands of files and directories some as old as three months they we never cleaned out because we still hadn’t reached the default limit of 10GB.
In our case because we four disks per node thats 40GB of local cache per node.
The fix in our case was to drop the local.disk.cache setting in core-site.xml from 10GB to 1GB.
The second lesson we learned was that when you start a tasktracker it won’t receive any work until it cleans out its local cache. In our case it was taking up to 15 minutes.
This was difficult to discover as even as DEBUG mode nothing about this is written to the logs.