While attempting to add 10 more nodes to our cluster last week, I made several mistakes that brought to light some “features” of Hadoop’s rack-aware topology.
Lesson #1 -> Before adding a new node, make sure the rack-aware script can resolve the host by both name and IP (A & PTR records).
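If you want to check both directions before dropping a node into the cluster, a few lines of Python will do it. This is just a sanity-check sketch; the hostname is a made-up example.

    # Verify the A record (name -> IP) and the PTR record (IP -> name)
    # for a node before adding it, since the rack-aware script mostly
    # sees IP addresses. The hostname below is a made-up example.
    import socket

    host = "dn0102.example.com"              # hypothetical new DataNode
    ip = socket.gethostbyname(host)          # A record lookup
    name = socket.gethostbyaddr(ip)[0]       # PTR record lookup

    print("A   : %s -> %s" % (host, ip))
    print("PTR : %s -> %s" % (ip, name))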
The first mistake I made was adding only the A records and not the PTR records. The rack-aware script appears to look nodes up mostly by IP address. Our topology uses the syntax /row$/rack$, and if my script can’t resolve the host it returns a default of /default-rack.
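For reference, here is a rough sketch of what a rack-aware script in that style can look like; it is not our actual script, and the dnROWRACK naming convention (dn0102 -> /row1/rack2) is made up for the example. Hadoop hands the script one or more IPs or hostnames as arguments and expects one rack path per argument on stdout; the script itself is wired in through topology.script.file.name in core-site.xml (net.topology.script.file.name on newer releases).

    #!/usr/bin/env python
    # Rough sketch of a /row$/rack$ topology script. Hadoop passes one
    # or more IPs (or hostnames) as arguments and expects one rack path
    # per argument on stdout. The dnROWRACK naming convention here is
    # made up for the example (dn0102 -> /row1/rack2).
    import re
    import socket
    import sys

    DEFAULT = "/default-rack"   # fallback when the lookup fails

    def rack_for(arg):
        try:
            # The script is usually handed an IP, so resolve it back to
            # a hostname via the PTR record -- which is why a missing
            # PTR record silently drops a node into the default rack.
            if re.match(r"^[0-9.]+$", arg):
                host = socket.gethostbyaddr(arg)[0]
            else:
                host = arg
            m = re.match(r"dn(\d\d)(\d\d)", host)
            if not m:
                return DEFAULT
            return "/row%d/rack%d" % (int(m.group(1)), int(m.group(2)))
        except socket.error:
            return DEFAULT

    if __name__ == "__main__":
        for arg in sys.argv[1:]:
            print(rack_for(arg))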
Lesson #2 -> The NameNode permanently caches the results of the rack-aware script; if you make a mistake or are moving nodes around, you will have to restart the NameNode daemon.
I have a theory (just a theory) that if the syntax of the default rack matched that of the other nodes, i.e. /row1/rack0, then it would run fine.
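If that theory holds, the simplest tweak to the sketch above would be a fallback with the same two-level depth as the real mappings, something like this (untested, just illustrating the idea):

    # Untested variant of the fallback: keep the same two-level
    # /row$/rack$ depth instead of the one-level /default-rack.
    DEFAULT = "/default-row/default-rack"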
Lesson #3 -> A DataNode with a poor topology mapping can break MapReduce.
Okay, so at this point I had one new node in the cluster whose topology resolved to /default-rack, and it was only running the DataNode daemon. HDFS ran fine; the node was getting blocks posted to it hourly. But then MapReduce jobs started getting hung. What, how can a poorly configured DataNode break MapReduce? Sure enough, jobs running against blocks on the new node were hanging, while jobs over older data (I hadn’t run the balancer yet) ran fine.
The fix was to add the A and PTR records for the new nodes and restart the NameNode.
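After the restart, it’s worth feeding the new nodes’ IPs straight back through the topology script to confirm none of them still lands in /default-rack. The script path and IPs below are made-up examples.

    # Run the topology script by hand against the new nodes and make
    # sure nothing still maps to the default rack. The path and IPs
    # are hypothetical.
    import subprocess

    new_ips = ["10.0.1.41", "10.0.1.42"]
    out = subprocess.check_output(["/etc/hadoop/topology.py"] + new_ips).decode()
    print(out)
    assert "/default-rack" not in out, "a node is still unresolved"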
Thank you to goturtlesgo from the #hadoop IRC channel for helping confirm these “features”.
Jira ticket about the NameNode permanently caching rack-awareness script look-ups:
https://issues.apache.org/jira/browse/HDFS-870