Hadoop, too much local cache is a bad thing

Lessons learned here.

1. Too much local cache is a bad thing.
2. Why does a TaskTracker take so long after a restart to receive tasks?

I run a very stable 36 node Hadoop cluster for our Hive data warehouse.
Recently, however, we started having TaskTrackers get blacklisted.
After eliminating the usual suspects of hardware problems, networking, jobs gone wild, memory problems and so on, we started digging into the logs.
One line consistently showed up:

WARN fs.LocalDirAllocator$AllocatorPerContext (LocalDirAllocator.java:createPath(256)) - org.apache.hadoop.util.DiskChecker$DiskErrorException: can not create directory: /disk2/hadoop/mapred/local/taskTracker/archive/hadoopm101.sacpa.videoegg.com/export/hadoop/temp/hadoop-hadoop/mapred/system/job_201003012328_167821/libjars

After confirming that the disk itself was fine, we started looking at the file system.
Again we eliminated the usual suspects: disk space, inodes, iowait.
It wasn’t until we started traversing the file system that we found our first clue.
An ls -l hung. That’s a red flag for any sysadmin. The reason: we had hit ext3’s limit of roughly 32,000 subdirectories per directory.

[root@hadoop2108.sacpa hadoop]# ls /disk1/hadoop/mapred/local/taskTracker/archive/hadoopm101.sacpa/export/hadoop/temp/hadoop-hadoop/mapred/system | wc -l
31998
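
A quick way to spot other nodes creeping toward the same wall is to count the immediate subdirectories under each disk’s archive directory. This is only a sketch, assuming our disk layout (/disk1 through /disk4) and an arbitrary warning threshold of 31,000, just under the ext3 hard limit:

# Sketch: warn when any directory under the distributed-cache archive nears
# the ext3 subdirectory limit (~32,000 entries). Paths and threshold are assumptions.
for d in /disk{1,2,3,4}/hadoop/mapred/local/taskTracker/archive; do
  find "$d" -type d -print0 | while IFS= read -r -d '' dir; do
    n=$(find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
    [ "$n" -gt 31000 ] && echo "WARNING: $dir has $n subdirectories"
  done
done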

Even though the local cache held thousands of files and directories, some as old as three months, they were never cleaned out because we still hadn’t reached the default limit of 10GB.
In our case, with four disks per node, that’s 40GB of local cache per node.
The fix in our case was to drop the local.disk.cache setting in core-site.xml from 10GB to 1GB.
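
For reference, as the first comment below points out, the property in this era of Hadoop is actually local.cache.size (later renamed mapreduce.tasktracker.cache.local.size), and it takes a value in bytes. A sketch of what the change might look like:

<!-- Sketch: drop the TaskTracker distributed-cache limit from the 10GB default -->
<!-- (10737418240 bytes) to 1GB. The value is specified in bytes. -->
<property>
  <name>local.cache.size</name>
  <value>1073741824</value>
</property>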
The second lesson we learned is that when you start a TaskTracker it won’t receive any work until it has cleaned out its local cache. In our case that was taking up to 15 minutes.
This was difficult to discover because, even at DEBUG level, nothing about the cleanup is written to the logs.
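
Since nothing shows up in the logs, one simple way to confirm the cleanup is actually happening is to watch the cache directories themselves while the TaskTracker starts. A rough sketch, again assuming our disk layout:

# Sketch: print the size of each disk's distributed-cache directory every 30 seconds.
# Once the sizes stop shrinking, the TaskTracker should start accepting tasks again.
while true; do
  date
  du -sh /disk*/hadoop/mapred/local/taskTracker/archive
  sleep 30
done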


2 thoughts on “Hadoop, too much local cache is a bad thing”

  1. I think the parameter you are talking about is local.cache.size and not local.disk.cache. It has since also been renamed to mapreduce.tasktracker.cache.local.size: https://issues.apache.org/jira/browse/MAPREDUCE-2379

  2. My problem is the same as yours: I have enough space, but after 3 days of crawling 150,000 links this exception occurs. I used the script from the Nutch wiki for crawling and the default Hadoop conf. Are you sure all of this happens because of the Hadoop local cache?
