start / stop script for mongodb

Mongodb didn’t come with a start up script, at least not on ubuntu.

Below is the start-up script I am using, it uses a config file /etc/mongodb.conf

My settings in the conf file are just some basics listed below:



The start-up script:

    #! /bin/sh


    do_start () {
    echo  “Starting mongo with-> /usr/bin/mongod –config $CONF”
    /usr/bin/mongod –config $CONF

    do_stop () {
    echo “checking if mongo is running”
    PID=$(cat /data/db/mongod.lock)
    if [ -z “$PID” ];
        echo “mongod isn’t running, no need to stop”
    echo ” it is”
    echo “Stopping mongo with-> /bin/kill -2 $PID”
    /bin/kill -2 $PID


    case “$1” in
        echo “Usage: $0 start|stop” >&2
        exit 3





Global site load balancing(GSLB) coming to Amazon’s EC2 with Route 53

Route 53 is Amazon’s DNS product. ( still in beta)

I their latest offering they can host the DNS records for your domain but even more exciting is that in the future they will support GSLB.

This means that you can have instances in their 4 regions( Northern California, Virginia, Ireland & Singapore) and have amazon direct your users to the closest instance.

This will greatly increase the quality of user experience by reducing latency and increase redundancy.

They also plan on tighly integrating it when you launch an instance so you can launch it and give it the DNS name in one step.

Curretly this is a two step process where you launch the instance, wait until you get its amazon DNS and then create the CNAME on your DNS provider.

Also the only to currrently do GLSB is using a third party such as Akamai….which is really expensive BTW.

From the Route 53 Documentation:

“In the future, we plan to add additional integration features such as the ability to automatically tie your Amazon Elastic Load Balancer instances to a DNS name, and the ability to route your customers to the closest EC2 region.”

More on Route 53:


setting up software raid zero over 8 volumes on EC2 using EBS, mdadm and xfs

I’ve read a lot of blog posts about how to set up software raid on EC2 using EBS volumes but didn’t find all the steps in one page, this is my attempt to do so.

Before you start you will need to have the ec2-api-tools installed on your local machine which can be found here:

I won’t go into the steps for installing and configuring but I will list the local envorionment variables that you should have in your .bash_profile

They are:

export PATH=$PATH:/Users/dodell/ec2-api-tools/bin
export EC2_HOME=’/Users/dodell/ec2-api-tools’
export JAVA_HOME=’/System/Library/Frameworks/JavaVM.framework/Home’
export EC2_CERT=/Users/dodell/aws_keys/cert-XXXXXXXXXXXX.pem
export EC2_PRIVATE_KEY=/Users/dodell/aws_keys/pk-XXXXXXXXXXXXXXX.pem

In this post I’m detailing a raid 0 volume I created which is 4TB total in size and striped over 8 EBS volumes.


Also one really helpful trick I found was to tag the volumes you create with the host they will be attached to and the device they will be attached as.

After tagging its really easy to tell which volume gets attached to which host

ec2-describe-volumes vol-XXXXXXX
VOLUME    vol-XXXXXX    500        us-east-1a    in-use    2011-01-11T23:49:19+0000
ATTACHMENT    vol-XXXXXXX    i-XXXXX    /dev/sdm    attached    2011-01-11T23:58:44+0000
TAG    volume    vol-XXXXXXX   host    mdb05
TAG    volume    vol-XXXXXXX   mount    sdm


Steps 1 -4 happen on your local machine

Step 1 – create the volumes

ec2-create-volume -z us-east-1a -s 500

This will create a 500GB volume
Repeat as necessary

Step 2 – tag the volumes with which host they will be attached to

ec2-describe-volumes | grep '2011-01-11T23' | grep available| awk '{print $2}'| ec2-create-tags - -t host=mdb05

This will apply the same tag to all the volumes you created above( you will need to change what you grep for )

Step 3 – attach the volumes to the host

ec2-attach-volume vol-XXXXXX -i i-XXXXX -d /dev/sdm

repeat as necessary

Step 4 – tag the volume with which device it will be mounted as

ec2-create-tags vol-XXXXXX -t mount=sdm

repeat as necessary

Now you have x number of volumes attached to your instance.

Steps 5 – 13 occur on the instance where you attached the volumes

Step 5 – create the raid device using mdadm

sudo mdadm --create /dev/md0 --chunk=256 --level=0 --raid-devices=8 /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt

Step 6 – verify that the raid device /dev/md0 exists

cat /proc/mdstat

Step 7 – add devices to the mdadm.conf file

sudo echo DEVICE /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt| sudo tee /etc/mdadm/mdadm.conf

Step 8 – add the other device info about /dev/md0 to the mdadm.conf file so that it comes back on reboot

sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf

Step 9 – make the file system

sudo mkfs.xfs /dev/md0

Step 10 – mount the file system

sudo mount /dev/md0 /data/

Step 11 – verify the volume exists and the size you expected

df -h /data
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              4.0T  250G  3.7T   7% /data

Step 12 – edit fstab so that the volume comes back on reboot

/dev/md0    /data    xfs    noatime,nodiratime,allocsize=512m      0       0

Step 13 – test all of you work by rebooting


Hadoop: topology, rack awareness and NameNode caching

When attempting to add 10 more nodes to our cluster last week I made several mistakes which brought to light some “features” of hadoop’s rack aware topology.

Lesson #1-> Before adding a new node make sure you can resolve the host by name and ip(A & PTR) using the rack aware script.

The first mistake I made was only adding the A records and not the PTR records. The rack aware script appears to mostly look up the node by ip address. Our topology is in the syntax of /row$/rack$ and in my script if it can’t resolve the host it will return with a default of /default-rack.

Lesson #2 -> The namenode permantly caches the results of the rack aware script, if you make a mistake or are moving nodes around you will have to restart the NameNode daemon.

I have a theory(just a theory) that if the syntax of the default rack matched the other nodes ie. /row1/rack0 then it would run fine.

Lesson #3-> A  datanode with a poor topology mapping can break map reduce.

Okay so that this point I had one new node in the cluster who’s topology resolved to /default-rack and it was only running datanode. HDFS ran fine, it was getting blocks posted to it hourly but then map reduce jobs started getting hung. What, how can a poorly configured datanode break map reduce? Sure enough jobs running with blocks on the new node wear hanging, jobs from older data(I hadn’t run balancer yet) ran fine.

The fix was to add the A and PTR records for the new nodes and restart the NameNode.

Thank you to goturtlesgo from the #hadooop IRC room for helping confirm these “features”.

Jira ticket about NameNode permantly caching rack_awareness script look-ups.




Hadoop, too much local cache is a bad thing

Lessons learned here.

1. Too much local cache is a bad thing.
2. Why does it take so long after a tasktracker restart to receive tasks?

I run  a very stable 36 node hadoop cluster for our hive data warehouse.
Recently however we started having TaskTrackers being blacklisted.
After eliminating the usual suspects of hardware problems, networking, jobs gone wild, memory problems etc…..
We started digging into the logs.
One line consistently showed up:

WARN fs.LocalDirAllocator$AllocatorPerContext ( –org.apache.hadoop.util.DiskChecker$DiskErrorException: can not create directory: /disk2/hadoop/mapred/local/taskTracker/archive/

After confirming that the disk was fine we started looking into the file system.
Again we eliminated the usual suspects of disk space, inodes, iowait.
It wasn’t until we started traversing the file system that found our first clue.
An ls -l hung. Red flag for a sysadmin. The reason that we hit the limit in the number of sub directories for ext3.

[root@hadoop2108.sacpa hadoop]# ls /disk1/hadoop/mapred/local/taskTracker/archive/hadoopm101.sacpa/export/hadoop/temp/hadoop-hadoop/mapred/system | wc -l

Even though the local cache has thousands of files and directories some as old as three months they we never cleaned out because we still hadn’t reached the default limit of 10GB.
In our case because we four disks per node thats 40GB of local cache per node.
The fix in our case was to drop the local.disk.cache setting in core-site.xml from 10GB to 1GB.
The second lesson we learned was that when you start a tasktracker it won’t receive any work until it cleans out its local cache. In our case it was taking up to 15 minutes.
This was difficult to discover as even as DEBUG mode nothing about this is written to the logs.