Speed up your backups by compressing and posting in parallel

It’s pretty typical to have data stores that are several hundred GB in size and need to be posted offsite.

At weheartit our database is ~1/2 TB uncompressed and the old method of compressing and posting to S3 took 9 hours and rarely completed.

I was able to speed up this process and now it completes in < 1 hour.
53 minutes, in fact.

For compression I used pbzip2 instead of gzip.

This is how I am using it along with Percona’s xtrabackup:

innobackupex --user root --password $PASSWORD --slave-info --safe-slave-backup --stream=tar ./ | pbzip2 -f -p40 -c  > $BACKUPDIR/$FILENAME

The backup and compression take only 32 minutes and shrink the database from 432GB to 180GB.

Next comes speeding up the transfer to S3.

In November of 2010 Amazon added multipart upload to S3, but for some reason this functionality hasn’t been added to s3cmd.
Instead I am using s3 multipart upload.
Thanks David Arther!

This is how I am using it:

/usr/local/bin/s3-mp-upload.py --num-processes 40 -s 250 $BACKUPDIR/$FILENAME $BUCKET/$(hostname)/$TIME/$FILENAME

It only takes 20 minutes to copy 180GB over the internet!
That is crazy fast.
In both cases you can play around with the number of threads for pbzip2 and for the multipart upload. The thread counts above work for me, but the right values depend on the size of your system.
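
Putting it all together, here is a minimal sketch of what the whole job can look like as one cron-able script. The variable values (password, backup directory, bucket name) are placeholders I made up, so adjust them for your environment; the commands and thread counts are the same ones shown above.

#!/bin/bash
# Sketch of a combined backup-and-upload job. The variable values below
# are placeholders, not real paths or credentials.
set -e -o pipefail

PASSWORD="yourpassword"            # MySQL root password (placeholder)
BACKUPDIR="/data/backups"          # local staging directory (placeholder)
BUCKET="s3://your-backup-bucket"   # destination bucket (placeholder)
TIME=$(date +%Y-%m-%d_%H%M)
FILENAME="xtrabackup_${TIME}.tar.bz2"

# Stream the backup straight into pbzip2 so compression runs in parallel
# with the backup itself (40 compression threads).
innobackupex --user root --password "$PASSWORD" --slave-info \
  --safe-slave-backup --stream=tar ./ | pbzip2 -f -p40 -c > "$BACKUPDIR/$FILENAME"

# Push the compressed file to S3 with the same multipart options as above.
/usr/local/bin/s3-mp-upload.py --num-processes 40 -s 250 \
  "$BACKUPDIR/$FILENAME" "$BUCKET/$(hostname)/$TIME/$FILENAME"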


What I do in DevOps

I take ownership of a company’s infrastructure.
To manage it I write software.
My language of choice is Ruby and the framework is Chef.
Along with these skills I also bring expertise in many technologies:
MySQL, Apache, Linux, MongoDB, data centers, cloud providers, etc.
I also pair with engineering to streamline processes such as deployments, metrics and performance, scaling, security, and continuous integration and testing.

simple way to get notified when a cronjob fails

Every place I’ve ever worked had cronjobs running all over the place. Some are simple tasks like clearing out a temp directory. Others end up being a critical piece of the infrastructure that a developer wrote without telling anyone about it. I like to call this type of scheduled job the glue, as it’s usually holding your company together.

True story: I once found a cronjob running on a cluster of 200 servers, named brett.sh, that restarted an app every 30 seconds!

In most cases nobody knows where the “glue” cronjob runs, how often it runs, and most importantly when it fails. There are a few tools out there that put all of your scheduled jobs in one spot and take action on failure. Some of those include Opswise (http://www.opswise.com/), which I’ve used in the past and had a lot of success with, and Amazon’s Simple Workflow Service (http://aws.amazon.com/swf/), which I haven’t used yet.

There is also an open source project sponsored by Yelp called Tron which does most of this already, except for notifying when a job fails. BTW, there is a feature request for this already (https://github.com/Yelp/Tron/issues/25).

Anyway, as a quick workaround I just add a check for the exit code in my crontab, which will alert me if the job doesn’t exit zero.

Example:

1 0 * * * touch /home/dodell/foobar || if [ $? -ne 0 ] ; then mail -s 'touch_file failed' dodell@workobee.com < /etc/hostname ; exit 1 ; fi
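
If you have more than a couple of jobs, the same idea can be pulled into a small wrapper so the mail incantation isn’t repeated in every crontab entry. This is just a sketch: the script name and path are made up, while the recipient and the /etc/hostname trick come from the example above.

#!/bin/bash
# /usr/local/bin/cron-alert.sh (hypothetical name): run the given command
# and send mail if it exits non-zero.
"$@"
status=$?
if [ $status -ne 0 ]; then
  mail -s "cronjob failed: $* (exit $status)" dodell@workobee.com < /etc/hostname
fi
exit $status

The crontab entry then becomes:

1 0 * * * /usr/local/bin/cron-alert.sh touch /home/dodell/foobar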

add timestamps to your standard out and standard error

A lot of the time when executing a cronjob or a long running command I capture the standard out and standard error to a log file. This works okay, but without timestamps it isn’t really useful, especially for a job that runs many times a day, which makes it difficult to tell which lines in the log match which run. What I do now is copy a script to all my systems (using Chef of course) which will annotate any output I pipe to it. A command line example:

dodell@spork/etc$ cat resolv.conf | /usr/local/bin/annotate.sh
Thu Sep  6 14:39:59 PDT 2012: # Automatically generated, do not edit
Thu Sep  6 14:39:59 PDT 2012: nameserver 173.203.4.8
Thu Sep  6 14:39:59 PDT 2012: nameserver 173.203.4.9

Okay, not a super useful example, but you get my point. This is even more useful when added to a cronjob:

1 0 * * * /usr/local/bin/percona_backup_and_restore.sh backup 2>&1 | /usr/local/bin/annotate.sh >> /var/log/mysql/xtrabackup.log

and the output:

Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
Thu Sep  6 00:01:02 PDT 2012: and Percona Inc 2009-2012.  All Rights Reserved.
Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: This software is published under
Thu Sep  6 00:01:02 PDT 2012: the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: 120906 00:01:02  innobackupex: Starting mysql with options:  --password=xxxxxxxx --user='debian-sys-maint' --unbuffered --
Thu Sep  6 00:01:02 PDT 2012: 120906 00:01:02  innobackupex: Connected to database with mysql child process (pid=19867)
Thu Sep  6 00:01:08 PDT 2012: 120906 00:01:08  innobackupex: Connection to database server closed
Thu Sep  6 00:01:08 PDT 2012: IMPORTANT: Please check that the backup run completes successfully.
Thu Sep  6 00:01:08 PDT 2012: At the end of a successful backup run innobackupex
Thu Sep  6 00:01:08 PDT 2012: prints "completed OK!".

Ah, how beautiful: standard out and error with timestamps… magic.

The code:

#!/bin/bash
while read line
do
    echo "$(date): ${line}"
done
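
You can also combine this with the exit-code trick from the cronjob section above. The line below is just a sketch; note that a pipeline normally returns the exit status of its last command (annotate.sh), so I wrap it in bash -o pipefail so that a failure of the backup script itself still triggers the alert.

1 0 * * * bash -o pipefail -c '/usr/local/bin/percona_backup_and_restore.sh backup 2>&1 | /usr/local/bin/annotate.sh >> /var/log/mysql/xtrabackup.log' || mail -s 'xtrabackup failed' dodell@workobee.com < /etc/hostname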

DevOps vs. SysAdmin

I spent the first half of my career as a sysadmin or as a director with sysadmins reporting to me.

In my last 2 jobs I started out as devops, not as a sysadmin.

What’s the difference?

I really didn’t know myself.

I heard some annoying opinions that devops uses software to automate things such as configuration management.

This is nonsense; way before the term devops existed I managed > 800 servers and we used software (cfengine) to manage configurations. We also wrote thousands of lines of Perl and Bash to automate other tasks such as backups, file transfers, kickstarting, etc.

So using software isn’t the difference.

The difference is basically that you don’t have to be an expert in data center operations such as:

—  dealing with hardware

—  RAID settings

—  procurement 

—  layer 2 & 3 networking

—  console devices

—  firewalls

—  hardware load balancing

—  PDUs

—  calculating BTUs

—  contract negotiations 

This is a shame, because even though I have mostly been working in the cloud lately, it really helps to know all the underlying guts of the data center, especially when troubleshooting and dealing with less than knowledgeable support staff (Rackspace especially).

What DevOps is required to know is:

—   the services their companies write and deploy

—   deploy procedures

—   config files for the services

—   cloud environments, e.g. Rackspace, AWS, SoftLayer, etc.

—   continuous integration

—  you also need to have a deep knowledge of all the 3rd party daemons running, such as MySQL, MongoDB, Redis, memcached, Apache, nginx, Hadoop, Cassandra, Riak, etc.

Lucky for me, I’ve always been in a position to “know the code”. In fact, one of the main frustrations of working in the sysadmin role is peers who have limited knowledge of or interest in what the infrastructure they support is actually being used for.

I once had a cluster of 200 servers used to edit images and several of my peers had no idea what the purpose of the cluster was. 

Another advantage of “knowing the code” is that you are required to work closely with developers. I love whiteboarding a problem with developers.

I think when you approach a problem with the combined tools of a developer and devops you have all the bases covered and usually end up with a solution that addresses performance, scalability, resiliency, cost, ease of support and simplicity. If a problem is solved by devops or developers alone you rarely cover all the bases.

 

lessons learned while devops@posterous

I’m a proud former employee of Posterous and worked there for the 10 months prior to the Twitter acquisition.
Although it was a short time, we experienced a lot of growth and it was very intense, with a lot of changes in user behavior, product and infrastructure.
Like everything in life, when you look back you can always find better ways to do things.
It’s been about six months since I left, and below are a few of the lessons I’ve learned and will take with me to my next gig.

MySQL:

– Install a GUI tool to help manage the DBs. Yeah, GUIs sound lame, but they put a lot of valuable information together in one place and don’t cost you anything.

– Always have a spare slave that you can use to test out new configs, place in service for a while, iterate.

– Go to every MySQL meetup possible; they are a huge source of information and you will walk away from every one with a new trick or two up your sleeve.

FIGHT SPRAWL:

– Every time you introduce a new technology, make it a requirement that it replace an existing technology in addition to supporting a new feature.
In most cases this isn’t a stretch, as a lot of technologies are similar to each other (Redis/memcached, Riak/MongoDB, etc.).

ADMIT YOUR SHORTCOMINGS, FOCUS ON THE POSITIVE:

Posterous was a Rails shop with an abundance of really skilled developers, which meant we moved at a rapid pace, multiple deployments a day… we were constantly in a state of change.
Working in an environment like that has advantages and disadvantages.
The disadvantage is that change equals risk.
The advantage is that we were able to push changes to the site soon after the code change, which meant that if something went wrong the developer didn’t have to context switch to fix the problem.
We spent a lot of time addressing the negative side of rapid development by attempting to introduce a QA process meant to slow things down and prevent mistakes.
Now that I look back, I think we should have focused on the positive side of rapid change by speeding up the deployment process and pushing code even more often.
In a perfect world you have an automated QA suite that runs in 1 minute and has 100% coverage… or maybe that’s not the perfect world but the Fantasy Island I’m thinking of.

FORCE YOURSELF TO PLAN LONG:

In a rapidly moving start-up with a burn rate, it seems like a waste of time to plan long, but when it comes to infrastructure it’s absolutely necessary.
I can go to most web companies and, when they talk about their infrastructure, there are always one-off services that they consider “legacy”.
Legacy is code for “we rolled it out without thinking more than a month out”.
Once a month, take the key people offsite for a few hours to whiteboard your current infrastructure and what it would look like if you got to design it from scratch… that is your long-term plan. Now when choosing a new technology you already know how it will fit in long term.