simple way to get notified when a cronjob fails

Every place I’ve ever worked had cronjobs running all over the place. Some are simple tasks like clearing out a temp directory. Others end up being a critical piece of the infrastructure that a developer wrote without telling anyone. I like to call this type of scheduled job the glue, as it’s usually holding your company together.

True story: I once found a cronjob named brett.sh running on a cluster of 200 servers that restarted an app every 30 seconds!

In most cases nobody knows where the “glue” cronjob runs, how often it runs, or, most importantly, when it fails. There are a few tools out there that put all of your scheduled jobs in one spot and take action on failure. Those include opswise (http://www.opswise.com/), which I’ve used in the past with a lot of success, and Amazon’s Simple Workflow Service (http://aws.amazon.com/swf/), which I haven’t used yet.

There is also an open source project sponsored by Yelp called Tron, which does most of this already except for notifying you when a job fails. BTW there is already a feature request for this (https://github.com/Yelp/Tron/issues/25).

Anyway, as a quick workaround I just add a check for the exit code in my crontab, which alerts me if the job doesn’t exit zero.

Example:

1 0 * * * touch /home/dodell/foobar || { mail -s 'touch_file failed' dodell@workobee.com < /etc/hostname ; exit 1 ; }
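If several jobs need the same treatment, the check can be pulled into a small helper instead of being repeated in every crontab line. The following is only a sketch: the function names run_or_notify and notify and the ALERT_TO variable are made up for illustration, and it assumes a working mail command and /etc/hostname, as in the example above.

```shell
#!/bin/sh
# Sketch of a reusable "mail me if this fails" helper.
# run_or_notify, notify and ALERT_TO are hypothetical names, not part of cron.

# notify SUBJECT: mail the hostname to ALERT_TO (defaults to root).
notify() {
    mail -s "$1" "${ALERT_TO:-root}" < /etc/hostname
}

# run_or_notify LABEL COMMAND [ARGS...]
# Runs COMMAND; if it exits non-zero, sends a notification that includes
# the label and the exit code, then returns that same exit code.
run_or_notify() {
    label=$1
    shift
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        notify "$label failed (exit $status)"
    fi
    return "$status"
}
```

Saved somewhere like /usr/local/lib/run_or_notify.sh (a hypothetical path), it could be used from a crontab as:

1 0 * * * . /usr/local/lib/run_or_notify.sh && run_or_notify touch_file touch /home/dodell/foobar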

add timestamps to your standard out and standard error

A lot of the time when executing a cronjob or a long-running command I capture standard out and standard error to a log file. This works okay, but without timestamps the log isn’t really useful, especially for a job that runs many times a day: it is difficult to tell which lines belong to which run. What I do now is copy a script to all my systems (using chef of course) which annotates any output I pipe to it. A command line example:

dodell@spork/etc$ cat resolv.conf | /usr/local/bin/annotate.sh
Thu Sep  6 14:39:59 PDT 2012: # Automatically generated, do not edit
Thu Sep  6 14:39:59 PDT 2012: nameserver 173.203.4.8
Thu Sep  6 14:39:59 PDT 2012: nameserver 173.203.4.9

Okay, not a super useful example, but you get my point. This is even more useful when added to a cronjob:

1 0 * * * /usr/local/bin/percona_backup_and_restore.sh backup 2>&1 | /usr/local/bin/annotate.sh >> /var/log/mysql/xtrabackup.log

and the output:

Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
Thu Sep  6 00:01:02 PDT 2012: and Percona Inc 2009-2012.  All Rights Reserved.
Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: This software is published under
Thu Sep  6 00:01:02 PDT 2012: the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: 120906 00:01:02  innobackupex: Starting mysql with options:  --password=xxxxxxxx --user='debian-sys-maint' --unbuffered --
Thu Sep  6 00:01:02 PDT 2012: 120906 00:01:02  innobackupex: Connected to database with mysql child process (pid=19867)
Thu Sep  6 00:01:08 PDT 2012: 120906 00:01:08  innobackupex: Connection to database server closed
Thu Sep  6 00:01:08 PDT 2012: IMPORTANT: Please check that the backup run completes successfully.
Thu Sep  6 00:01:08 PDT 2012: At the end of a successful backup run innobackupex
Thu Sep  6 00:01:08 PDT 2012: prints "completed OK!".

Ah, how beautiful: standard out and error with timestamps… magic.

The code:

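#!/bin/bash
# Prefix every line read from standard in with the current date.
# IFS= and -r keep leading whitespace and backslashes intact.
while IFS= read -r line
do
    echo "$(date): ${line}"
done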
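One thing to watch when combining this with an exit code check in the crontab: once a job is piped through the annotate script, the pipeline's exit status is normally that of the last command in the pipe (the annotator), not the job itself. A minimal sketch of one way around that in bash uses set -o pipefail. The function name run_logged is made up for illustration, and the annotate function below is just a stand-in for /usr/local/bin/annotate.sh so the example is self-contained.

```shell
#!/bin/bash
# With pipefail, a pipeline returns the status of the rightmost command
# that exited non-zero, or zero if every command succeeded.
set -o pipefail

# Stand-in for /usr/local/bin/annotate.sh.
annotate() {
    while IFS= read -r line
    do
        echo "$(date): ${line}"
    done
}

# run_logged LOGFILE COMMAND [ARGS...]   (hypothetical helper)
# Appends COMMAND's annotated standard out and standard error to LOGFILE
# and returns COMMAND's exit status instead of the annotator's.
run_logged() {
    local logfile=$1
    shift
    "$@" 2>&1 | annotate >> "$logfile"
}
```

A caller can then test the return value of run_logged the same way as any other command, for example with || followed by a mail command.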