rackspace: their service blows but they have high uptime

I started using Rackspace managed hosting back in 2005. This was before the “cloud”, so they were all physical hosts.
Since then I’ve used them for a mix of cloud and physical hosting at several different companies.
I have also built data centers from the ground up and managed a lot of services on AWS (Amazon Web Services).
After being a Rackspace customer for the last seven years I have formed the following opinion.

1. Their service blows
If you think the cloud is “scary” and MySQL gives you the willies, then their support might seem like wizards, but in reality they are not.
I found out from an inside source that many of their support team are high school graduates who are run through a boot camp.

Instant wizards.

I have had countless dealings with support where I found zero value in the exchange.
This isn’t true of their whole team, of course. At some point you might get the contact information of someone who can actually help you.
When you do, hold on to it.
A great example of this is their “managed MySQL”. What it amounts to is MySQL installed on a supposedly faster file system and backed up.
Of course it’s tuned, which means innodb_buffer_pool_size is set according to the RAM size.
That’s it.
If you really have a problem with MySQL, there is no way the bootcamp kids are going to be of much use to you.

2. They have high uptime
On both physical and cloud, Rackspace has very good uptime.
Especially in comparison to EC2, and I’m not just talking about the big EC2 outages, I’m talking about the day to day.
It’s pretty common in EC2 to have a server freeze up and need to be rebooted, or have it reboot on its own.
Of course when you decide to move to the cloud this is something you need to plan for, but in the case of Rackspace I can only think of a few times when one of my cloud instances went offline unexpectedly.

simple way to get notified when a cronjob fails

Every place I’ve ever worked had cronjobs running all over the place. Some are simple tasks like clearing out a temp directory. Others end up being a critical piece of the infrastructure that a developer wrote without telling anyone about it. I like to call this type of scheduled job the glue, as it’s usually holding your company together.

True story: I once found a cronjob named brett.sh running on a cluster of 200 servers that restarted an app every 30 seconds!!

In most cases nobody knows where the “glue” cronjob runs, how often it runs and, most importantly, when it fails. There are a few tools out there that put all of your scheduled jobs in one spot and take action on failure. Some of those include opswise (http://www.opswise.com/), which I’ve used in the past and had a lot of success with, and Amazon’s Simple Workflow Service (http://aws.amazon.com/swf/), which I haven’t used yet.

There is also an open source project sponsored by Yelp called tron, which does most of this already except for notifying when a job fails. BTW there is a feature request for this already (https://github.com/Yelp/Tron/issues/25).

Anyway, as a quick workaround I just add a check for the exit code in my crontab which will alert me if the job doesn’t exit zero.

Example:

1 0 * * * touch /home/dodell/foobar || if [ $? -ne 0 ] ; then mail -s 'touch_file failed' dodell@workobee.com < /etc/hostname ; exit 1 ; fi
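
Once more than a couple of jobs need this, the same check can live in a small wrapper instead of being repeated on every crontab line. Below is a minimal sketch, not a finished tool. The script name cron_alert.sh, the function names, and the ALERT_TO and ALERT_CMD variables are all made up for illustration. By default it mails the hostname, the same way the crontab line above does.

```shell
#!/bin/bash
# cron_alert.sh (hypothetical name): run a command and alert on a nonzero exit.
# ALERT_TO and ALERT_CMD are assumptions made for this sketch; ALERT_CMD lets
# you swap the mail call for something else (a pager, a chat hook, a test stub).

ALERT_TO="${ALERT_TO:-root}"

send_alert() {
    # $1 = job name, $2 = exit code of the failed job
    if [ -n "${ALERT_CMD:-}" ]; then
        "$ALERT_CMD" "$1" "$2"
    else
        hostname | mail -s "$1 failed (exit $2)" "$ALERT_TO"
    fi
}

run_and_alert() {
    # $1 = job name used in the alert, the rest = the command to run
    local name="$1"
    shift
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        send_alert "$name" "$rc"
    fi
    # Pass the job's own exit code back to the caller
    return "$rc"
}

# When called with arguments, treat them as "job name" plus the command.
if [ "$#" -gt 0 ]; then
    run_and_alert "$@"
fi
```

Assuming it is installed as /usr/local/bin/cron_alert.sh, the crontab entry becomes:

1 0 * * * /usr/local/bin/cron_alert.sh touch_file touch /home/dodell/foobar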

add timestamps to your standard out and standard error

A lot of the time when executing a cronjob or a long running command I capture standard out and standard error to a log file. This works okay, but without time stamps it isn’t really useful, especially for a job that runs many times a day, where it is difficult to tell which lines in the log belong to which run. What I do now is copy a script to all my systems (using chef of course) which will annotate any output I pipe to it. A command line example:

dodell@spork/etc$ cat resolv.conf | /usr/local/bin/annotate.sh
Thu Sep  6 14:39:59 PDT 2012: # Automatically generated, do not edit
Thu Sep  6 14:39:59 PDT 2012: nameserver 173.203.4.8
Thu Sep  6 14:39:59 PDT 2012: nameserver 173.203.4.9

Okay, not a super useful example, but you get my point. This is even more useful when added to a cronjob:

1 0 * * * /usr/local/bin/percona_backup_and_restore.sh backup 2>&1| /usr/local/bin/annotate.sh  >> /var/log/mysql/xtrabackup.log

and the output:

Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
Thu Sep  6 00:01:02 PDT 2012: and Percona Inc 2009-2012.  All Rights Reserved.
Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: This software is published under
Thu Sep  6 00:01:02 PDT 2012: the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: 120906 00:01:02  innobackupex: Starting mysql with options:  --password=xxxxxxxx --user='debian-sys-maint' --unbuffered --
Thu Sep  6 00:01:02 PDT 2012: 120906 00:01:02  innobackupex: Connected to database with mysql child process (pid=19867)
Thu Sep  6 00:01:08 PDT 2012: 120906 00:01:08  innobackupex: Connection to database server closed
Thu Sep  6 00:01:08 PDT 2012: IMPORTANT: Please check that the backup run completes successfully.
Thu Sep  6 00:01:08 PDT 2012: At the end of a successful backup run innobackupex
Thu Sep  6 00:01:08 PDT 2012: prints "completed OK!".

Ah, how beautiful standard out and error with time stamps…….magic.

The code:

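#!/bin/bash
# Prefix every line read from standard input with the current date.
# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line
do
    echo "$(date): ${line}"
done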
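
One caveat if you combine this with an exit code check like the one in the cronjob post: a pipeline normally reports the exit status of its last command, so piping through annotate.sh hides a failed job. The sketch below is one way around that, assuming bash and its PIPESTATUS array. The annotate function is an inline stand-in for annotate.sh, and run_annotated is a hypothetical helper name.

```shell
#!/bin/bash
# Inline stand-in for /usr/local/bin/annotate.sh.
annotate() {
    while IFS= read -r line; do
        echo "$(date): ${line}"
    done
}

# run_annotated (hypothetical helper): run a command, timestamp its standard
# out and standard error, and return the command's own exit status rather
# than the status of the annotate step.
run_annotated() {
    "$@" 2>&1 | annotate
    # PIPESTATUS[0] holds the exit status of the first command in the pipeline
    return "${PIPESTATUS[0]}"
}
```

Cron usually runs commands with /bin/sh, which may not have PIPESTATUS, so this belongs in a bash script that the crontab calls rather than on the crontab line itself.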

Knowing when it’s time to leave

I spent a brief amount of time at a start-up where the culture sucked.

Basically unless you were there from day one your opinion didn’t matter.

The reason for this rotten culture was too many people in key positions hated their jobs.

They had been there too long and were miserable, but weren’t mature enough to leave or had other reasons for staying.

If you work in tech, live in the Bay Area and don’t like your job, the solution is easy.

Leave.

Most likely you will find a much better gig and end up in a culture where your opinion matters.

In the end we just want to help our company be successful by a combination of the skills that we have plus the lessons that we learn along the way.

Life is way too short to spend your time, energy and brain on a company that doesn’t make you happy.

DevOps vs. SysAdmin

I spent the first half of my career as a sysadmin or a director with sysadmins reporting to me.

In my last two jobs I started out as devops and not sysadmin.

What’s the difference?

I really didn’t know myself.

I heard some annoying opinions that devops uses software to automate things such as configuration management.

This is nonsense. Way before the term devops existed I managed > 800 servers and we used software, i.e. cfengine, to manage configurations. We also wrote thousands of lines of perl and bash to automate other tasks such as backups, file transfers, kick starting etc.

So using software isn’t the difference.

The difference is basically that you don’t have to be an expert in data center operations such as:

—  dealing with hardware

—  raid settings

—  procurement 

—  layer 2 & 3 networking

—  console devices

—  firewalls

—  hardware load balancing

—  PDUs

—  calculating BTUs

—  contract negotiations 

This is a shame, because even though I have mostly been working in the cloud lately, it really helps me to know all the underlying guts of the data center, especially when troubleshooting and dealing with less than knowledgeable support staff (Rackspace especially).

What DevOps is required to know is:

—   the services their companies write and deploy

—   deploy procedures

—   config files for the services

 —  cloud environments, i.e. Rackspace, AWS, SoftLayer etc.

 —  continuous integration

—  you also need to have a deep knowledge of all the 3rd party daemons running such as MySQL, mongodb, redis, memcached, apache, nginx, hadoop, cassandra, riak, etc…

Lucky for me I’ve always been in a position to “know the code”. In fact one of the main frustrations of working in the sysadmin role is peers who have limited knowledge of, or interest in, what the infrastructure they support is actually being used for.

I once had a cluster of 200 servers used to edit images and several of my peers had no idea what the purpose of the cluster was. 

Another advantage to “knowing the code” is that you are required to work closely with developers. I love white boarding a problem with developers.

I think when you approach a problem with the combined tools of a developer and devops you have all the bases covered and usually end up with a solution that addresses performance, scalability, resiliency, cost, ease of support and simplicity. If a problem is solved by devops or developers alone, you rarely cover all the bases.

ruby script to calculate primes

I have a friend who is really into Pi. In fact he claims to have the first 200 digits memorized. Me, I think primes are way cooler. First of all there are a lot of them, and they occur more often than you think.

For instance 2,000,000,000,003 is a prime number!

Anyway, below is a script I wrote which can be used to determine if a number is a prime, or you can just run it and it will start printing out all of the primes starting with the number 2.

ruby primes.rb -h
Usage: primes.rb [ -c ] or [ -o integer]
-c                               calculate primes
-o, --one_integer integer        check if one integer is a prime
-h                               Display this screen
example: primes -c calculate all primes starting with the number 2
example: primes -o check if a given integer is a prime

The code:

require 'optparse'

###########
# methods #
###########

# Return the divisors of start between 2 and its square root.
# An empty list means start (2 or larger) is a prime.
def primes(start)
  out = []
  candidate = 2
  # Integer math only: float division loses precision on large numbers.
  while candidate * candidate <= start
    out.push(candidate) if (start % candidate).zero?
    candidate += 1
  end
  out
end

##################
# define options #
##################

options = { calculate_primes: false, one_integer: nil }
optparse = OptionParser.new do |opts|
  opts.banner = "Usage: primes.rb [ -c ] or [ -o integer]"
  opts.on('-c', 'calculate primes') do
    options[:calculate_primes] = true
  end
  # Integer makes optparse convert the argument, so primes() gets a number.
  opts.on('-o integer', '--one_integer integer', Integer,
          'check if one integer is a prime') do |check_this|
    options[:one_integer] = check_this
  end
  opts.on('-h', 'Display this screen') do
    puts opts
    puts "example: primes -c calculate all primes starting with the number 2"
    puts "example: primes -o check if a given integer is a prime"
    exit
  end
end

if ARGV.empty?
  puts optparse.banner
  exit
end
optparse.parse!

if options[:calculate_primes]
  # Print primes forever, starting with 2 (stop with Ctrl-C).
  start = 2
  loop do
    puts start if primes(start).empty?
    start += 1
  end
end

if options[:one_integer]
  check_this = options[:one_integer]
  # Anything below 2 is not a prime by definition.
  if check_this >= 2 && primes(check_this).empty?
    puts "#{check_this} is a prime number"
  else
    puts "#{check_this} is not a prime number"
  end
end

Update: Turns out Ruby has a built-in library that does this in a couple of lines.

# mathn, which this snippet originally required, loads this same prime library
require 'prime'
Prime.each { |prime| puts prime }

Thank you mathn!

lessons learned while devops@posterous

I’m a proud former employee of Posterous and worked there for the 10 months prior to the Twitter acquisition.
Although it was a short time, we experienced a lot of growth and it was very intense, with a lot of changes in user behavior, product and infrastructure.
Like everything in life, when you look back you can always find better ways to do things.
It’s been about six months since I left, and below are a few of the lessons that I’ve learned and will take with me to my next gig.

MySQL:

– Install a GUI tool to help manage the DBs. Yeah, GUIs sound lame, but they put a lot of valuable information together in one place and don’t cost you anything.

– Always have a spare slave that you can use to test out new configs: place it in service for a while, then iterate.

– Go to every MySQL meetup possible. They are a huge source of information and you will walk away from every one with one or more new tricks up your sleeve.

FIGHT SPRAWL:

– Every time you introduce a new technology make it a requirement to have it replace an existing technology in addition to supporting a new feature.
In most cases this isn’t a stretch, as a lot of technologies are similar to each other (redis/memcached), (riak/mongodb) etc.

ADMIT YOUR SHORT COMINGS, FOCUS ON THE POSITIVE:

Posterous was a Rails shop with an abundance of really skilled developers, which meant we moved at a rapid pace with multiple deployments a day. We were constantly in a state of change.
Working in an environment like that has advantages and disadvantages.
The disadvantage is that change equals risk.
The advantage is that we were able to push changes to the site soon after the code change, which means if something went wrong the developer didn’t have to context switch to fix the problem.
We spent a lot of time addressing the negative side of rapid development by attempting to introduce a QA process meant to slow things down and prevent mistakes.
Now that I look back, I think we should have focused on the positive side of rapid change by speeding up the deployment process and pushing code even more often.
In a perfect world you have an automated QA test that runs in 1 minute and has 100% coverage....or maybe that’s not the perfect world but fantasy island I’m thinking of.

FORCE YOURSELF TO PLAN LONG:

In a rapidly moving start-up with a burn rate it seems like a waste of time to plan long, but when it comes to infrastructure it’s absolutely necessary.
I can go to most web companies and when they talk about their infrastructure there are always one-off services that they consider “legacy”.
Legacy is code word for “we rolled it out without thinking more than a month out”.
Once a month take the key people offsite for a few hours to whiteboard your current infrastructure and what it would look like if you got to design it from scratch....that is your long term plan. Now when choosing a new technology you already know how it will fit in long term.