rackspace: their service blows but they have high uptime

I started using rackspace managed hosting back in 2005. This was before the “cloud”, so they were all physical hosts.
Since then I’ve used them for a mix of cloud and physical hosting at several different companies.
I have also built data centers from the ground up and managed a lot of services on AWS (Amazon Web Services).
After being a rackspace customer for the last seven years I have formed the following opinion.

1. Their service blows
If you think the cloud is “scary” and MySQL gives you the willies, then their support might seem like wizards, but in reality they are not.
I found out from an inside source that many of their support team are high school graduates who are run through a boot camp.

Instant wizards.

I have had countless dealings with support where I found zero value in the exchange.
This isn’t true of their whole team of course; at some point you might get the contact information of someone who can actually help you.
When you do, hold on to it.
A great example of this is their “managed MySQL”. What it amounts to is MySQL installed on a supposedly faster file system and backed up.
Of course it’s “tuned”, which means innodb_buffer_pool_size is adjusted for the amount of RAM.
That’s it.
If you really have a problem with MySQL, no way are the bootcamp kids going to be of much use to you.
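To put that in perspective, that level of “tuning” is a one-line change. A minimal sketch of what it amounts to in my.cnf, with made-up numbers for a dedicated 2GB box (the ratio is a common rule of thumb, not rackspace’s actual template):

# /etc/mysql/my.cnf (hypothetical values for a dedicated 2GB host)
[mysqld]
# rule of thumb: give InnoDB roughly 70% of RAM on a dedicated database server
innodb_buffer_pool_size = 1400M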

2. They have high uptime
For both physical and cloud, rackspace has very good uptime.
Especially in comparison to EC2, and I’m not just talking about the big EC2 outages, I’m talking about the day to day.
It’s pretty common in EC2 to have a server freeze up and need to be rebooted, or have it reboot on its own.
Of course when you decide to move to the cloud this is something you need to plan for, but in the case of rackspace I can only think of a few times when one of my cloud instances ever went offline unexpectedly.

simple way to get notified when a cronjob fails

Every place I’ve ever worked had cronjobs running all over the place. Some are simple tasks like clearing out a temp directory. Others end up being a critical piece of the infrastructure that a developer wrote without telling anyone about it. I like to call this type of scheduled job “the glue”, as it’s usually holding your company together.

True story: I once found a cronjob running on a cluster of 200 servers named brett.sh that restarted an app every 30 seconds!!

In most cases nobody knows where the “glue” cronjob runs, how often it runs, and most importantly when it fails. There are a few tools out there that put all of your scheduled jobs in one spot and take action on failure. Some of those include Opswise (http://www.opswise.com/), which I’ve used in the past and had a lot of success with, and Amazon’s Simple Workflow Service (http://aws.amazon.com/swf/), which I haven’t used yet.

There is also an open source project sponsored by Yelp called Tron which does most of this already, except for notifying when a job fails. BTW, there is a feature request for this already (https://github.com/Yelp/Tron/issues/25).

Anyway, as a quick workaround I just add a check for the exit code in my crontab, which will alert me if the job doesn’t exit zero.

Example:

1 0 * * * touch /home/dodell/foobar || if [ $? -ne 0 ] ; then mail -s 'touch_file failed' dodell@workobee.com < /etc/hostname ; exit 1 ; fi
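If you have more than a couple of jobs, pasting that snippet into every crontab line gets old. A sketch of a reusable wrapper instead (the script name, paths, and recipient here are my assumptions for illustration, not something from my actual crontab):

#!/bin/bash
# cronwrap.sh -- run a command and send mail if it exits non-zero
# crontab usage: 1 0 * * * /usr/local/bin/cronwrap.sh touch /home/dodell/foobar
RECIPIENT="dodell@workobee.com"

"$@" 2>&1 | logger -t cronwrap   # keep the job's output in syslog
status=${PIPESTATUS[0]}          # exit code of the job itself, not of logger

if [ "$status" -ne 0 ]; then
    echo "host: $(hostname), command: $*, exit code: $status" \
        | mail -s "cronjob failed: $1" "$RECIPIENT"
    exit "$status"
fi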

add timestamps to your standard out and standard error

A lot of the time when executing a cronjob or a long running command I capture the standard out and standard error to a log file. This works okay, but without time stamps it isn’t really useful, especially for a job that runs many times a day, which makes it difficult to tell which lines in the log match which run. What I do now is copy a script to all my systems (using chef of course) which will annotate any output I pipe to it. A command line example:

dodell@spork/etc$ cat resolv.conf | /usr/local/bin/annotate.sh
Thu Sep  6 14:39:59 PDT 2012: # Automatically generated, do not edit
Thu Sep  6 14:39:59 PDT 2012: nameserver 173.203.4.8
Thu Sep  6 14:39:59 PDT 2012: nameserver 173.203.4.9

Okay, not a super useful example, but you get my point. This is even more useful when added to a cronjob:

1 0 * * * /usr/local/bin/percona_backup_and_restore.sh backup 2>&1| /usr/local/bin/annotate.sh  >> /var/log/mysql/xtrabackup.log

and the output:

Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
Thu Sep  6 00:01:02 PDT 2012: and Percona Inc 2009-2012.  All Rights Reserved.
Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: This software is published under
Thu Sep  6 00:01:02 PDT 2012: the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
Thu Sep  6 00:01:02 PDT 2012:
Thu Sep  6 00:01:02 PDT 2012: 120906 00:01:02  innobackupex: Starting mysql with options:  --password=xxxxxxxx --user='debian-sys-maint' --unbuffered --
Thu Sep  6 00:01:02 PDT 2012: 120906 00:01:02  innobackupex: Connected to database with mysql child process (pid=19867)
Thu Sep  6 00:01:08 PDT 2012: 120906 00:01:08  innobackupex: Connection to database server closed
Thu Sep  6 00:01:08 PDT 2012: IMPORTANT: Please check that the backup run completes successfully.
Thu Sep  6 00:01:08 PDT 2012: At the end of a successful backup run innobackupex
Thu Sep  6 00:01:08 PDT 2012: prints "completed OK!".

Ah, how beautiful standard out and error with time stamps…….magic.

The code:

#!/bin/bash
# annotate.sh -- prepend a timestamp to every line read from stdin
while IFS= read -r line
do
    echo "$(date): ${line}"
done

Knowing when it’s time to leave

I spent a brief amount of time at a start-up where the culture sucked.

Basically unless you were there from day one your opinion didn’t matter.

The reason for this rotten culture was that too many people in key positions hated their jobs.

They had been there too long, were miserable, but weren’t mature enough or had other reasons not to leave.

If you work in tech and live in the bay area and don’t like your job the solution is easy.

Leave.

Most likely you will find a much better gig and end up in a culture where your opinion matters.

In the end we just want to help our company be successful through a combination of the skills we have and the lessons we learn along the way.

Life is way too short to spend your time, energy and brain on a company that doesn’t make you happy.


DevOps vs. SysAdmin

I spent the first half of my career as a sysadmin or a director with sysadmins reporting to me.

At my last 2 jobs I started out as devops, not sysadmin.

What’s the difference?

I really didn’t know myself.

I heard some annoying opinions that what makes devops different is using software to automate things such as configuration management.

This is nonsense. Way before the term devops existed I managed > 800 servers, and we used software (i.e. cfengine) to manage configurations. We also wrote thousands of lines of perl and bash to automate other tasks such as backups, file transfers, kickstarting, etc.

So using software isn’t the difference.

The difference is basically that you don’t have to be an expert in data center operations such as:

—  dealing with hardware

—  RAID settings

—  procurement 

—  layer 2 & 3 networking

—  console devices

—  firewalls

—  hardware load balancing

—  PDUs

—  calculating BTUs

—  contract negotiations 

This is a shame, because even though I have mostly been working in the cloud lately it really helps to know all the underlying guts of the data center, especially when troubleshooting or dealing with less than knowledgeable support staff (rackspace especially).

What DevOps is required to know is:

—   the services their companies write and deploy

—   deploy procedures

—   config files for the services

—   cloud environments, i.e. rackspace, aws, softlayer, etc.

—   continuous integration

—  you also need to have a deep knowledge of all the 3rd party daemons running such as MySQL, mongodb, redis, memcached, apache, nginx, hadoop, cassandra, riak, etc…

Lucky for me, I’ve always been in a position to “know the code”. In fact, one of the main frustrations of working in the sysadmin role is peers who have limited knowledge of, or interest in, what the infrastructure they support is actually being used for.

I once had a cluster of 200 servers used to edit images and several of my peers had no idea what the purpose of the cluster was. 

Another advantage to “knowing the code” is that you are required to work closely with developers. I love white boarding a problem with developers.

I think when you approach a problem with the combined tools of a developer and devops you have all the bases covered, and you usually end up with a solution that addresses performance, scalability, resiliency, cost, ease of support and simplicity. If a problem is solved by devops or developers alone you rarely cover all the bases.


ruby script to calculate primes

I have a friend who is really into Pi. In fact he claims to have the first 200 digits memorized. Me, I think primes are way cooler; first of all there are a lot of them, and they occur more often than you think.

For instance 2,000,000,000,003 is a prime number!

Anyway, below is a script I wrote which can be used to determine if a number is prime, or you can just run it and it will start printing out all of the primes starting with the number 2.

ruby primes.rb -h
Usage: primes.rb [ -c ] or [ -o integer]
    -c                               calculate primes
    -o, --one_integer integer        check if one integer is a prime
    -h                               Display this screen
example: primes -c calculate all primes starting with the number 2
example: primes -o check if a given integer is a prime

The code:

require 'optparse'
require 'rubygems'

###########
# methods #
###########
# divide start by every integer up to its square root;
# returns an array of hits, so an empty array means prime
def primes(start)
  foo = 2
  out = Array.new
  root = Math.sqrt(start)
  div = root.to_i
  while foo <= div
    foo = foo + 1
    ans = start.to_f / foo.to_f
    nu, de = ans.to_s.split('.')
    # a fractional part of "0" means foo divides start evenly
    if de == "0"
      out.push(de)
    end
  end
  return out
end

##################
# define options #
##################
options = {}
optparse = OptionParser.new do |opts|
  opts.banner = "Usage: primes.rb [ -c ] or [ -o integer]"
  options[:calculate_primes] = false
  opts.on( '-c', 'calculate primes' ) do
    options[:calculate_primes] = true
  end
  options[:one_integer] = nil
  opts.on( '-o integer', '--one_integer integer', "check if one integer is a prime" ) do |check_this|
    options[:one_integer] = check_this
  end
  opts.on( '-h', 'Display this screen' ) do
    puts opts
    puts "example: primes -c calculate all primes starting with the number 2"
    puts "example: primes -o check if a given integer is a prime"
    exit
  end
  optlength = ARGV.length
  if optlength < 1
    puts opts.banner
    exit
  end
end
optparse.parse!

if options[:calculate_primes]
  start = 2
  while start > 1
    start = start + 1
    if start.odd?
      out = primes(start)
      count = out.count
      if count == 0
        puts start.to_s
      end
    end
  end
end

if options[:one_integer]
  check_this = "#{options[:one_integer]}"
  if check_this.to_i.even?
    puts check_this.to_s + " is not a prime number"
    exit
  end
  out = primes(check_this.to_i) # to_i so Math.sqrt gets a number, not a string
  count = out.count
  if count == 0
    puts check_this.to_s + " is a prime number"
  else
    puts check_this.to_s + " is not a prime number"
  end
end

Update: Turns out Ruby has a built in library that can do this in 3 lines.

require 'mathn'
list_primes = Prime.new
puts list_primes.each { |prime| print prime.to_s + "\n"; break unless prime > 1 }

Thank you mathn!

lessons learned while devops@posterous

I’m a proud former employee of Posterous and worked there for the 10 months prior to the twitter acquisition.
Although it was a short time, we experienced a lot of growth and it was very intense, with a lot of changes in user behavior, product and infrastructure.
Like everything in life, when you look back you can always find better ways to do things.
It’s been about six months since I left, and below are a few of the lessons that I’ve learned and will take with me to my next gig.

MySQL:

– Install a GUI tool to help manage the DBs. Yeah, GUIs sound lame, but they put a lot of valuable information together in one place and don’t cost you anything.

– Always have a spare slave that you can use to test out new configs, place in service for a while, iterate.

– Go to every MySQL meetup possible; they are a huge source of information and you will walk away from every one with one or more new tricks up your sleeve.

FIGHT SPRAWL:

– Every time you introduce a new technology, make it a requirement that it replace an existing technology in addition to supporting a new feature.
In most cases this isn’t a stretch, as a lot of technologies are similar to each other (redis/memcached, riak/mongodb, etc.).

ADMIT YOUR SHORT COMINGS, FOCUS ON THE POSITIVE:

Posterous was a rails shop with an abundance of really skilled developers, which meant we moved at a rapid pace, multiple deployments a day….we were constantly in a state of change.
Working in an environment like that has advantages and disadvantages.
The disadvantage is that change equals risk.
The advantage is that we were able to push changes to the site soon after the code change, which meant if something went wrong the developer didn’t have to context switch to fix the problem.
We spent a lot of time addressing the negative side of rapid development by attempting to introduce a QA process meant to slow things down and prevent mistakes.
Now that I look back, I think we should have focused on the positive side of rapid change by speeding up the deployment process and pushing code even more often.
In a perfect world you have an automated QA suite that runs in 1 minute and has 100% coverage….or maybe that’s not the perfect world but fantasy island I’m thinking of.

FORCE YOURSELF TO PLAN LONG:

In a rapidly moving start-up with a burn rate, planning long seems like a waste of time, but when it comes to infrastructure it’s absolutely necessary.
I can go to most web companies, and when they talk about their infrastructure there are always one-off services that they consider “legacy”.
Legacy is code for “we rolled it out without thinking more than a month out”.
Once a month, take the key people offsite for a few hours to whiteboard your current infrastructure and what it would look like if you got to design it from scratch….that is your long term plan. Now when choosing a new technology you already know how it will fit in long term.

MySQL replication and failover with tungsten-replicator

Having administered MySQL in production for the last 7 years, I know how painful it is when a master goes down or when replication breaks. It’s also annoying when you rebuild a slave and it takes a really long time for replication to catch up. I decided to give tungsten-replicator a look as it claims to be high performance and handles promoting a slave to master.

My setup:

—Two rackspace cloud servers running Ubuntu 12.04

—2GB RAM

—4 cores

—Percona server 5.5

—tungsten-replicator 2.0.5 (free version)

Installation

It took me a while (~4 hours) to satisfy all of the prerequisites for the OS and MySQL. This wasn’t helped by the mostly horrible documentation. All of the information was there, but it was laid out all over the place and took several attempts to find all the pieces. It would have been a huge help to have a script you could run to check your prerequisites such as ruby version, java version, etc., something like the sketch below. Once I satisfied all the prerequisites it was time to move on to the config.
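Something this dumb would have saved me hours. A rough sketch (the specific versions tungsten requires are in the docs somewhere, so treat the checks themselves as placeholders):

#!/bin/bash
# check-prereqs.sh -- rough prerequisite check for tungsten-replicator
# which versions are actually required is an assumption; verify against the docs
fail=0

if ! command -v ruby >/dev/null; then
    echo "MISSING: ruby"; fail=1
else
    echo "ruby: $(ruby --version)"
fi

if ! command -v java >/dev/null; then
    echo "MISSING: java"; fail=1
else
    echo "java: $(java -version 2>&1 | head -1)"
fi

if ! command -v mysql >/dev/null; then
    echo "MISSING: mysql client"; fail=1
fi

exit $fail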

Configuration

The good news here is that tungsten-replicator comes with a scripted installation designed to install and configure all of your nodes with one command. The bad news, once again, is that the documentation is scattered all over the place and really incomplete. Another really annoying problem is getting to the help output of the commands: some require --help, some help, and others have nothing at all. In the open source version there are 2 types of configuration:

1. master-slave
2. direct, which supports multi-master

master-slave

To get master-slave up I downloaded the release:

# quote the URL so the shell doesn't treat & as a background operator
wget 'http://code.google.com/p/tungsten-replicator/downloads/detail?name=tungsten-replicator-2.0.5.tar.gz&can=2&q='

untarred it:

tar xzvf tungsten-replicator-2.0.5.tar.gz

cd’d into tungsten-replicator-2.0.5 and ran the following command:

./tools/tungsten-installer -v \
  --master-slave \
  --master-host=198.201.208.173 \
  --datasource-user=tungsten \
  --datasource-password=tungsten \
  --datasource-mysql-conf=/etc/mysql/my.cnf \
  --service-name=gabriel \
  --home-directory=/opt/tungsten \
  --cluster-hosts=198.201.208.173,198.201.206.221 \
  --start-and-report

This should install tungsten-replicator to /opt/tungsten, start the service on both nodes, and report back with its status. To check, create a database on the master; it should show up on the slave.
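A quick smoke test, assuming the tungsten user from the installer flags is allowed to create databases (the database name is a throwaway):

# on the master
mysql -u tungsten -ptungsten -e "CREATE DATABASE replication_canary"

# on the slave, a moment later
mysql -u tungsten -ptungsten -e "SHOW DATABASES LIKE 'replication_canary'"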

Replication Performance

Now that I had my master-slave up and running, I decided to test the replication speed. I took a 40MB dump file and loaded it onto the master, which took 6 seconds. What surprised me was how long it took to complete the load on the slave: 43 seconds, and that’s 43 seconds after the master load completed. To eliminate the network as a bottleneck I scp’d the file from master to slave, which took 6 seconds. My guess is that the process on the master that reformats the statements with global IDs is really inefficient.
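For anyone who wants to reproduce the test, this is roughly all it takes (the file and database names are placeholders):

# on the master: time the load itself
time mysql -u tungsten -ptungsten test_db < dump.sql

# on the slave: watch the applied sequence number and latency
# until it catches up with the master
watch -n 1 "/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl status | egrep 'appliedLastSeqno|appliedLatency'"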

failover

I still haven’t figured out how to automatically fail over (good luck finding it in the docs), but I did figure out how to manually promote the slave to master. The first thing you need to do is take replication offline; that doesn’t mean stopping the daemon, just pausing the replication process:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl -service gabriel offline

This needs to be run on both hosts.

To verify that both hosts have replication offline run the following:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl status

Now you’re ready. In my environment 198.201.208.173 is the current master and 198.201.206.221 the slave (you can see this in the output of the status command). The first step is to change the master to a slave:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl -service gabriel setrole  -role slave -uri thl://198.201.206.221/

Run that on the current master.

Then turn the old slave into a master:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl -service gabriel setrole -role master -uri thl://198.201.206.221/

Run this on the current slave.

Now check your work with the status command. When it looks good, run this command on both hosts:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl -service gabriel online

Again check your work with the status command.
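
Since the whole manual promotion boils down to a handful of commands, it’s easy to script. A sketch, assuming the same two hosts and service name as above, mirroring the exact commands I ran (and with none of the sanity checks you’d want in real life):

#!/bin/bash
# promote.sh -- manual tungsten failover sketch for the setup above
# assumes root ssh keys are in place (they are, from the host setup)
OLD_MASTER=198.201.208.173
NEW_MASTER=198.201.206.221
TREPCTL=/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl
SERVICE=gabriel

# pause replication on both hosts
for H in $OLD_MASTER $NEW_MASTER; do
    ssh root@$H "$TREPCTL -service $SERVICE offline"
done

# swap the roles
ssh root@$OLD_MASTER "$TREPCTL -service $SERVICE setrole -role slave -uri thl://$NEW_MASTER/"
ssh root@$NEW_MASTER "$TREPCTL -service $SERVICE setrole -role master -uri thl://$NEW_MASTER/"

# bring replication back online and check the result
for H in $OLD_MASTER $NEW_MASTER; do
    ssh root@$H "$TREPCTL -service $SERVICE online"
    ssh root@$H "$TREPCTL status"
done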

protip

Running commands on both hosts is easy, since the host setup process copies over ssh keys and authorized_keys files. I just used a bash for loop:

for H in 198.201.208.173 198.201.206.221;  do ssh root@$H "/root/tungsten-replicator-2.0.5/tungsten-replicator/bin/trepctl status"; done

conclusion

Given the replication lag, poor documentation, long setup, and the fact that it’s really early, I consider tungsten-replicator to have a lot of potential, but mostly a science experiment for now. Of course, this is with the free version. Like I said in the beginning, I know how much it sucks when a master fails or when replication breaks. Tungsten-replicator has all of the features needed to solve these problems, it just doesn’t give me enough confidence to run it in my production environment…..yet.

route requests by country code using nginx and the geoip module

I have a client that wanted to limit traffic to their site to only come from the US.

Right away I told them we could use a module in nginx that uses the maxmind geoip library.

It turns out I was right: nginx does have a geoip module!

Lucky for me.

Anyway it turns out that using the module is really simple.

After installing it I simply told nginx to route all requests from the US to unicorn and the rest get redirected. 

# send US traffic to unicorn
if ($geoip_country_code = US) {
    proxy_pass http://unicorn;
    break;
}

# all other traffic gets redirected
rewrite ^(.*) http://foosite.com$1 permanent;
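
For context, that snippet lives inside a location block, and the geoip database itself gets loaded at the http level. A sketch of the surrounding config; the database path and upstream definition are assumptions based on common defaults, not my client’s actual config:

http {
    # load maxmind's country database (the path varies by distro/install)
    geoip_country /usr/share/GeoIP/GeoIP.dat;

    upstream unicorn {
        server unix:/tmp/unicorn.sock fail_timeout=0;
    }

    server {
        listen 80;

        location / {
            # send US traffic to unicorn
            if ($geoip_country_code = US) {
                proxy_pass http://unicorn;
                break;
            }
            # all other traffic gets redirected
            rewrite ^(.*) http://foosite.com$1 permanent;
        }
    }
}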

BTW, yes hacksors, I know there are a lot of ways of getting around this, but it should be effective for most requests.


dynamic preseed file for ubuntu using sinatra

To build physical Ubuntu servers we use Ubuntu preseed.

This works great, but if you use a static preseed file you end up building a host that doesn’t have its hostname or static IP address set. That means you have to set them manually afterward, so we decided to automate it.

BTW, it took us a while to figure out how to set a static IP in a preseed file. We blogged about it here: network-preseeding-debianubuntu-with-a-static

To do this we wrote a small sinatra app that dynamically generates the preseed file with the hostname and static ip address.

This is done by looking up the MAC address of the requesting host in the arp table and comparing it to a pipe delimited file that contains the MAC address, what the static IP should be, and the hostname.

The list is stored in a file named ip2mac.txt and was populated by a script.

The ip2mac.txt file looks like this:

172.28.0.71|a4:ba:db:35:e6:09|chi-devops11a
172.28.0.72|78:2b:cb:03:c5:44|chi-devops11b

Instead of calling a static preseed file from the pxelinux.cfg/default file, we make a request to the sinatra app, which generates the preseed file dynamically. The line in the default file we use looks like this:

append console=tty0 console=ttyS1,115200n8 initrd=ubuntu-10.04-server-amd64-initrd.gz auto=true priority=critical preseed/url=http://172.27.0.115:4567/lucid-preseed-noraid interface=eth0 netcfg/dhcp_timeout=60 console-setup/ask_detect=false console-setup/layoutcode=us console-keymaps-at/keymap=us locale=en_US --

When the request is made the sinatra app does the following:

1. looks up the MAC address of the request in the arp table
2. compares the MAC address to the matching line in ip2mac.txt
3. uses the IP and hostname to populate the hostname and IP variables in the preseed file
4. returns the preseed file to the host making the request

The code:

require 'rubygems' # skip this line in Ruby 1.9
require 'sinatra'
require 'erb'
require 'logger'

def log(message)
  flog = Logger.new('foo.log')
  flog.info(message)
end

# find the static ip that belongs to a mac address in ip2mac.txt
def lookup_mac(mac)
  rr = Array.new
  hostfile = File.open("ip2mac.txt", "r")
  hostfile.readline # skip the first line
  hostfile.each do |line|
    list_ip, list_mac, name = line.split('|')
    if mac.match(list_mac)
      rr.push(list_ip)
    end
  end
  return rr[0]
end

# pull the requester's mac address out of the local arp table
def get_mac_address()
  ip = @env['REMOTE_ADDR']
  cmd = "arp -n " + ip.chomp + " | grep -v Address | awk '{print $3}'"
  mac = `#{cmd}`
  return mac
end

# reverse lookup the fqdn for an ip, trimming the trailing dot and newline
def rev_lookup(ip)
  cmd = "host " + ip + " | awk '{print $5}'"
  hostname = `#{cmd}`
  fqdn = hostname.chop.chop
  return fqdn
end

get '/lucid-preseed-noraid' do
  mac = get_mac_address()
  log(mac)
  ips = lookup_mac(mac)
  log(ips)
  fqdns = rev_lookup(ips)
  @ip = ips
  @fqdn = fqdns
  log(fqdns)
  erb :lucid_preseed_noraid
end

get '/lucid-preseed-nosrv' do
  mac = get_mac_address()
  log(mac)
  ips = lookup_mac(mac)
  log(ips)
  fqdns = rev_lookup(ips)
  @ip = ips
  @fqdn = fqdns
  log(fqdns)
  erb :lucid_preseed_nosrv
end

get '/' do
  "ops11"
end
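The ERB view itself is just a normal preseed file with the hostname and IP swapped out for variables. A sketch of the relevant lines of views/lucid_preseed_noraid.erb; the netcfg keys are standard debian-installer preseed settings, but the values below and the rest of the file (partitioning, mirrors, etc.) are assumptions or omitted:

# views/lucid_preseed_noraid.erb (excerpt)
d-i netcfg/disable_dhcp boolean true
d-i netcfg/get_ipaddress string <%= @ip %>
d-i netcfg/get_hostname string <%= @fqdn %>
# the netmask/gateway values below are site-specific assumptions
d-i netcfg/get_netmask string 255.255.255.0
d-i netcfg/get_gateway string 172.28.0.1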

To start the sinatra app just run the following:

ruby preseeder.rb