Knowing when its time to leave

I spent a brief amount of time at a start-up where the culture sucked.

Basically unless you were there from day one your opinion didn’t matter.

The reason for this rotten culture was too many people in key positions hated their jobs.

They had been there too long, were misrable but weren’t mature enough or had other reasons for not leaving.

If you work in tech and live in the bay area and don’t like your job the solution is easy.

Leave.

Most likely you will find a much better gig and end up in a culture where your opinion matters.

In the end we just want to help our company be successful by a combination of the skills that we have plus the lessons that we learn along the way.

Life is way to short to spend you time, energy and brain on a company that doesn’t make you happy.

 

 

 

DevOps vs. SysAdmin

I spent the first half of my career as a sysadmin or a director with sysadmin’s reporting to me.

My last 2 jobs I started out as devops and not sysadmin.

What’s the difference?

I really didn’t know myself.

I heard some annoying opinions that devops uses software to automate things such as configuration management.

This is nonsense, way before the term devops existed I managed > 800 servers and we used software ie. cfengine to manage configurations. We also wrote thousands of lines of perl and bash to automate other tasks such as backups, file transfers, kick starting etc…..

So using software isn’t the difference.

The difference is basically that you don’t have to be an expert in data center operations such as:

—  dealing with hardware

—  raid settings

—  procurement 

—  layer 2 & 3 networking

—  console devices

—  firewalls

—  hardware load balancing

—  PDU’s

—  calcuating BTU’s 

—  contract negotiations 

This is a shame because even though I have mostly been working in the cloud lately it really helps me to know all the underlying guts of the data center especially when trouble shooting and dealing with less than knowledgeable support staff( rackspace especially ).

What DevOps is required to know is:

—   the services their companies write and deploy

—   deploy proceedures

—   config files for the services

 —  cloud environments ie rackspace , aws , softlayer etc..

 —  continuous integration

—  you also need to have a deep knowledge of all the 3rd party daemons running such as MySQL, mongodb, redis, memcached, apache, nginx, hadoop, cassandra, riak, etc…

Lucky for me I’ve always been in position to “know the code”. In fact one of the main frustrations in working in the sysadmin role is peers who have limited knowledge or interest in what the infrastructure they support is actually being used for.

I once had a cluster of 200 servers used to edit images and several of my peers had no idea what the purpose of the cluster was. 

Another advantage to “knowing the code” is that you are required to work closely with developers. I love white boarding a problem with developers.

I think when you approach a problem with the combined tools of a developer and devops you have all the bases covered and usually end up with a solution that addresses performance, scalability, resiliency, cost , ease of support and simplicity. If a problem is only solved by devops or a developers alone you rarely cover all bases.

 

ruby script to calculate primes

I have a friend who is really into Pi. In fact he claims to have the first 200 digits memorized.Me I think primes are way cooler, first of all there are a lot of them, and they occur more than you think.

For instance 2,000,000,000,003 is a prime number!

Anyway below is a script I wrote which can be used to determine is a number is a prime or you can just run it and it will start printing out all of the primes starting with the number 2.

ruby primes.rb -h  Usage: primes.rb [ -c ] or [ -o integer]  -c                               calculate primes  -o, --one_integer integer        check if one integer is a prime  -h                               Display this screen  example: primes -c calculate all primes starting with the number 2  example: primes -o check if a given integer is a prime

The code:

require 'optparse'  require 'rubygems'  ###########  # methods #  ###########  def primes(start)    foo = 2    out = Array.new    root = Math.sqrt(start)    div = root.to_i    while  foo <=  div      foo = foo + 1      ans = start.to_f / foo.to_f      nu,de = ans.to_s.split('.')      if de == "0"        out.push(de)      end    end  return out  end  ##################  # define options #  ##################  options = {}  optparse = OptionParser.new do|opts|     opts.banner = "Usage: primes.rb [ -c ] or [ -o integer]"     options[:calculate_primes] = false     opts.on( '-c', 'calculate primes' ) do        options[:calculate_primes] = true     end    options[:one_integer] = nil      opts.on( '-o integer', '--one_integer integer', "check if one integer       is a prime" ) do |check_this|        options[:one_integer] = check_this     end     opts.on( '-h', 'Display this screen' ) do        puts opts        puts "example: primes -c calculate all primes starting with the number 2"        puts "example: primes -o check if a given integer is a prime"        exit     end   optlength = ARGV.length     if optlength < 1        puts opts.banner        exit     end  end  optparse.parse!  if options[:calculate_primes]    start = 2    while start > 1      start = start +1      if start.odd?       out = primes(start)       count = out.count         if count == 0           puts start.to_s         end      end    end  end  if options[:one_integer]    check_this = "#{options[:one_integer]}"    if check_this.to_i.even?      puts check_this.to_s + " is not a prime number"      exit    end    out = primes(check_this)      count = out.count      if count == 0        puts check_this.to_s + " is a prime number"      else        puts check_this.to_s + " is not a prime number"      end  end

Update:Turns out Ruby has a built in function to do this in 3 lines.

require 'mathn'  list_primes = Prime.new  puts list_primes.each { |prime| print prime.to_s + "n", " "; break unless prime > 1 }

Thank you mathn!

lessons learn while devops@posterous

I’m a proud former employee of Posterous and worked there for the 10 months prior to the twitter acquisition.
Although it was a short time  we experienced a lot of growth and it was very intense with a lot of changes in user behavior, product and infrastructure.
Like everything in life when you look back you can always find better ways to do things.
Its been about six months since I left and below are a few of the lessons that I’ve learned and will take with me to my next gig.

MySQL:

– Install a gui tool to help manage the DB’s. Yeah gui’s sound lame but they put a lot of valuable information together in one place and don’t cost you anything.

– Always have a spare slave that you can use to test out new configs, place in service for a while, iterate.

– Go to every MySQL meet up possible, they are a huge source of information and you will walk away from everyone with one or more new trick up your sleeve.

FIGHT SPRAWL:

– Every time you introduce a new technology make it a requirement to have it replace an existing technology in addition to supporting a new feature.
In most cases this isn’t a stretch as a lot of technologies are similar to each other (redis/memcached), (riak/mongodb) etc……

ADMIT YOUR SHORT COMINGS, FOCUS ON THE POSITIVE:

Posterous was rails shop with an abundance of really skilled developers which meant we moved at a rapid pace, multiple deployments a day….we were constantly in a state of change.
Working in an environment like that has advantages and disadvantages.
The disadvantages are that change equals risk.
Advantages are that we were able to push changes to the site soon after the code change, which means if something when wrong the developer didn’t have to context switch to fix the problem.
We spent a lot of time addressing the negative side of rapid development by attempting to introduce a QA process meant to slow things down and prevent mistakes.
Now that I look back I think we should have focused on the positive side of rapid change by speeding up the deployment process and pushing code even more often.
In a perfect world you have a automated QA test that runs in 1 minute and has a 100% coverage….or maybe thats not the perfect world but fantasy island I’m thinking of.

FORCE YOURSELF TO PLAN LONG:

In a rapidly moving start-up with a burn rate it seems like a waste of time to plan long but when it comes to infrastructure its absolutely necessary.
I can go to most web companies and when they talk about their infrastructure there are always one off services that they considered “legacy”.
Legacy is code word for we rolled it out without thinking more than a month out.
Once a month take the key people offsite for a few hours to white board your current infrastructure and what it would look like you got to design it from scratch….that is your long term plan. Now when choosing a new technology you already know how it will fit in long term.

MySQL replication and failover with tungsten-replicator

Having administered MySQL in production for the last 7 years I know how painful it is when a master goes down or when replication breaks.Its also annoying when you rebuild a slave and it takes a really long time for replication to catch up.I decided to give tungsten-replication a look as it claims to be high performance and handles promoting a slave to master.

My setup:

—Two rackspace cloud servers running Ubuntu 12.04

—2GB RAM

—4 cores

—Percona server 5.5

—tungsten-replicator 2.0.5 (free version)

Installation

It took me a while( ~4 hours) to satisfy all of the prerequisites for the OS and MySQL.This wasn’t helped by the mostly horrible documentation. All of the information was there but it was laid out all over the place and took several attempts to find all the pieces.It would have been a huge help to have a script that you can run to check your prerequisites such as ruby version, java version etc….Once I satisfied all the prerequisites it was time to move on to the config

Configuration

The good news here is that tungsten-replication comes with a scripted installation designed to install and configure all of your nodes with one command.The bad news once again is that the documentation is scattered all over the place and really incomplete.Also a really annoying problem is getting to the help output with the commands.Some require —help, some help and others have nothing at all.In the open source version there are 2 types of configuration.1. master-slave2. direct which supports multi-master

master-slave

To get master slave up I downloaded the release:

wget http://code.google.com/p/tungsten-replicator/downloads/detail?name=tungsten-replicator-2.0.5.tar.gz&can=2&q=

untarred it:

tar xzvf tungsten-replicator-2.0.5.tar.gz

cd to tungsten-replicator-2.0.5 and ran the following command:

./tools/tungsten-installer -v       --master-slave       --master-host=198.201.208.173       --datasource-user=tungsten       --datasource-password=tungsten       --datasource-mysql-conf=/etc/mysql/my.cnf       --service-name=gabriel       --home-directory=/opt/tungsten       --cluster-hosts=198.201.208.173,198.201.206.221       --start-and-report

This should install tungsten-replicator to /opt/tungsten start the service on both nodes and report back with its status.To check create a database on the master, it should show up on the slave.

Replication Performance

Now that I have my master-slave up and running I decided to test the replication speed.I took a 40MB dump file and loaded it onto the master.It took 6 seconds to load.What surprised me was how long it took to complete the load on the slave.It took 43 seconds for the slave to get the complete dump, that was 43 seconds after the master load completed.To eliminate the network as a bottle neck I scp’d the file from master to slave and that took 6 seconds.My guess is the process on the master that reformates the statements with global id’s is really inefficient.

failover

I still haven’t figured out how to automatically failover( good luck looking in the docs), but I did figure out how to manually promote the slave to master.First thing you need to do it take replication offline, that doesn’t mean stopping the daemon but just pausing the replication process:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl -service gabriel offline

This needs to be run on both hosts.

To verify that both hosts have replication offline run the following:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl status

Now your ready, in my environment 198.201.208.173 is the current master and 198.201.206.221 the slave( you can see this in the output of the status command).The first step is to change the master to a slave:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl -service gabriel setrole  -role slave -uri thl://198.201.206.221/

run that on the current master

Then turn the old slave on to a master:/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl -service gabriel setrole -role master -uri thl://198.201.206.221/Run this on the current slave.

Now check your work with the status command.When it looks good run this command on both hosts:

/opt/tungsten/tungsten/tungsten-replicator/bin/trepctl -service gabriel online

Again check your work with the status command.

protip

To run commands on both hosts is easy as in the host setup process you copied over ssh keys and authorized_keys files.I just used a bash for loop:

for H in 198.201.208.173 198.201.206.221;  do ssh root@$H "/root/tungsten-replicator-2.0.5/tungsten-replicator/bin/trepctl status"; done

conclusion

Given the replication lag, poor documentation, long setup, and the fact that its really early I consider tungsten-replicator to have a lot of potential but mostly a science experiment for now.Of course this is with the free version.Like I said in the beginning I know how much it sucks when a master fails of when replication breaks.Tungsten-replicator has all of the features needed to solve all of these problems it just doesn’t give me enough confidence to run this in my production environment…..yet

route requests by country code using nginx and the geoip module

I have a client that wanted to limit traffic to their site to only come from the US.

Right away I told them we could use a module in nginx that uses the maxmind geoip library.

It turns out I was right nginx does have a geoip module!

Lucky for me.

Anyway it turns out that using the module is really simple.

After installing it I simply told nginx to route all requests from the US to unicorn and the rest get redirected. 

# send US traffic to unicorn

  if ($geoip_country_code = US) {

      proxy_pass http://unicorn;

      break;

    }

# all other traffic gets redirected
rewrite     ^(.*)   http://foosite.com$1 permanent;

BTW yes hacksors their are a lot of ways of getting around this I know but it should be effective for most requests.

 

 

dynamic preseed file for ubuntu using sinatra

To build ubuntu physical ubuntu servers we use ubuntu preseed.

This works great but if you use a static preseed file you end up building a host that doesn’t have its hostname or static ip address set. This means that you have to manually set it afterward and we decided to automate it.

BTW it took us a while to figure out how to set a static ip in a preseed file.We blogged about it here:network-preseeding-debianubuntu-with-a-static

To do this we wrote a small sinatra app that dynamically generates the preseed file with the hostname and static ip address.

This is done by looking up the mac address of the requested host from the arp table and comparing it to a pipe delimited file that contains the mac address, what the static ip should be and its hostname.

The list is stored in a file named ip2mac.txt and was populated by a script.

The ip2mac.txt file looks like this:

172.28.0.71|a4:ba:db:35:e6:09|chi-devops11a  172.28.0.72|78:2b:cb:03:c5:44|chi-devops11b

Instead of calling a static preseed file from the pxelinux.cfg/default file we instead make a request to the sinatra app which generates it dynamically.The line in the default file we use looks like this:

append console=tty0 console=ttyS1,115200n8 initrd=ubuntu-10.04-server-amd64-  initrd.gz auto=true priority=critical preseed/url=http://172.27.0.115:4567/lucid-preseed-noraid interface=eth0 netcfg/dhcp_timeout=60 console-setup/ask_detect=false console-setup/layoutcode=us console-keymaps-at/keymap=us locale=en_US --

When the request is made the sinatra app does the following:

*  1. looks up the mac address of the request from the apr table*  2. compares the mac address to the matching line in ip2mac.txt*  3. uses the ip and hostname to populate hostname and ip variables in the preseed file*  4. returns the preseed file to the host making the request

The code:

require 'rubygems' # skip this line in Ruby 1.9  require 'sinatra'  require "erb"  require 'logger'  def log(message)    flog = Logger.new('foo.log')    flog.info(message)  end  def lookup_mac(mac)    rr = Array.new    hostfile = File.open("ip2mac.txt","r")    hostfile.readline    hostfile.each do |line|      list_ip,list_mac,name = line.split('|')      if mac.match(list_mac)    rr.push(list_ip)      end    end  return rr[0]  end  def get_mac_address()    ip =  @env['REMOTE_ADDR']    cmd = "arp -n " + ip.chomp + " | grep -v Address | awk '{print $3}'"    mac  = `#{cmd}`   return mac  end  def rev_lookup(ip)    cmd = "host " + ip + " | awk '{print $5}'"    hostname = `#{cmd}`fqdn = hostname.chop.chopreturn fqdn  end  get '/lucid-preseed-noraid' do    mac = get_mac_address()    log(mac)    ips = lookup_mac(mac)    log(ips)    fqdns = rev_lookup(ips)    @ip = ips    @fqdn = fqdns    log(fqdns)    erb :lucid_preseed_noraid  end  get '/lucid-preseed-nosrv' do    mac = get_mac_address()    log(mac)    ips = lookup_mac(mac)    log(ips)    fqdns = rev_lookup(ips)    @ip = ips    @fqdn = fqdns    log(fqdns)    erb :lucid_preseed_nosrv  end  get '/' do    "ops11"  end

To start the sinatra app just run the following:

ruby preseeder.rb

access nested node attributes in a chef recipe

If you use chef bookmark this page as you will need to access nested keys at some point.Chef uses ohai to build a hash of each host chef-client is installed on.In a recipe this is stored in a hash named node.If you want to access a value for a key its simple.

ip = node['ipaddress']

Also if you want to determine if a key exists before you try and access it you can use attribute?()

node.attribute?('ipaddress')

BTW in most cases you will want to make sure the key exists because if it doesn’t chef-client will throw an error.

For nested keys its a bit more difficult, first of all you should always make sure the keys exists. This involves using the has_key? method in nested if statements.Then you can just pull the value from the keys.Below is one way to do it in a recipe.In this example I am making sure the keys filesystem>/dev/sda6>mount exist in the node hash.Then once I’m sure the hash exists I pull out the value.

if node.has_key? "filesystem"     if node["filesystem"].has_key? "/dev/sda6"        if node["filesystem"]["/dev/sda6"].has_key? "mount"           if node['filesystem']['/dev/sda6']['mount'] == '/srv'                execute "foo" do                command "touch /tmp/nested_keys_exist!"                action :run              end           end       end     end  end

riak cluster backup script with compression #riak

This script will create a compressed back up of a riak cluster and keep the previous days copy.I still have to add restoring as an option.

#!/usr/bin/env ruby  t = Time.new  f = t -86400  today = t.strftime("%Y-%m-%d")  yesterday = f.strftime("%Y-%m-%d")  def delete_old()    unless Dir.glob('/net/fs11/srv/posterous/nfs/riak/*-old').empty?      l = Dir.glob('/net/fs11/srv/posterous/nfs/riak/*-old')      puts "deleting oldeest backup"      File.delete(l[0])    end  end  def rotate_last(yesterday)    f = "/net/fs11/srv/posterous/nfs/riak/riak_backup-" + yesterday.chomp  + ".bz2"    t = f + "-old"    if  File.exists?(f)      puts "rotating old backup to old"      File.rename(f,t)    end  end  def run_backup(today)    puts "creating backup file"    dump = "/usr/sbin/riak-admin backup riak@172.27.0.113 riak       /net/fs11/srv/posterous/nfs/riak/riak_backup-" + today.chomp + " all"    `#{dump}`  end  def compress(today)    puts "compressing backup"    compress = "/usr/bin/pbzip2 /net/fs11/srv/posterous/nfs/riak/riak_backup-" + today.chomp    `#{compress}`  end  delete_old()  rotate_last(yesterday)  run_backup(today)  compress(today)

How to create custom graphs with Munin

Munin is lacking many features that cacti has but one thing its really good at is creating custom graphs.Basically all you need is a script written in any language that when run will print out the values and when given the config argument will print the config for the graph.In the example below I am graphing the number of unicorn processes running on a box and the number of that are busy.The values:

./unicorn_inuse cap.value 21inuse.value 11

You can see above I am getting 2 values to graph, cap.value is the total number of unicorn processes running and inuse.value is the number that are busy.

The config:

./unicorn_inuse configgraph_title Total Unicorns in useinuse.type GAUGEinuse.label Unicorns in useinuse.draw LINE1graph_category Unicorngraph_args --base 1000 -l 0graph_scale nocap.label Total Unicornscap.draw LINE2cap.type GAUGE

Not too many details in the config but graph_category is how to put graphs in a specific bucket in the munin UI.

The graph:Alt textThe code:

#!/usr/bin/env ruby  def get_total()  cmd = 'ps aux| grep capuser | grep unicorn | wc -l'  output = `#{cmd}`  num = output.match(/d+/)  return numenddef get_chillin()  cmd = "ps aux| grep capuser | grep unicorn | grep 'chillin'| wc -l"  output = `#{cmd}`  num = output.match(/d+/)  return numenddef config()       puts 'graph_title Total Unicorns in use'      puts 'inuse.type GAUGE'     puts 'inuse.label Unicorns in use'  puts 'inuse.draw LINE1'  puts 'graph_category Unicorn'     puts 'graph_args --base 1000 -l 0'      puts 'graph_scale no'    puts 'cap.label Total Unicorns'  puts 'cap.draw LINE2'  puts 'cap.type GAUGE'end    argu =  ARGV[0]     if argu == 'config'       config()     else       total = get_total()      chillin = get_chillin()      inuse = total[0].to_i - chillin[0].to_i  puts "cap.value " + total[0].to_s       puts "inuse.value " + inuse.to_s     end