
Moving duplicity (and Deja-Dup) backups

In my last blog post I said that I had moved my backups from an external disk to Rackspace Cloud Files and promised I’d explain how.

Ok, so why bother? I had about 100 GB of data that was being backed up. I didn’t want to upload 99% of that, have my wifi go bonkers, and then have to start over (because Duplicity apparently isn’t very good at resuming). So, instead I wanted to make the initial backup to an external drive (the backup wouldn’t fit on my laptop’s hard drive) and defer copying it to Rackspace as time and connectivity permitted.

That was simple enough.

Once the first, full backup was made, I wanted incremental backups to go directly to Cloud Files, so I needed to get Deja-Dup to realise that there was already a backup on there.

This was the trickier bit.

When you ask Duplicity to interact with a particular backup location, it calculates a hash of the location’s URI and looks that up in its cache to see if it already knows about it. If you’ve made a backup with Deja-Dup, you can go and look in $HOME/.cache/deja-dup. This is what I had:

soren@lenny:~$ ls -l $HOME/.cache/deja-dup/
drwxr-xr-x 2 soren soren 4096 2011-01-14 18:09 4e33cf513fa4772471272dbd07fca5be
soren@lenny:~$

You see a directory named after the hash of the URI of the backup location I used, namely “file:///media/backup” (the MD5 sum of which is 4e33cf513fa4772471272dbd07fca5be).
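You can reproduce that name yourself (the -n matters, for reasons explained below):

soren@lenny:~$ echo -n 'file:///media/backup' | md5sum
4e33cf513fa4772471272dbd07fca5be -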

Inside this directory, we find:

soren@lenny:~$ ls -l /home/soren/.cache/deja-dup/4e33cf513fa4772471272dbd07fca5be/
-rw------- 1 soren soren 750938885 Jan 14 15:47 duplicity-full-signatures.20110113T170937Z.sigtar.gz
-rw------- 1 soren soren    653487 Jan 14 15:47 duplicity-full.20110113T170937Z.manifest
soren@lenny:~$

It contains a manifest and a signature file. Neither of these files has any record of the backup location; that information exists only in the name of the directory. Essentially, all I needed to do was rename the directory to match the Cloud Files location. Being a bit cautious, I decided to copy it instead. The URI for a container on Cloud Files looks like “cf+http://containername”. Knowing this, it was as simple as:

soren@lenny:~$ echo -n 'cf+http://lenny' | md5sum
2f66137249874ed1fdc952e9349912d4 -
soren@lenny:~$ cd $HOME/.cache/deja-dup
soren@lenny:~/.cache/deja-dup$ cp -r 4e33cf513fa4772471272dbd07fca5be 2f66137249874ed1fdc952e9349912d4

The -n option to echo is essential. Without it, I’d have been calculating the MD5 sum of the URI with a trailing newline.

Before I ran Deja-Dup again, I made sure the two files above were copied to Cloud Files. If I hadn’t, then the first time Duplicity talked to Cloud Files, it would notice that these files didn’t exist at the expected backup location, conclude that the local cache of them must be invalid, and delete them. This happened to me the first time around, so making a copy rather than just renaming the directory turned out to be a good idea.
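For reference, uploading just those two files can be done with the same st tool I show at the end of this post. A sketch (I’m assuming an unencrypted backup, so the names on Cloud Files match the cached ones; an encrypted backup would add a .gpg suffix, and <api-key> is a placeholder):

soren@lenny:~$ cd /media/backup
soren@lenny:/media/backup$ st -A https://auth.api.rackspacecloud.com/v1.0 -U soren -K <api-key> upload lenny duplicity-full.20110113T170937Z.manifest duplicity-full-signatures.20110113T170937Z.sigtar.gz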

All that was left to do now was to change my backup location in Deja-Dup. This should be simple enough, so I won’t go into detail about that.

The best part about this, I think, is that it wasn’t until 5-6 days later that my upload of the initial full backup finished. In the meantime, however, I was able to do incremental backups just fine, because all Duplicity needs for that is the signature files from the previous runs.

Oh, and to actually upload the files, I used the “st” tool from Swift. Something like this:

soren@lenny:~$ cd /media/backup
soren@lenny:/media/backup$ st -A https://auth.api.rackspacecloud.com/v1.0 -U soren -K 6e6f742061206368616e636521212121 upload lenny *
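Once the upload finishes, st’s list subcommand should show what made it into the container (as far as I recall, list ships alongside upload in the same tool):

soren@lenny:/media/backup$ st -A https://auth.api.rackspacecloud.com/v1.0 -U soren -K 6e6f742061206368616e636521212121 list lenny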

It only took me 20 years..

tl;dr: I now have daily backups of my laptop, powered by Rackspace Cloud Files (powered by OpenStack), Deja-Dup, and Duplicity.

I’ve been using computers for a long time. If memory serves, I got my first PC when I was 9, so that’s 20 years ago now. At various times, I’ve set up some sort of backup system, but I always ended up

  • annoyed that I couldn’t actually *use* the biggest drive I had, because it was reserved for backups,
  • annoyed that I had to go and connect the drive and do something active to get backups running, because leaving the disk permanently plugged into my system might mean the backup got toasted along with my active data when disaster struck,
  • and annoyed at a bunch of other things.

Cloud storage solves the hardest part of this. With Rackspace Cloud Files, I have access to an infinite[1] amount of storage. I can just keep pushing data, Rackspace keeps it safe, and I pay for exactly the space I’m using. Awesome.

All I need is something that can actually make backups for me and upload them to Cloud Files. I’ve known about Duplicity for a long time, and I also knew it had been able to talk to Cloud Files for a while, but I never got into the habit of running it at regular intervals. Running it from cron was annoying: maybe my laptop wasn’t on when it wanted to run, and if I wasn’t logged in, my homedir would be encrypted anyway, etc., etc. Lots of chances for failure.

Enter Deja-Dup! Deja-dup is a project spearheaded by my awesome former colleague at Canonical, Mike Terry. It uses Duplicity on the backend and gives me a nice, really simple frontend for getting it set up. It has its own timing mechanism that runs in my GNOME desktop session, which means it only runs when my laptop is on and I’m logged in. Every once in a while, it checks how long it’s been since my last backup. If it’s more than a day, an icon pops up in the notification area that offers to run a backup. I’ve only been using this for a day, so it’s only asked me once. I’m not sure if it starts on its own if I give it long enough.

A couple of caveats:

  • Deja-dup needs a very fresh version of libnotify, which means you need to either be running Ubuntu Natty, use backported libraries, or patch Deja-dup to work with the version of libnotify in Maverick. I opted for the last of these.
  • I have a lot of data. Around 100 GB worth. Some of it is VMs, some of it is code, some of it is various media files. Duplicity doesn’t support resuming a backup if it breaks halfway, and I “only” have 8 Mbit/s of upstream bandwidth. That meant I’d have to stay connected to the Internet for 28 hours straight (in a perfect world) and not have anything unexpected happen along the way. I wasn’t really interested in that, so I made my initial backup to an external drive, and I’m now copying the contents of that to Rackspace at my own pace. I can stop and resume at will. The tricky part here was to get Deja-Dup to understand that the backup it thinks is on an external drive really is on Cloud Files. I’ll save that for a separate post.

[1]: Maybe not actually infinite, but infinite enough.

OpenStack Nova in Maverick

Ubuntu Maverick was released yesterday. Big congrats to the Ubuntu team for another release well out the door.

As you may know, both OpenStack storage (Swift) and OpenStack compute (Nova) are available in the Ubuntu repositories. We haven’t made a proper release of Nova yet, so that’s a development snapshot, but it’s in reasonably good shape. Swift, on the other hand, should be in very good shape and production ready. I’ve worked mostly on Nova, so that’s what I’ll focus on.

So, to get to play with Nova in Maverick on a single machine, here are the instructions:

sudo apt-get install rabbitmq-server redis-server
sudo apt-get install nova-api nova-objectstore nova-compute \
                nova-scheduler nova-network euca2ools unzip

rabbitmq-server and redis-server are not listed as dependencies of the Nova packages, because they don’t need to live on the same host. In fact, as soon as you add the next compute node (or API node, or whatever), you’ll want to use a remote RabbitMQ server and a remote database, too. But for our small experiment here, we need a RabbitMQ server and a Redis server (it’s very likely that the final release of Nova will not require Redis, but for now, we need it).
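If you do split services across machines later, you point them at the shared servers with flags. Just a sketch (I’m assuming the Maverick packages keep a gflags-style flag file in /etc/nova/nova.conf, and the host names here are made up; --rabbit_host and --sql_connection are the flags I’d reach for, but double-check them against your build):

--rabbit_host=rabbit.example.com
--sql_connection=mysql://nova:secret@db.example.com/nova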

A quick explanation of the different components:

RabbitMQ
is a messaging system that implements AMQP. Basically, it’s a server that passes messages around between the other components that make up Nova.
nova-api
is the API server (I was shocked to learn this, too!). It implements a subset of the Amazon EC2 API. We’re working on adding the rest, but it takes time. It also implements a subset of the Rackspace API.
nova-objectstore
stores objects. It implements the S3 API. It’s quite crude. If you’re serious about storing objects, Swift is what you want. Really.
nova-compute
the component that runs virtual machines.
nova-network
the network worker. Depending on configuration, it may just assign IPs, or it could work as the gateway for a bunch of NATed VMs.
nova-scheduler
the scheduler (another shocker). When a user wants to run a virtual machine, they send a request to the API server. The API server asks the network worker for an IP and then passes off handling to the scheduler. The scheduler decides which host gets to run the VM.

Once it’s done installing (which should be a breeze), you can create an admin user (I name mine “soren” for obvious reasons):

sudo nova-manage user admin soren

and create a project (also named soren) with the above user as the project admin:

sudo nova-manage project create soren soren

Now, you’ll want to get a hold of your credentials:

sudo nova-manage project zipfile soren soren

This yields a nova.zip in the current working directory. Unzip it..

unzip nova.zip

and source the rc file:

. novarc

And now you’re ready to go!

Let’s just repeat all that in one go, shall we?

sudo apt-get install rabbitmq-server redis-server
sudo apt-get install nova-api nova-objectstore nova-compute \
                nova-scheduler nova-network euca2ools unzip
sudo nova-manage user admin soren
sudo nova-manage project create soren soren
sudo nova-manage project zipfile soren soren
unzip nova.zip
. novarc

That’s pretty much it. Now your cloud is up and running: you’ve created an admin user, retrieved the corresponding credentials, and put them in your environment.

This is not much fun without any VMs to run, so you need to add some images. We have some small images we use for testing that you can download like this:

wget http://c2477062.cdn.cloudfiles.rackspacecloud.com/images.tgz

Extract that file:

tar xvzf images.tgz

This gives you a directory tree like this:

images
|-- aki-lucid
|   |-- image
|   `-- info.json
|-- ami-tiny
|   |-- image
|   `-- info.json
`-- ari-lucid
    |-- image
    `-- info.json

As a shortcut, you could just extract this directly in /var/lib/nova and change the permissions appropriately (more on that in a moment), but to get the full experience, we’ll use euca-* to get these images uploaded.
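The shortcut version would be something like this (a sketch only; I’m assuming the packages create a nova user and that Nova looks for images under /var/lib/nova/images, so adjust to your install):

sudo tar xvzf images.tgz -C /var/lib/nova
sudo chown -R nova:nova /var/lib/nova/images

The full euca-* route goes like this: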

euca-bundle-image -i images/aki-lucid/image -p kernel --kernel true
euca-bundle-image -i images/ari-lucid/image -p ramdisk --ramdisk true
euca-upload-bundle -m /tmp/kernel.manifest.xml -b mybucket
euca-upload-bundle -m /tmp/ramdisk.manifest.xml -b mybucket
out=$(euca-register mybucket/kernel.manifest.xml)
[ $? -eq 0 ] && kernel=$(echo $out | awk -- '{ print $2 }') || echo $out

out=$(euca-register mybucket/ramdisk.manifest.xml)
[ $? -eq 0 ] && ramdisk=$(echo $out | awk -- '{ print $2 }') || echo $out

euca-bundle-image -i images/ami-tiny/image -p machine  --kernel $kernel --ramdisk $ramdisk
euca-upload-bundle -m /tmp/machine.manifest.xml -b mybucket
out=$(euca-register mybucket/machine.manifest.xml)
[ $? -eq 0 ] && machine=$(echo $out | awk -- '{ print $2 }') || echo $out
echo kernel: $kernel, ramdisk: $ramdisk, machine: $machine

Alright, so we have images!
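If you want to double-check, euca-describe-images (standard euca2ools) should list the three IDs we just captured:

euca-describe-images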

Now, we just need a keypair:

euca-add-keypair mykey > mykey.priv
chmod 600 mykey.priv

Let’s run a VM!

euca-run-instances $machine --kernel $kernel --ramdisk $ramdisk -k mykey

This should respond with some info about the VM, among other things its IP.
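If you miss it in the output, euca-describe-instances (also standard euca2ools) lists your running instances along with their IPs:

euca-describe-instances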

In my case, it was 10.0.0.5:

ssh -i mykey.priv root@10.0.0.5

YAY!

I’ll leave it to someone else to provide similar instructions for Swift.

OpenStack is open for business

Moments ago Rackspace announced the OpenStack project. Not only is this awesome news in and of itself, it also means that I can finally blog about it :)

Rackspace’s IaaS offering consists of two parts: Cloud Servers and Cloud Files. Incidentally, OpenStack (so far, at least) has two main components: a “compute” component called “Nova” and a “storage” component called “Swift”. Swift is the software that runs Rackspace’s Cloud Files today. Nova was initially developed by NASA and is not currently in use at Rackspace, but will eventually replace the existing Cloud Servers platform.

Last week, we held a design summit in Austin, TX, USA, with a bunch of people from companies all around the world, who all showed up to see what we were up to and to help out by giving requirements, designing the architecture, or writing patches. The amount of interest was astounding!

I’m sure others will be blogging at length about all that stuff, so I’d like to touch upon some of the ways in which Nova differs from the alternatives out there. I’ll leave it to someone else to talk about Swift.

  • Nova is written in Python and uses Twisted.
  • Nova is completely open source. There’s no secret sauce. We won’t ever limit functionality or performance so that we can sell you an enterprise edition. It’s all released under the Apache license, so it’s conceivable that some company might write proprietary, for-pay extensions, but it won’t be coming from us. Ever. This is true for Swift as well, by the way.
  • Nova currently uses Redis for its key-value store.
  • Nova can use either LDAP or its key-value store for its user database.
  • Nova currently uses AMQP for messaging, which is the only mechanism with which the different components of Nova communicate.
  • The physical hosts that will run the virtual machines all have a component of Nova running on them. It takes care of setting up disk space and other parts of the virtual machine preparation.
  • It supports the EC2 query API.
  • The Rackspace API is in the works. I expect this will be the basis for the “canonical” API of Nova in the future, but any number of APIs could be supported.

I cannot explain how excited I am about this. Let me know what you think!

Hudson and VMBuilder

Unhappy with the current state of VMBuilder, I recently decided to take a look at Hudson, hoping it can help improve quality going forward. Hudson is a “continuous integration” tool. This means that it’s a tool you use to apply quality control continuously rather than only either when you’re feeling bored or when a release is imminent.

I’ve set up Hudson with a number of jobs:

  • One monitors the VMBuilder trunk bzr branch. Whenever something changes there, it downloads it, runs pylint on it, runs the unit tests (pylint and unit test setup with help from a blog post by Joe Heck), and rolls a tarball (there’s a sketch of this step after the list). Finally it triggers the next job..
  • ..which builds an Ubuntu source package out of it, and triggers the next job..
  • ..which signs and uploads it to the VMBuilder PPA that I recently blogged about..
  • Last, but certainly not least, I’ve set up the very first completely automated, end-to-end VMBuilder test. It grabs the freshest tarball from Hudson, copies it to a reasonably beefy server, builds a VM, boots it up, and upon successful boot, it reports back that it all worked, and Hudson is happy. It doesn’t exercise all the various plugins of VMBuilder (not even close), but it’s a start!
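For the curious, the shell step behind that first job looks roughly like this (a sketch only; the branch URL and the exact pylint/nosetests invocations are my guesses, not the actual job config):

bzr branch lp:vmbuilder vmbuilder
cd vmbuilder
pylint -f parseable VMBuilder > pylint.out || true
nosetests --with-xunit
python setup.py sdist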

VMBuilder in Lucid == lots of fail

Let it be no secret that I’m unhappy with the state of VMBuilder in Lucid (and in general for that matter). Way too many regressions crept in and I didn’t have time to fix them all. I still expect to do an SRU for all of this, but every time I try to attack the bureaucracy involved in this, I fail. I need to find a few consecutive hours to throw at this very soon.

Anyways, in an effort to make testing easier, I’ve set up a PPA for VMBuilder.

I’ve set up a cloud server that monitors the VMBuilder trunk bzr branch. If there’s been a new commit, it rolls a tarball, builds a source package out of it, and uploads it to that PPA. That way, adventurous users can grab packages from there and test things out before they go into an SRU. To add the PPA, you simply run this command:

sudo add-apt-repository ppa:vmbuilder/daily
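From there, it’s the usual routine (I’m assuming the package keeps its archive name, python-vm-builder):

sudo apt-get update
sudo apt-get install python-vm-builder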

I’m also working on a setup that will automatically test these packages. The idea is to fire up another cloud server, make it install a fresh VMBuilder from that PPA, perform a bunch of tests, and report back. To do this, I’m injecting an upstart job into the instance that

  1. adds the ppa,
  2. installs vmbuilder,
  3. builds a VM, which (using the firstboot option) will call back into the host when it has booted successfully,
  4. sets up a listener waiting for this callback (a sketch of this pair follows the list),
  5. waits a set amount of time for this callback.
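A minimal version of that callback pair might look like this (purely a sketch; the address, the port, and the use of nc and timeout are my choices, not the actual setup):

#!/bin/sh
# firstboot script baked into the test VM; phones home once the VM is up
echo booted | nc 192.168.122.1 9999

And on the host side:

# wait up to ten minutes for the VM to call back
timeout 600 nc -l 9999 >/dev/null && echo 'VM booted' || echo 'no callback'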

If I get a response in a timely manner, I assume all is well. If not, it’ll notify me somehow.

The idea is to make it run a whole bunch of builds to attempt to exercise as much of the code base as possible.

I’ll try to make a habit of blogging about the progress on this as I know a lot of people are aggravated by the current state of affairs and this way, they can see that something is happening.

Cloud computing – Same old song?

I recently ended up in a conversation with a guy who turned out also to work in IT. When I mentioned I worked on cloud computing, he started talking about how it was just the same old song. Before I had a chance to reply, we were interrupted, but I haven’t really been able to push this aside, and I’d like to address this point of view, as it’s probably held by others as well.

He said that he found cloud computing to be “old wine in new bottles”. His arguments were almost exclusively about how outsourcing is a bad idea. The rest of the time he spent pointing out that in all the time he’d had an Amazon S3 account (I think he said 2-3 years), he hadn’t noticed a price reduction, even though the price of self-hosted storage keeps decreasing.

Cloud computing certainly shares some characteristics with outsourcing: you are running services on someone else’s hardware, in their infrastructure, leaving a big chunk of responsibility with the provider. It’s also true that you’re paying a premium for the hardware compared to what it would have cost if you had it in your own data center. The difference between CAPEX and OPEX seemed to be lost on him, along with the fact that you’re also freeing human resources to work on more interesting things, but none of this is really the point.

Apart from sharing the benefits (and drawbacks!) of outsourcing, cloud computing offers a new level and type of dynamism and availability. If you’re just going to take your Exchange server (his example) or whatnot and put it on a statically allocated cloud server, then yes, it’s the same old outsourcing song. If, however, you design your service so that it can scale horizontally, the dynamism of cloud computing lets you scale both up and down to match changes in demand. This way you save money when your service is idling, yet you can scale up quickly to respond to rising demand. More resources are (supposedly) always available and right at your fingertips, a simple API call away. Leveraged properly, it’s very likely that you could not only save money running the same service in the cloud, but also deal with fluctuations in demand much better than you could in your own data center or in an old-school outsourcing scenario.

As for his other point, about prices never decreasing even though the cost of hosting these things yourself decreases over time: that’s a good point. He thought that was how these providers were really expecting to make money. I wouldn’t go that far at all, though. What makes cloud computing a viable business is by and large economies of scale. Hosting lots and lots of virtual servers or petabyte upon petabyte of data is much cheaper /per unit/ than hosting a few servers and a few terabytes of data, but I have to agree that the price per GB of stored data should be decreasing over time in response to the decreasing cost of storage on the market.

Not an April fool’s joke

Today marks the beginning of my second month working for Rackspace.

I’ve realised I haven’t actually blogged about my leaving Canonical, so this post doubles as an announcement about that, I suppose.

A lot of thought was put into that decision. Ubuntu is an awesome project to work on, and Canonical was a fun and interesting “place” to work, but “all good things must come to an end”, so I decided to “quit while I was ahead”. Come up with more clichés if you feel like it. The short story is that I just wasn’t having much fun anymore.

Rackspace came along as an interesting option. I’ve known about them since forever, and they are doing very interesting stuff in the cloud computing area, so it seemed like a natural progression. I had a few interviews and after we overcame some initial difficulties (they’re not that used to having people from Denmark work for them) I started my new job working on Cloud Sites on March 1st.

This does not mean that I’m going to stop working on Ubuntu, though. It’ll just be on my own time, working on a narrower set of things than I have for a while. I also hope to be at UDS (I’ve applied for sponsorship) so that I can meet all my awesome old colleagues.