Category Archives: Work

A week into OpenStack’s third release cycle…

With OpenStack’s second release safely out the door last week, we’re now well on our way towards the next release, due out in April. This release will be focusing on stability and deployability.

To this end, I’ve set up a Jenkins box that runs a bunch of tests for me. I’ve used Jenkins before, but never in this (unintentional TDD) sort of way and I’d like to share how it’s been useful to me.

I have three physical hosts. One runs Lucid, one runs Maverick, and one runs Natty. I’ve set them up as slaves of my Jenkins server (which runs separately on a cloud server at Rackspace).

I started out by adding a simple install job. It would blow away existing configuration and install afresh from our trunk PPA, create an admin user, download the Natty UEC image and upload it to the “cloud”. This went reasonably smoothly.

Then I started exercising various parts of the EC2 API (which happens to be what I’m most fluent in). I would:

  1. create a keypair (euca-create-keypair),
  2. find the image id (euca-describe-images with a bit of grep),
  3. run an instance (euca-run-instances),
  4. wait for it to go into the “running” state (euca-describe-instances),
  5. open up port 22 in the default security group (euca-authorize),
  6. find the ip (euca-describe-instances),
  7. connect to the guest and run a command (ssh),
  8. terminate the instance (euca-terminate-instances),
  9. close port 22 in the security group again (euca-revoke),
  10. delete the keypair (euca-delete-keypair).
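The ten steps above could be scripted roughly as follows. This is only a sketch, not the actual Jenkins job: the key name, the key file, the `ubuntu` login, the `uptime` command, the choice of the first listed image and the way the command output is parsed are all assumptions made for illustration. The `run` parameter is injectable so the flow can be exercised without a cloud.

```python
import os
import re
import subprocess
import time


def run(cmd):
    """Run one command and return its stdout; raises if it exits non-zero."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def instance_line(output, instance_id):
    """Return the INSTANCE line for instance_id, or "" if it is not listed."""
    for line in output.splitlines():
        fields = line.split()
        if len(fields) > 1 and fields[0] == "INSTANCE" and fields[1] == instance_id:
            return line
    return ""


def smoke_test(run=run, key="jenkins-test", key_file="jenkins-test.pem",
               poll_seconds=5, max_polls=60):
    # 1. create a keypair; the private key is printed on stdout
    with open(key_file, "w") as handle:
        handle.write(run(["euca-create-keypair", key]))
    os.chmod(key_file, 0o600)

    # 2. find the image id (assumption: simply take the first IMAGE line)
    images = run(["euca-describe-images"])
    image_id = next(l.split()[1] for l in images.splitlines()
                    if l.startswith("IMAGE"))

    # 3. run an instance
    started = run(["euca-run-instances", "-k", key, image_id])
    instance_id = next(l.split()[1] for l in started.splitlines()
                       if l.startswith("INSTANCE"))

    # 4. wait for it to go into the "running" state
    for _ in range(max_polls):
        described = run(["euca-describe-instances", instance_id])
        if "running" in instance_line(described, instance_id).split():
            break
        time.sleep(poll_seconds)
    else:
        raise RuntimeError(instance_id + " never reached the running state")

    # 5. open up port 22 in the default security group
    rule = ["-P", "tcp", "-p", "22", "-s", "0.0.0.0/0", "default"]
    run(["euca-authorize"] + rule)

    # 6. find the ip (assumption: first IPv4-looking value on the line)
    described = run(["euca-describe-instances", instance_id])
    ip = re.search(r"\d+\.\d+\.\d+\.\d+",
                   instance_line(described, instance_id)).group(0)

    # 7. connect to the guest and run a command
    result = run(["ssh", "-i", key_file, "-o", "StrictHostKeyChecking=no",
                  "ubuntu@" + ip, "uptime"])

    # 8-10. terminate the instance, close port 22 again, delete the keypair
    run(["euca-terminate-instances", instance_id])
    run(["euca-revoke"] + rule)
    run(["euca-delete-keypair", key])
    return result
```

A real job would probably also want the cleanup steps in a `finally` block, so a failed boot does not leave instances and open ports behind.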

I was using SQLite as the data store (the default in the packages) and it was known to have concurrency issues (it would time out trying to lock the DB), so I wrapped all euca-* commands in a retry loop that would try everything up to 10 times. This was good enough to get me started.
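A retry wrapper along those lines might look like the sketch below. The helper names `retry` and `euca`, the one second pause between attempts, and the choice to treat any non-zero exit status as "try again" are assumptions; the original loop is not shown here.

```python
import subprocess
import time


def retry(action, attempts=10, delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, giving up after `attempts` tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except subprocess.CalledProcessError as error:
            last_error = error
            if attempt < attempts:
                sleep(delay)  # give the other process time to release the lock
    raise last_error


def euca(*args):
    """Run a euca-* command with retries and return its stdout."""
    return retry(lambda: subprocess.run(
        args, check=True, capture_output=True, text=True).stdout)
```

With this in place, a call such as `euca("euca-describe-instances")` is attempted up to ten times before the failure is allowed to break the build.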

So, pretty soon I would see instances failing to start. However, once Jenkins was done with them, it would terminate them, and I didn’t have anything left to use for debugging. I decided to add the console log to the Jenkins output, so I just added a call to euca-get-console-output. The console logs revealed that every so often, instances would fail to get an IP from dnsmasq. The syslog had a lot of entries from dnsmasq refusing to hand out the IP that Nova asked it to, because it already belonged to someone else. Clearly, Nova was recycling IPs too quickly.

I read through the code that was supposed to handle this several times, and it looked great. I tried drawing it on my whiteboard to see where it would fall through the cracks. Nothing. Then I tried logging the SQL for that specific operation, and it looked just fine. It wasn’t until I actually copied the SQL from the logs and ran it in sqlite3’s CLI that I realised it would recycle IPs that had just been leased. It took me hours to realise that SQLite didn’t compare these values as timestamps, but as strings. They were formatted slightly differently, so the comparison would almost always match.

An 11 character patch later, this problem was solved. 1½ days of work. -11 characters. That’s about -1 character an hour. Rackspace is clearly getting their money’s worth having me work for them. I could do this all day!
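The string-versus-timestamp trap is easy to reproduce with Python’s built-in sqlite3 module. The two timestamp formats below are made up for the demonstration (the formats Nova actually used are not shown here); the point is only that SQLite compares the two text values character by character unless both sides are normalised first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical formats: same kind of timestamp, written two slightly
# different ways. The lease is one hour NEWER than the cutoff.
leased_at = "2011-02-10 12:00:00"
cutoff = "2011-02-10T11:00:00"

# Plain comparison: both values are TEXT, so SQLite compares them as
# strings, and " " sorts before "T" no matter what the clock says.
as_strings = conn.execute(
    "SELECT ? < ?", (leased_at, cutoff)).fetchone()[0]

# Normalising both sides with datetime() gives the chronological answer.
as_times = conn.execute(
    "SELECT datetime(?) < datetime(?)", (leased_at, cutoff)).fetchone()[0]

print(as_strings, as_times)  # prints: 1 0
```

Compared as strings, the fresh lease looks older than the cutoff (1), so it would be recycled; compared as timestamps it does not (0).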

That got me a bit further. Instances would now reliably come up, one at a time. I expanded out a bit, trying to run two instances at a time. This quickly blew up in my face. This time I made do with a 4 character patch. Awesome.

At this point, I’d had so many problems with SQLite locking that I got fed up. I was close to just replacing it with MySQL to get it over with, but then I decided that it just didn’t make sense. Sure, it’s a single file and we’re using it from different threads and different processes, but we’re not pounding on it. They really ought to be able to take turns. It took quite a bit of Googling and wondering, but eventually I came up with a 4 line patch (counting effectively changed lines of code) that would tell SQLAlchemy never to hold connections to SQLite open. Ever. That totally solved it. I was rather surprised, to be honest. I could now remove all the retry loops, and it’s worked perfectly ever since.
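The idea of never holding a connection open can be illustrated with the standard library alone. This is not the patch itself: the helper name, the table and the file path are invented for the example. With SQLAlchemy, the comparable setting would be a pool that never keeps connections around, such as `NullPool`, though the exact change is not shown here.

```python
import contextlib
import os
import sqlite3
import tempfile


@contextlib.contextmanager
def short_lived_connection(path):
    """Open a fresh connection, commit on success, and always close it."""
    conn = sqlite3.connect(path, timeout=5.0)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()  # nothing lingers between operations


# Hypothetical database file and schema, just for the demonstration.
db_path = os.path.join(tempfile.mkdtemp(), "nova-example.sqlite")

with short_lived_connection(db_path) as conn:
    conn.execute("CREATE TABLE instances (id TEXT PRIMARY KEY, state TEXT)")

with short_lived_connection(db_path) as conn:
    conn.execute("INSERT INTO instances VALUES (?, ?)",
                 ("i-00000001", "running"))

with short_lived_connection(db_path) as conn:
    rows = conn.execute("SELECT id, state FROM instances").fetchall()

print(rows)  # prints: [('i-00000001', 'running')]
```

Because every operation opens and closes its own connection, no thread or process keeps a connection to the file open between operations, which gives the others a chance to take their turn.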

So far, so good. Then I decided to try to get even more aggressive. I would let the three boxes all target a single one, so they’d all three run as clients against the same single-box “cloud”. I realised that because I used private addressing, I had to expand my tests and use floating IPs to be able to reach VMs from another box. Having done so, I realised that this didn’t work on the box itself. A 4 line patch (really only 2 lines, but I had to split them for pep8 compliance) later, and I was ready to rock and roll.

It quickly turned out that, as I had suspected, my 4 character patch earlier wasn’t broad enough, so I expanded a bit on that (4 lines modified).

Today, though, I found that a surprising number of VMs were failing to boot, ending up with the dreaded:

General error mounting filesystems.
A maintenance shell will now be started.
CONTROL-D will terminate this shell and reboot the system.
Give root password for maintenance
(or type Control-D to continue):

I tried changing the block device type (we use virtio by default, so I tried ide and scsi), I tried not using copy-on-write images, I tried disabling any code that would touch the images. Nothing worked. I blamed the kernel, I blamed qemu, everything. I replaced everything, piece by piece, and it still failed quite often. After a long day of debugging, I ended up looking at mountall. It seems Plymouth often segfaults in these settings (where the only console is a serial port), and when it does, mountall dies, killed by SIGPIPE. One 5 line (plus a bunch of comments) patch to mountall later, which is still pending review, I can now run hundreds of VMs in a row, and 5-10-ish in parallel, with no failures at all.

So, in the future, Jenkins will provide me with a great way to test drive and validate my changes, making sure that I don’t break anything, but right now, I’m extending the tests, discovering bugs and fixing them as I extend the test suite, very test-driven-development-y. It’s quite nice. At this rate, I should have pretty good test coverage pretty soon and be able to stay confident that things keep working.

I also think it’s kind of cool how much of a difference this week has made in terms of stability of the whole stack, with only 19 lines of code touched. :)

OpenStack is open for business

Moments ago Rackspace announced the OpenStack project. Not only is this awesome news in and of itself, it also means that I can finally blog about it :)

Rackspace’s IaaS offering consists of two parts: Cloud Servers and Cloud Files. Incidentally, OpenStack (so far, at least) has two main components to it: a “compute” component called “Nova” and a “storage” component called “Swift”. Swift is the software that runs Rackspace’s Cloud Files today. Nova was initially developed by NASA and is not currently in use at Rackspace, but will eventually replace the existing Cloud Servers platform.

Last week, we held a design summit in Austin, TX, USA, with a bunch of people from companies all around the world who all showed up to see what we were up to and to help out by giving requirements, designing the architecture or writing patches. The amount of interest was astounding!

I’m sure others will be blogging at length about all that stuff, so I’d like to touch upon some of the ways in which Nova differs from the alternatives out there. I’ll leave it to someone else to talk about Swift.

  • Nova is written in Python and uses Twisted.
  • Nova is completely open source. There’s no secret sauce. We won’t ever limit functionality or performance so that we can sell you an enterprise edition. It’s all released under the Apache license, so it’s conceivable that some company might write proprietary, for-pay extensions, but it won’t be coming from us. Ever. This is true for Swift as well, by the way.
  • Nova currently uses Redis for its key-value store.
  • Nova can use either LDAP or its key-value store for its user database.
  • Nova currently uses AMQP for messaging, which is the only mechanism with which the different components of Nova communicate.
  • The physical hosts that will run the virtual machines all have a component of Nova running on them. It takes care of setting up disk space and other parts of the virtual machine preparation.
  • It supports the EC2 query API.
  • The Rackspace API is in the works. I expect this will be the basis for the “canonical” API of Nova in the future, but any number of APIs could be supported.

I cannot explain how excited I am about this. Let me know what you think!

“I got redirected here from What gives?”

I got fed up with the old site. It was unfocused, unprofessional, not very pretty and out of date. Frankly, I was feeling embarrassed about it.

I took it offline completely a couple of weeks ago, expecting to redo it altogether. While thinking about its future and trying to write a few things for the new web site, I found it more and more awkward to pretend that my company and I were separate entities. There’s only me in the company. It’s always been that way. I’ve known a few people I could rely on if I got too busy or somehow ended up with assignments with requirements I couldn’t meet, and at some point in the future there might be more people in the company, but for the time being, it’s just me. Realising this, and not pretending or attempting to create the illusion that it’s something it’s not, makes this whole thing more straightforward.

So, instead of spending a lot of time writing content for a new website, I’ll try to see if a simple blog will serve me well. Welcome.

Not an April fool’s joke

Today marks the beginning of my second month working for Rackspace.

I’ve realised I haven’t actually blogged about my leaving Canonical, so this post doubles as an announcement about that, I suppose.

A lot of thought was put into that decision. Ubuntu is an awesome project to work on and Canonical was a fun and interesting “place” to work, but “all good things must come to an end” so I decided to “quit while I was ahead”. Come up with more clichés if you feel like it. The short story is that I just wasn’t having much fun anymore.

Rackspace came along as an interesting option. I’ve known about them since forever, and they are doing very interesting stuff in the cloud computing area, so it seemed like a natural progression. I had a few interviews and after we overcame some initial difficulties (they’re not that used to having people from Denmark work for them) I started my new job working on Cloud Sites on March 1st.

This does not mean that I’m going to stop working on Ubuntu, though. It’ll just be on my own time and working on a narrower set of things than I have for a while. I also hope to be at UDS (I’ve applied for sponsorship) so that I can meet all my awesome, old colleagues.