Seeing as the election for the OpenStack Project Policy Board is going on, it seems only fair to announce that I will soon no longer be working for Rackspace. Instead, I will be working (still on OpenStack) for Nebula. If this is material to your vote, I apologise for not disclosing this earlier, but it simply wasn’t finalised until a bit earlier this week.
I’d like to take a couple of minutes of your time to talk about testing of OpenStack. Swift has always had very good test coverage, and Glance also does pretty well, so I’ll mostly be focused on Nova.
(Psst… If you can’t be bothered to read the whole thing, just skip down to the “how you can help” section.)
Unit tests are by far the easiest to run. They’re right there in the development tree, a simple
./run_tests.sh away. You don’t need a complicated hardware setup, just a source code checkout.
They each exercise a small portion of the code in isolation to verify that they live up to their “contract”. More often than not, this contract is implicit. There’s no documentation of its input, output, or side effects, and maybe there doesn’t have to be. In many cases things get split up simply for readability reasons (larger routines that have grown out of control get split into smaller chunks) or to ease testing, so they’re not actually written expecting to be called from anywhere else. Documentation for all these things would be *awesome*, but a unit test should be the minimum required.
Unit tests are great. However, verifying that each piece of the puzzle does what it says on the tin is of little use if putting them all together doesn’t actually do what you set out to achieve. This is where we use functional tests. An example might be verifying that when you invoke a frontend API method that is supposed to start a virtual machine, a virtual machine actually ends up getting started in a mock hypervisor with all the correct things having been put in place along the way.
In my experience, almost every time an issue is caught by this type of test, it’s an indication that the unit tests are either wrong (e.g. when X goes into a particular routine, it checks that Y comes out, but for everything else to work, Z was actually supposed to come out) or don’t test all the edge cases. So, while a failure at this level should probably involve fixing up (or adding new) unit tests, these tests are indispensable. They verify the cooperation between the various internals, which is easy to miss when staring at each tiny little part in isolation (particularly in a piece of software like Nova that is full of side effects).
(In Nova, functional and unit tests all live in the same test suite)
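To make the distinction concrete, here is a minimal sketch of a functional test in the spirit of the "start a virtual machine in a mock hypervisor" example. Every name in it (FakeHypervisor, ComputeAPI, run_instance) is hypothetical and invented for illustration; none of it is Nova's actual code.

```python
import unittest


class FakeHypervisor:
    """Hypothetical stand-in for a real hypervisor driver; it only records calls."""

    def __init__(self):
        self.running = {}

    def spawn(self, name, image):
        self.running[name] = image


class ComputeAPI:
    """Hypothetical frontend API (not Nova's) that delegates to a hypervisor."""

    def __init__(self, hypervisor):
        self.hypervisor = hypervisor

    def run_instance(self, name, image):
        if not image:
            raise ValueError("an image is required")
        self.hypervisor.spawn(name, image)
        return name


class RunInstanceFunctionalTest(unittest.TestCase):
    def test_run_instance_reaches_hypervisor(self):
        hypervisor = FakeHypervisor()
        api = ComputeAPI(hypervisor)
        api.run_instance("vm-1", "natty-uec")
        # What matters is the end-to-end effect, not any single helper's return value.
        self.assertEqual(hypervisor.running, {"vm-1": "natty-uec"})
```

A unit test would instead poke at one of these pieces in isolation, for example checking that run_instance rejects an empty image without ever touching the hypervisor.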
Unit and black box tests are great, but what end users see is what really matters. If someone deploys all the various OpenStack components and puts them together and something ultimately doesn’t work, we’ve failed. It’s all been futile.
Integration tests are often the easiest to write. When dealing with internals, it’s easy to punt on a lot of things like “should this method take this or that as an argument?,” “ideally, this db call shouldn’t live here, but it’ll have to do for now,” etc., but when it comes to what the end user sees, everything must have an answer. We can’t not have firm, concrete, simple, long-lived answers to questions like: “If I want to start a virtual machine, what do I do?,” “which argument comes first for this API call?,” etc. Hence, writing tests that start a virtual machine and then later make sure that it started properly is rather forgiving. It’s also reassuring to end users to know that their exact use cases are verified to work.
Again, ideally nothing should ever be caught here. If something is, it means that it slipped through a crack left by both the unit tests and black-box tests, or maybe the real KVM doesn’t act like we expected when we wrote its mock counterpart. Everything caught here should end up in a unit test somewhere once the culprit has been found.
Where do we stand today?
Unit and functional tests
As mentioned, Nova’s source tree includes a test suite, comprising both unit and functional tests. We have a Jenkins job that tracks how much of Nova is being exercised by the test suite. At the time of this writing, we have around 74% coverage. Bear in mind that a particular line counts as covered if it is exercised by either a unit test or a functional test (or both, of course). At our last design summit, we agreed that we’d work on improving this coverage, but clearly there’s a long way to go (that number should be in the (very) high nineties).
As for integration tests, there are a number of separate efforts:
- Nova’s own smoketests
- Very likely even more.. (please let me know if you have something similar running somewhere)
Where are we going? (a.k.a. how you can help)
Unit and functional tests
I think this is easily where we have the most work to do. Jenkins keeps track of what is covered and what isn’t:
There’s clearly lots of room for improvement. I’d like to encourage anyone who cares about QA to grab a random bit of code that isn’t yet covered by tests and add a test for it. Feel free to start with anything small and seemingly insignificant. We need to get the ball rolling. Small changes also make the review easier.
I’ve started going through our coverage report and filing bugs about missing unit tests. Some are just a few simple statements that need tests, others are entire modules that are almost testless. Take a look and feel free to get in touch if you need help getting started.
Over the next month or so, we’re hoping to collect all these efforts (and any others out there, so please let me know!) into one. The goal is to have a common set of tests that we can run against an OpenStack installation (i.e. all the various components that make up an actual deployment) to get early warning if something should break in a particular configuration. So, if you have anything set up to automatically test OpenStack, please get in touch. If there’s a particular configuration you care about, we want to make sure we don’t break it, so we need your help finding a good way to deploy bleeding edge OpenStack code onto your test installation and run a bunch of tests against it.
We use PPA’s quite heavily in OpenStack. Each of the core projects has a trunk PPA and a milestone-proposed PPA. Every commit to our bzr trunk branch results in an upload to the trunk PPA, and every commit to our milestone-proposed bzr branch results in an upload to (you guessed it) the milestone-proposed PPA. Additionally, we have a common openstack-release PPA for each of our major releases, where we combine all the projects into one PPA, for simpler distribution.
This poses a number of challenges.
We support every Ubuntu release since Lucid, but most of them lack new enough versions of various stuff (and in some cases, the packages are missing altogether). This means we backport a bunch of things to the various trunk PPA’s, and at the right moments we need to copy all these dependencies either from the trunk PPA to the milestone-proposed PPA (when we branch off for a new milestone) or from the milestone-proposed PPA to the common release PPA (at final release time).
This used to involve a lot of mucking around with Launchpad’s web UI which is not only boring and tedious (checking half a bajillion boxes is even less fun than it sounds), but also error prone, since it’s all manual.
I decided to write a number of tools to help make this simpler. So far, these tools are:
Simply copies a package from one PPA to another.
This one takes a number of PPA’s as arguments, and finds packages that exist in more than one of them, but at different versions. During the development cycle, this is not much of a problem since most people only run the trunk version of a single project, but when we shove them all together in one great, big PPA, it could mean that one of the projects suddenly is being run against another version of one of its dependencies than during the dev cycle.
This one takes all the packages from one PPA and copies them to another and removes stuff from the destination PPA that’s been removed from the source PPA. It’s handy if you have a PPA with all your stuff in it, it’s all been QA’ed together and is in good shape, and you want to sync it all over into a “stable” PPA in one fell swoop.
Lists the contents of a PPA. Simple as that.
I’ve branched lp:ubuntu-archive-tools and added these tools to lp:~openstack-release/ubuntu-archive-tools/openstack. I can’t really decide if I think they belong in lp:ubuntu-archive-tools, but if someone else wants them I can look into getting them merged back.
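The logic behind the version-conflict check and the sync tool can be sketched in a few lines. This is not the code from the branch above: the function names, the input shape (a plain dict per PPA) and the package versions are all made up for illustration, and the real tools talk to Launchpad rather than to dictionaries.

```python
from collections import defaultdict


def find_version_conflicts(ppas):
    """Return {package: {ppa_name: version}} for packages whose versions differ.

    `ppas` maps a PPA name to a {package: version} dict. That input shape is an
    assumption made for this sketch.
    """
    seen = defaultdict(dict)
    for ppa_name, packages in ppas.items():
        for package, version in packages.items():
            seen[package][ppa_name] = version
    return {
        package: versions
        for package, versions in seen.items()
        if len(set(versions.values())) > 1
    }


def plan_sync(source, destination):
    """Return (to_copy, to_remove) needed to make `destination` mirror `source`."""
    to_copy = {pkg: ver for pkg, ver in source.items() if destination.get(pkg) != ver}
    to_remove = sorted(set(destination) - set(source))
    return to_copy, to_remove


# Invented example data: two trunk PPAs carrying different eventlet backports.
ppas = {
    "nova-trunk": {"python-eventlet": "0.9.12", "nova": "2011.2"},
    "glance-trunk": {"python-eventlet": "0.9.13", "glance": "2011.2"},
}
print(find_version_conflicts(ppas))
```

Only python-eventlet is reported here, since it is the only package present in both PPAs at different versions.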
I got good feedback on last week’s post about the stuff I’d achieved in OpenStack, so I figured I’d do the same this week.
We left the hero of our tale (that would be me (it’s my blog, I can entitle myself however I please)) last Friday somewhat bleary eyed, hacking on a mountall patch that would more gracefully handle SIGPIPE caused by Plymouth going the way of the SIGSEGV. I got the ever awesome Scott James Remnant to review it and he (rightfully) told me to fix it in Plymouth instead. My suggested patch was much more of a workaround than a fix, but I wasn’t really in the mood to deal with Plymouth. Somehow, I had just gotten it into my head that fixing it in Plymouth would be extremely complicated. That probably had to do with the fact that I’d forgotten about MSG_NOSIGNAL for a little bit, and I imagined fixing this problem without MSG_NOSIGNAL would probably mean rewriting a bunch of I/O routines which I certainly didn’t have the brain power for at the time. Nevertheless, a few attempts later, I got it worked out. I sent it upstream, but it seems to be stuck in the moderation queue for now.
I spent almost a day and a half wondering why some of our unit tests were failing “randomly”. It only happened every once in a while, and every time I tried running it under e.g. strace, it worked. It had “race condition” written all over it. After a lot of swearing, rude gestures and attempts to nail down the race condition, I finally noticed that it only failed if a randomly generated security group name in the test case sorted earlier than “default”, which it would do about 20% of the time. We had recently fixed DescribeSecurityGroups to return an ordered result set, which broke an assumption in this test case. Extremely annoying. My initial proposed fix was a mere 10 characters. It ended up slightly larger, but the resulting code was easier on the eyes.
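The shape of that bug is easy to reproduce. The sketch below is not the actual test case or the actual fix; describe_security_groups is a hypothetical stand-in that simply returns names in sorted order, which is enough to show why a check that relies on position only fails for some random names.

```python
def describe_security_groups(names):
    """Hypothetical stand-in for an API call that returns groups sorted by name."""
    return sorted(names)


def fragile_check(random_name):
    groups = describe_security_groups(["default", random_name])
    # Assumes "default" always comes first: only true if random_name sorts later.
    return groups[0] == "default"


def robust_check(random_name):
    groups = describe_security_groups(["default", random_name])
    # Compares the groups as a set, so the ordering no longer matters.
    return set(groups) == {"default", random_name}


print(fragile_check("zebra"), fragile_check("alpha"))
print(robust_check("zebra"), robust_check("alpha"))
```

The fragile check passes for "zebra" and fails for "alpha", while the robust one passes for both.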
Log file handling has been a bit of an eyesore in Nova since The Big Eventlet Merge™. Since then, the Ubuntu packages have simply piped stdout and stderr to a log file and restarted the workers when the log files needed rotating. I finally got fed up with this and resurrected the logdir option and after one futile attempt, I got the log files to rotate without even reloading the workers. Sanity restored.
With all this done, I could now reliably run all the instances I wanted. However, I’d noticed that they’d all be run sequentially. Our workers, while built on top of eventlet, were single-threaded. They could only handle one RPC call at a time. This meant that if the compute worker was handling a long request (e.g. one that involved downloading a large image, and postprocessing it with copy-on-write disabled), another user just wanting to look at their instance’s console output might have to wait minutes for that request to be served. This was causing my tests to take forever to run, so a’fixin’ I went. This means that each worker can now (theoretically) handle 1024 (or any other number you choose) requests at a time.
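The difference between a worker that handles one request at a time and one that handles several can be demonstrated with a small experiment. Note the swap: this sketch uses ordinary threads from Python's standard library as a stand-in for eventlet's green threads, and handle_two_requests is a made-up name, so it illustrates the idea rather than the change that went into Nova.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_two_requests(pool_size, timeout=1.0):
    """Run two "requests" that each wait for the other; report whether both finished.

    The barrier is only crossed if both requests are in flight at the same time,
    so a pool of size 1 cannot complete them.
    """
    barrier = threading.Barrier(2)

    def request():
        try:
            barrier.wait(timeout=timeout)
            return True
        except threading.BrokenBarrierError:
            return False

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(request) for _ in range(2)]
        return all(future.result() for future in futures)
```

Calling handle_two_requests(1) returns False once the first request times out waiting for a second one that never got to start, whereas handle_two_requests(2) returns True.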
To test this, I cranked up the concurrency of my tests so that up to 6 instances could be started at the same time on each host. This worked about 80% of the time. The remaining 20% of instances would entirely fail to be spawned. As could have been predicted, this was a new race condition that was uncovered because we suddenly had actual concurrency in the RPC workers. This time, iptables-restore would fail when trying to run multiple instances at the exact same time. I’ve been wanting to rework our iptables handling for a looong time anyway, so this was a great reason to get to work on that. By 2 AM between Friday and Saturday, I still wasn’t quite happy with it, so you’ll have to read the next post in this series to know how it all worked out.
With OpenStack’s second release safely out the door last week, we’re now well on our way towards the next release, due out in April. This release will be focusing on stability and deployability.
To this end, I’ve set up a Jenkins box that runs a bunch of tests for me. I’ve used Jenkins before, but never in this (unintentional TDD) sort of way and I’d like to share how it’s been useful to me.
I have three physical hosts. One runs Lucid, one runs Maverick, and one runs Natty. I’ve set them up as slaves of my Jenkins server (which runs separately on a cloud server at Rackspace).
I started out by adding a simple install job. It would blow away existing configuration and install afresh from our trunk PPA, create an admin user, download the Natty UEC image and upload it to the “cloud”. This went reasonably smoothly.
Then I started exercising various parts of the EC2 API (which happens to be what I’m most fluent in). I would:
- create a keypair (euca-create-keypair),
- find the image id (euca-describe-images with a bit of grep),
- run an instance (euca-run-instances),
- wait for it to go into the “running” state (euca-describe-instances),
- open up port 22 in the default security group (euca-authorize),
- find the ip (euca-describe-instances),
- connect to the guest and run a command (ssh),
- terminate the instance (euca-terminate-instances),
- close port 22 in the security group again (euca-revoke),
- delete the keypair (euca-delete-keypair).
I was using SQLite as the data store (the default in the packages) and it was known to have concurrency issues (it would timeout attempting to lock the DB), so I wrapped all euca-* commands in a retry loop that would try everything up to 10 times. This was good enough to get me started.
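A retry wrapper of that kind might look roughly like the sketch below. The original was presumably a shell loop inside the Jenkins job, so this Python version, along with the names retry and run_command, is purely illustrative.

```python
import subprocess
import time


def retry(action, attempts=10, delay=1.0):
    """Call `action` until it succeeds or `attempts` tries have been used up."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of tries: let the last error surface
            time.sleep(delay)


def run_command(argv):
    """Run a command, raising CalledProcessError on a non-zero exit status."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


# Example use (needs euca2ools installed and configured, so it is left commented out):
# output = retry(lambda: run_command(["euca-describe-instances"]))
```

Retrying blindly like this hides the underlying problem rather than fixing it, which is why it was only ever meant as a stopgap.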
So, pretty soon I would see instances failing to start. However, once Jenkins was done with them, it would terminate them, and I didn’t have anything left to use for debugging. I decided to add the console log to the Jenkins output, so I just added a call to euca-get-console-output. The console logs revealed that every so often, instances would fail to get an IP from dnsmasq. The syslog had a lot of entries from dnsmasq refusing to hand out the IP that Nova asked it to, because it already belonged to someone else. Clearly, Nova was recycling IP’s too quickly. I read through the code that was supposed to handle this several times, and it looked great. I tried drawing it on my whiteboard to see where it would fall through the cracks. Nothing. Then I tried logging the SQL for that specific operation, and it looked just fine. It wasn’t until I actually copied the SQL from the logs and ran it in sqlite3’s CLI that I realised it would recycle IP’s that had just been leased. It took me hours to realise that sqlite didn’t compare these as timestamps, but as strings. They were formatted slightly differently, so it would almost always match. An 11 character patch later, this problem was solved. 1½ days of work. -11 characters. That’s about -1 character an hour. Rackspace is clearly getting their money’s worth having me work for them. I could do this all day!
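The string-versus-timestamp trap is easy to reproduce with Python's built-in sqlite3 module. The two timestamp formats below are assumptions chosen to show the effect; the post doesn't say exactly how the real values differed or what the patch changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two spellings of a timestamp; the formats are assumed for illustration.
earlier = "2011-01-14T18:00:00"  # ISO 8601 with a "T" separator
later = "2011-01-14 19:00:00"    # one hour later, with a space separator

# A plain comparison is a text comparison: "T" sorts after " ", so the
# chronologically earlier value is NOT considered smaller.
as_strings = conn.execute("SELECT ? < ?", (earlier, later)).fetchone()[0]

# datetime() normalises both values to the same format before comparing.
as_datetimes = conn.execute(
    "SELECT datetime(?) < datetime(?)", (earlier, later)
).fetchone()[0]

print(as_strings, as_datetimes)
```

The first comparison comes out false (0) even though the first timestamp is an hour earlier, and the second comes out true (1) once both sides are normalised.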
That got me a bit further. Instances would now reliably come up, one at a time. I expanded out a bit, trying to run two instances at a time. This quickly blew up in my face. This time I made do with a 4 character patch. Awesome.
At this point, I’d had so many problems with sqlite locking that I got fed up. I was close to just replacing it with MySQL to get it over with, but then I decided that it just didn’t make sense. Sure, it’s a single file and we’re using it from different threads and different processes, but we’re not pounding on it. They really ought to be able to take turns. It took quite a bit of Googling and wondering, but eventually I came up with a (counting effectively changed lines of code) 4 line patch that would tell SQLAlchemy not to hold connections to sqlite open. Ever. That totally solved it. I was rather surprised, to be honest. I could now remove all the retry loops, and it’s worked perfectly ever since.
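The underlying idea, never keeping a connection around between operations, can be shown with nothing but the standard library. This is a stand-in rather than the SQLAlchemy patch itself: the execute helper and the table are invented for the sketch. In SQLAlchemy, the closest equivalent I know of is passing poolclass=NullPool to create_engine, but whether that is what the patch did is an assumption.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def execute(db_path, sql, params=()):
    """Open a connection, run one statement, commit, and close again."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            return conn.execute(sql, params).fetchall()


# Throwaway database file for the demonstration.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
execute(db_path, "CREATE TABLE instances (name TEXT)")
execute(db_path, "INSERT INTO instances VALUES (?)", ("vm-1",))
print(execute(db_path, "SELECT name FROM instances"))
```

Because no connection outlives a single call, no idle connection is left holding a lock on the file while other threads or processes wait their turn.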
So far, so good. Then I decided to try to go even more aggressive. I would let the three boxes all target a single one, so they’d all three run as clients against the same single-box “cloud”. I realised that because I used private addressing, I had to expand my tests and use floating ip’s to be able to reach VM’s from another box. Having done so, I realised that this didn’t work on the box itself. A 4 line patch (really only 2 lines, but I had to split them for pep8 compliance) later, and I was ready to rock and roll.
It quickly turned out that, as I had suspected, my 4 character patch earlier wasn’t broad enough, so I expanded a bit on that (4 lines modified).
Today, though, I found that a surprising number of VM’s were failing to boot, ending up with the dreaded:
    General error mounting filesystems.
    A maintenance shell will now be started.
    CONTROL-D will terminate this shell and reboot the system.
    Give root password for maintenance
    (or type Control-D to continue):
I tried changing the block device type (we use virtio by default, so I tried ide and scsi), I tried not using copy-on-write images, I tried disabling any code that would touch the images. Nothing worked. I blamed the kernel, I blamed qemu, everything. I replaced everything, piece by piece, and it still failed quite often. After a long day of debugging, I ended up looking at mountall. It seems Plymouth often segfaults in these settings (where the only console is a serial port), and when it does, mountall dies, killed by SIGPIPE. A 5 line (plus a bunch of comments) patch to mountall, that is still pending review, and I can now run hundreds of VM’s in a row and (5-10-ish) in parallel with no failures at all.
So, in the future, Jenkins will provide me with a great way to test drive and validate my changes, making sure that I don’t break anything, but right now, I’m extending the tests, discovering bugs and fixing them as I extend the test suite, very test-driven-development-y. It’s quite nice. At this rate, I should have pretty good test coverage pretty soon and be able to stay confident that things keep working.
I also think it’s kind of cool how much of a difference this week has made in terms of stability of the whole stack, when only 19 lines of code have been touched.
In my last blog post I said that I had moved my backups from an external disk to Rackspace Cloud Files and promised I’d explain how.
Ok, so why bother? I had about 100 GB of data that was being backed up. I didn’t want to upload 99% of that, have my wifi go bonkers, and then have to start over (because Duplicity apparently isn’t very good at resuming). So, instead I wanted to make the initial backup to an external drive (the backup wouldn’t fit on my laptop’s hard drive) and defer copying it to Rackspace as time and connectivity permitted.
That was simple enough.
Once the first, full backup was made, I wanted incremental backups to go directly to Cloud Files, so I needed to get Deja-Dup to realise that there was already a backup on there.
This was the trickier bit.
When you ask Duplicity to interact with a particular backup location, it calculates a hash of the URI of it and looks that up in its cache to see if it knows about it already. If you’ve made a backup with deja-dup, you can go and look in $HOME/.cache/deja-dup. This is what I had:
    soren@lenny:~$ ls -l $HOME/.cache/deja-dup/
    drwxr-xr-x 2 soren soren 4096 2011-01-14 18:09 4e33cf513fa4772471272dbd07fca5be
    soren@lenny:~$
You see a directory named after the hash of the uri of the backup location I used, namely “file:///media/backup” (the MD5 sum of which is 4e33cf513fa4772471272dbd07fca5be).
Inside this directory, we find:
    soren@lenny:~$ ls -l /home/soren/.cache/deja-dup/4e33cf513fa4772471272dbd07fca5be/
    -rw------- 1 soren soren 750938885 Jan 14 15:47 duplicity-full-signatures.20110113T170937Z.sigtar.gz
    -rw------- 1 soren soren 653487 Jan 14 15:47 duplicity-full.20110113T170937Z.manifest
    soren@lenny:~$
It contains a manifest and a signature file. These files in there have no record of the backup location. That information exists only in the name of the directory. Essentially, all I needed to do was to rename the directory to match the Cloud Files location. Being a bit cautious, I decided to copy it instead. The URI for a container on Cloud Files looks like “cf+http://containername”. Knowing this, it was as simple as:
    soren@lenny:~$ echo -n 'cf+http://lenny' | md5sum
    2f66137249874ed1fdc952e9349912d4  -
    soren@lenny:~$ cd $HOME/.cache/deja-dup
    soren@lenny:~/.cache/deja-dup$ cp -r 4e33cf513fa4772471272dbd07fca5be 2f66137249874ed1fdc952e9349912d4
The -n option to echo is essential. Without it, I’d have been calculating the MD5 sum of the URI with a trailing newline.
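For those who would rather not trust echo, the same calculation can be done in Python. This is just a sketch of the hashing step as described above; cache_dir_name is a made-up helper, not something Duplicity or Deja-Dup provides.

```python
import hashlib


def cache_dir_name(uri):
    """Return the MD5 hex digest of a backup URI, as described in the post."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()


print(cache_dir_name("cf+http://lenny"))
# A trailing newline (what echo without -n would add) yields a different name.
print(cache_dir_name("cf+http://lenny\n"))
```

The two digests differ, which is exactly why a stray newline would have left the cache directory under a name Duplicity never looks for.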
Before I ran deja-dup again, I made sure the two files above were copied to Cloud Files. If I hadn’t, the first time duplicity would talk to Cloud Files, it would realise that these files don’t exist on the expected backup location, hence the local cache of them must be invalid, so it would delete them. This happened to me the first time, so making a copy rather than just renaming the directory turned out to be a good idea.
All that was left to do now was to change my backup location in Deja-Dup. This should be simple enough, so I won’t go into detail about that.
The best part about this, I think, is that it wasn’t until 5-6 days later that my upload of the initial full backup finished. However, in the meantime, I was able to do incremental backups just fine, because all it needs to do that is the signature files from the previous runs.
Oh, and to actually upload the files, I used the “st” tool from Swift. Something like this:
    soren@lenny:~$ cd /media/backup
    soren@lenny:/media/backup$ st -A https://auth.api.rackspacecloud.com/v1.0 -U soren -K 6e6f742061206368616e636521212121 upload lenny *