With OpenStack’s second release safely out the door last week, we’re now well on our way towards the next release, due out in April. This release will be focusing on stability and deployability.
To this end, I’ve set up a HudsonJenkins box that runs a bunch of tests for me. I’ve used Jenkins before, but never in this (unintentional TDD) sort of way and I’d like to share how it’s been useful to me.
I have three physical hosts. One runs Lucid, one runs Maverick, and one runs Natty. I’ve set them up as slaves of my Hudson server (which runs separately on a cloud server at Rackspace).
I started out by adding a simple install job. It would blow away existing configuration and install afresh from our trunk PPA, create an admin user, download the Natty UEC image and upload it to the “cloud”. This went reasonably smoothly.
Then I started exercising various parts of the EC2 API (which happens to be what I’m most fluent in). I would:
- create a keypair (euca-create-keypair),
- find the image id (euca-describe-images with a bit of grep),
- run an instance (euca-run-instances),
- wait for it to go into the “running” state (euca-describe-instances),
- open up port 22 in the default security group (euca-authorize),
- find the ip (euca-describe-instances),
- connect to the guest and run a command (ssh),
- terminate the instance (euca-terminate-instances),
- close port 22 in the security group again (euca-revoke),
- delete the keypair (euca-delete-keypair),
I was using SQLite as the data store (the default in the packages) and it was known to have concurrency issues (it would timeout attempting to lock the DB), so I wrapped all euca-* commands in a retry loop that would try everything up to 10 times. This was good enough to get me started.
So, pretty soon I would see instances failing to start. However, once Jenkins was done with them, it would terminate them, and I didn’t have anything left to use for debugging. I decided to add the console log to the Jenkins output, so I just added a call to euca-get-console-output. They revealed that every so often, they’d fail to get an IP from dnsmasq. The syslog had a lot of entries from dnsmasq refusing to hand out the IP that Nova asked it to, because it already belonged to someone else. Clearly, Nova was recycling IP’s too quickly. It read through the code that was supposed to handle this several times, and it looked great. I tried drawing it on my whiteboard to see where it would fall through the cracks. Nothing. Then I tried logging the SQL for that specific operation, and it looked just fine. It wasn’t until I actually copied the sql from the logs and ran it in sqlite3’s CLI that I realised it would recycle IP’s that had just been leased. It took me hours to realise that sqlite didn’t compare these as timestamps, but as strings. They were formatted slightly differently, so it would almost always match. An 11 character patch later, this problem was solved. 1½ days of work. -11 characters. That’s about -1 character an hour. Rackspace is clearly getting their money’s worth having me work for them. I could do this all day!
That got me a bit further. Instances would now reliably come up, one at a time. I expanded out a bit, trying to run two instances at a time. This quickly blew up in my face. This time I made do with a 4 character patch. Awesome.
At this point, I’d had too many problems with sqlite locking that I got fed up. I was close to just replacing it with MySQL to get it over with, but then I decided that it just didn’t make sense. Sure, it’s a single file and we’re using it from different threads and different processes, but we’re not pounding on it. They really ought to be able to take turns. It took quite a bit of Googling and wondering, but eventually I came up with a (counting effectively changed lines of code) 4 line patch that would tell SQLAlchemy to don’t hold connections to sqlite open. Ever. That totally solved it. I was rather surprised, to be honest. I could now remove all the retry loops, and it’s worked perfectly ever since.
So far, so good. Then I decided to try to go even more agressive. I would let the three boxes all target a single one, so they’d all three run as clients against the same single-box “cloud”. I realised that because I used private addressing, I had to expand my tests and use floating ip’s to be able to reach VM’s from another box. Having done so, I realised that this didn’t work on the box itself. A 4 line patch (really only 2 lines, but I had to split them for pep8 compliance) later, and I was ready to rock and roll.
It quickly turned out that, as I had suspected, my 4 character patch earlier wasn’t broad enough, so I expanded a bit on that (4 lines modified).
Today, though, I found that surprising amount of VM’s were failing to boot, ending up with the dreaded:
General error mounting filesystems. A maintenance shell will now be started. CONTROL-D will terminate this shell and reboot the system. Give root password for maintenance (or type Control-D to continue):
I tried changing the block device type (we use virtio by default, so I tried ide and scsi), I tried not using copy-on-write images, I tried disabling any code that would touch the images. Nothing worked. I blamed the kernel, I blamed qemu, everything. I replaced everything, piece by piece, and it still failed quite often. After a long day of debugging, I ended looking at mountall. It seems Plymouth often segfaults in these settings (where the only console is a serial port), and when it does, mountall dies, killed by SIGPIPE. A 5 line (plus a bunch of comments) patch to mountall, that is still pending review, and I can now run hundreds of VM’s in a row and (5-10-ish) in parallel with no failures at all.
So, in the future, Jenkins will provide me with a great way to test drive and validate my changes, making sure that I don’t break anything, but right now, I’m extending the tests, discovering bugs and fixing them as I extend the test suite, very test-driven-development-y. It’s quite nice. At this rate, I should have pretty good test coverage pretty soon and be able to stay confident that things keep working.
It also think it’s kind of cool how much of a difference this week has made in terms of stability of the whole stack and only 19 lines of code have been touched.