Category Archives: Cloud computing


I had some stuff I needed to run somewhere and had a bit of a hard time working out where I should put it. I figured I’m probably not alone with these doubts, so I decided to put up this new thing and add a bit of clarity to this jungle.

It’s still brand new, only includes information from AWS, Rackspace and HP Cloud, only looks at a small amount of features and doesn’t include any benchmarks yet, but I figured I’d share it sooner rather than later.

BasicDB – API completeness

I’m really excited to share that BasicDB now supports all the same queries as Amazon’s SimpleDB. SimpleDB’s API failure responses are much, much richer than BasicDB’s, but all succesful queries should yield responses indistinguishable from SimpleDB’s.

Now that it’s all implemented and tests are in place to ensure it doesn’t break, I can start cleaning up various hacks that I’ve made along the way and I can start optimizing my use of Riak. This should be fun :)

Nova scheduling

I was very happy to notice this summit session proposal by Mike Wilson for the OpenStack summit in Hong Kong. Its title is “Rethinking scheduler design” and the body reads:

Currently the scheduler exhaustively searches for an optimal solution to requirements given in a provisioning request. I would like to explore breaking down the scheduler problem in to less-than-optimal, but “good enough” answers being given. I believe this approach could deal with a couple of current problems that I see with the scheduler and also move us towards a generic scheduler framework that all of OpenStack can take advantage of:

-Scheduling requests for a deployment with hundreds of nodes take seconds to fulfill. For deployments with thousands of nodes this can be minutes.

-The global nature of the current method does not lend itself to scale and parallelism.

-There are still features that we need in the scheduler such as affinity that are difficult to express and add more complexity to the problem.

Finally. Someone gets it.

My take on this is the same as it was a couple of years ago. Yes, the scheduler is “horizontally scalable” in the sense that you can spin up N of them and have the load automatically spread evenly across them, but — as Mike points out — the problem each of them is solving grows significantly as your deployment grows. Hundreds of nodes isn’t a lot. At all. Spending seconds on a simple scheduling decision is not near good enough.

Get rid of the scheduler and replace it with a number of work queues that distribute resource requests to nodes with spare capacity. I don’t care about optimal placement. I care about placement that will suffice. Even if I did, the metrics that the current scheduler takes into account aren’t sufficient to identify “optimal placement” anyway.

Someone is inevitably going to complain that some of the advanced scheduling options don’t lend themselves well to this “scheduling” mechanism. Well.. Tough cookies. If “advanced scheduling options” prevents us from scaling beyond a few hundred nodes, the problem is “advanced scheduling options”, not the scheduling mechanism. If you never expect to grow beyond a few hundred nodes and you’re happy with scheduling decisions taking a couple of seconds, that’s great. The rest of us who are building big, big deployments need something that’ll scale.

BasicDB – An update

It’s been a few weeks since I’ve talked about BasicDB. I’ve been on the road, so I haven’t had much time to hack on it, but this evening I managed finish a pretty nice replacement for the previous SQL parsing and subsequent data filtering code. The old code would simply parse (and validate) the semi-SQL provided through the API and return the parsed query as a list of strings. At that point, I had to re-analyze those strings to make sense of them and apply filtering.

The new SQL parser matches the SimpleDB API much more closely in terms of what’s allowed and what isn’t, and turns the WHERE expression into essentially a tree of expressions that can be easily applied to filter items from a domain. Additionally, constructing nice Javascript code for use in the Riak database turned out to be almost as easy.

As an example, an expression like:

colour == 'blue' AND size > '5' OR shape = 'triangular'

becomes something like this:


I can simply call a .match(item) method on the top level object to check if a given item matches. If you’ve written parsers and such before, this may be very basic stuff, but I thought it was really neat :)

The Javascript code generator follows a similar pattern where I call a method on the top level object and it ends up spitting out a javascript expression that checks whether a given item matches the WHERE expression:

((vals['colour'] == 'blue') && ((vals['size'] > '5') || (vals['shape'] == 'triangular')))

Again, this is probably beginner’s stuff for someone who has written parsers and/or code generators before, but I was pretty happy with myself when all the tests all of a sudden passed :)

Introducing BasicDB

Somewhat related to my two recent blog posts about the OpenStack design tenets, I’ve spent a couple of days hacking on BasicDB.

BasicDB is a new project which aims to be feature and API compatible with AWS SimpleDB. I wouldn’t mind at all for it to become an OpenStack project, but I have a hard time finding the motivation to come up with a OpenStacky API when there’s already a perfectly functional one that happens to match AWS SimpleDB. If someone wants to contribute that, that’d be great.

Anyway, it seems I need to clarify a few things with regards to BasicDB and how it relates to Trove.

If you’re familiar with AWS’s services (which you should be… They’re awesome), Trove is equivalent to RDS. It’s a service that simplifies the provisioning and management of a relational data store, typically MySQL (in AWS’s case, it can be MySQL, MS SQL Server or Oracle). So each user can utilize Trove to spin up and manage their own MySQL (or whatever) server instance.

BasicDB, on the other hand, is equivalent to SimpleDB. It exposes a basic API that lets you store data and issue queries for said data. Every user interacts with the same instance of BasicDB and it’s up to the cloud provider to set up and maintain a decent backend store for it. At the moment, there are three: A fake one (stores everything in dicts and sets), a filesystem based one (which might not be an entirely horrible solution if you have cephfs or GlusterFS backing said filesystem) or a Riak based one. The Riak based one is still somewhat naïve in that it doesn’t handle sibling reconciliation *at all* yet. More are likely to come, since they’re pretty simple to add.

OpenStack design tenets – Part 2

In my last post, I talked about how we’ve deviated from our original design tenets. I’d like to talk a bit about how we can move forward.

I guess first of all I should point out that I think the design tenets are sound and we’re doing it wrong by not following them, so the solution isn’t to just throw away the design tenets or replace them with new, shitty ones.

I should also point out that my criticism does not apply to Swift. Swift mostly gets it right.

If we want OpenStack to scale, to be resilient in the face of network failures, etc., we need to start going through the various components and see how they violate the design tenets. It’s no secret that I think our central data store is our biggest embarrassment. I cannot talk about our “distributed architecture” and then go on to talk about our central datastore while keeping a straight face.

I don’t believe there’s anything you can do to MySQL to make it acceptable for our use case. That goes for any flavour of MySQL, including Galera. You can make it “HA” in various ways, but at best you’ll be compartmentalising failures. A network partition will inevitably render a good chunk of your cloud unusable, since it won’t be able to interact completely with its datastore.

What we need is a truly distributed, fault tolerant, eventually consistent data store. Things like Riak and Cassandra spring to mind. And, while we’re at it, I think it’s time we stop dealing with the data store directly from the individual projects and instead abstract it away as a separate service that we can expose publically as well as consume internally. I know this community enjoys defining our own API’s with our own semantics, but I think we’d be doing ourselves a horrible disservice by not taking a good, hard look at AWS’s database services and working out how we can rework our datastore usage to function under the constraints these services impose.

I’m delighted to learn that we also have a queueing service in the works. As awesome as RabbitMQ is(!), it’s still a centralised component. ZeroMQ would probably solve a lot of this as well, but having an actual queueing service that we can expose publically as well as consume internall makes a ton of sense to me.

If we make these changes, that’ll take us a long way. What else do you think we should do?

OpenStack design tenets

Before OpenStack even had a name, it had its basic design tenets. The wiki history reveals that Rick wrote these down as early as May 2010, two months before OpenStack was officially launched. Let’s take a look at them:

  1. Scalability and elasticity are our main goals
  2. Any feature that limits our main goals must be optional
  3. Everything should be asynchronous
    • a) If you can’t do something asynchronously, see #2
  4. All required components must be horizontally scalable
  5. Always use shared nothing architecture (SN) or sharding
    • a) If you can’t Share nothing/shard, see #2
  6. Distribute everything
    • a) Especially logic. Move logic to where state naturally exists.
  7. Accept eventual consistency and use it where it is appropriate.
  8. Test everything.
    • a) We require tests with submitted code. (We will help you if you need it)

Now go and look at every single OpenStack diagram of Nova ever presented. Either they look something like this:

Nova diagram

or they’re lying.

Let’s focus our attention for a minute on the little thing in the middle labeled “nova database”. It’s immediately obvious that this is a shared component. That means tenet 5 (“Always use shared nothing architecture (SN) or sharding“) is out the window.

Back in 2010, the shared database was Redis, but since the redisectomy, it’s been MySQL or PostgreSQL (through SQLAlchemy). MySQL and PostgreSQL are ACID compliant, the very opposite of eventually consistent (bye bye, tenet 7). They’re wicked fast and scale very, very well. Vertically. Adios, tenet 4.

Ok, so what’s the score?

Tenet 1: Scalability and elasticity are our main goals.

Tenet 2: Any feature that limits our main goals must be optional

Tenet 3: Everything should be asynchronous

Tenet 4: All required components must be horizontally scalable

Tenet 5: Always use shared nothing architecture or sharding

Tenet 6: Distribute everything (Especially logic. Move logic to where state naturally exists).

Tenet 7: Accept eventual consistency and use it where it is appropriate.

Tenet 8: Test everything.

Is everything synchronous? Hardly. I see 258 instances of RPC call (synchronous RPC methods) vs. 133 instances of RPC cast (asynchronous RPC methods). How often each are called is anybody’s guess, but clearly there’s a fair amount of synchronous stuff going on. Sayonara, tenet 3.

Is everything distributed? No. No, it’s not. Where does the knowledge of individual compute nodes’s capacity for accepting new instances naturally exist? On the compute node itself. Where is the decision made about which compute node should run a new instance? In nova-scheduler. Sure, the scheduler is actually a scale-out internal service in the sense that there could be any number of them, but it’s making decisions on other components’s behalf. Tschüß, tenet 6.

Are we testing everything? Barely. Nova’s most recent test coverage percentage at the time of this writing is 83%. It’s much better than it once was, but there’s still a ways to go up to 100%. Adieu, tenet 8.

We can’t really live without a database, nor a scheduler, so auf wiedersehen tenet 2.

We’re left with:

Tenet 1: Scalability and elasticity are our main goals.

Tenet 2: Any feature that limits our main goals must be optional

Tenet 3: Everything should be asynchronous

Tenet 4: All required components must be horizontally scalable

Tenet 5: Always use shared nothing architecture or sharding

Tenet 6: Distribute everything (Especially logic. Move logic to where state naturally exists).

Tenet 7: Accept eventual consistency and use it where it is appropriate.

Tenet 8: Test everything.

So, the question the remains: With all the above in mind, is scalability and elasticity *really* still our main goals?

Another week of Openstack stabilisation

I got good feedback on last week’s post about the stuff I’d achieved in Openstack, so I figured I’d do the same this week.

We left the hero of our tale (that would be me (it’s my blog, I can entitle myself however I please)) last Friday somewhat bleary eyed, hacking on a mountall patch that would more gracefully handle SIGPIPE caused by Plymouth going the way of the SIGSEGV. I got the ever awesome Scott James Remnant to review it and he (rightfully) told me to fix it in Plymouth instead. My suggested patch was much more of a workaround than a fix, but I wasn’t really in the mood to deal with Plymouth. Somehow, I had just gotten it into my head that fixing it in Plymouth would be extremely complicated. That probably had to do with the fact that I’d forgotten about MSG_NOSIGNAL for a little bit, and I imagined fixing this problem without MSG_NOSIGNAL would probably mean rewriting a bunch of I/O routines which I certainly didn’t have the brain power for at the time. Nevertheless,  a few attempts later, I got it worked out. I sent it upstream, but it seems to be stuck in the moderation queue for now.

I spent almost a day and a half wondering why some of our unit tests were failing “randomly”. It only happened every once in a while, and every time I tried running it under e.g. strace, it worked. It had “race condition” written all over it. After a lot of swearing, rude gestures and attempts to nail down the race condition, I finally noticed that it only failed if a randomly generated security group name in the test case sorted earlier than “default”, which it would do about 20% of the time. We had recently fixed DescribeSecurityGroups to return an ordered resultset which broke an assumption in this test case. Extremely annoying. My initial proposed fix was a mere 10 characters, but it ended up slightly larger, but the resulting code was easier on the eyes.

Log file handling has been a bit of an eye sore in Nova since The Big Eventlet Merge™. Since then, the Ubuntu packages have simply piped stdout and stderr to a log file and restartet the workers when the log files needed rotating. I finally got fed up with this and resurrected the logdir option and after one futile attempt, I got the log files to rotate without even reloading the workers. Sanity restored.

With all this done, I could now realiably run all the instances I wanted. However, I’d noticed that they’d all be run sequentially. Our workers, while built on top of eventlet, were single-threaded. They could only handle one RPC call at a time. This meant that if the compute worker was handling a long request (e.g. one that involved downloading a large image, and postprocessing it with copy-on-write disabled), another user just wanting to look at their instance’s console output might have to wait minutes for that request to be served. This was causing my tests to take forever to run, so a’fixin’ I went. This means that each worker can now (theoretically) handle 1024 (or any other number you choose) requests at a time.

To test this, I cranked up the concurrency of my tests so that up to 6 instances could started at the same time on each host. This worked about 80% of the time. The remaining 20% instances would entirely fail to be spawned. As could have been predicted, this was a new race condition that was uncovered because we suddenly had actual concurrency in the RPC workers. This time, iptables-restore would fail when trying to run multiple instances at the exact same time. I’ve been wanting to rework our iptables handling for a looong time anyway, so this was a great reason to get to work on that. By 2 AM between Friday and Saturday, I still wasn’t quite happy with it, so you’ll have to read the next post in this series to know how it all worked out.

A week into OpenStack’s third release cycle…

With OpenStack’s second release safely out the door last week, we’re now well on our way towards the next release, due out in April. This release will be focusing on stability and deployability.

To this end, I’ve set up a HudsonJenkins box that runs a bunch of tests for me. I’ve used Jenkins before, but never in this (unintentional TDD) sort of way and I’d like to share how it’s been useful to me.

I have three physical hosts. One runs Lucid, one runs Maverick, and one runs Natty. I’ve set them up as slaves of my Hudson server (which runs separately on a cloud server at Rackspace).

I started out by adding a simple install job. It would blow away existing configuration and install afresh from our trunk PPA, create an admin user, download the Natty UEC image and upload it to the “cloud”. This went reasonably smoothly.

Then I started exercising various parts of the EC2 API (which happens to be what I’m most fluent in). I would:

  1. create a keypair (euca-create-keypair),
  2. find the image id (euca-describe-images with a bit of grep),
  3. run an instance (euca-run-instances),
  4. wait for it to go into the “running” state (euca-describe-instances),
  5. open up port 22 in the default security group (euca-authorize),
  6. find the ip (euca-describe-instances),
  7. connect to the guest and run a command (ssh),
  8. terminate the instance (euca-terminate-instances),
  9. close port 22 in the security group again (euca-revoke),
  10. delete the keypair (euca-delete-keypair),

I was using SQLite as the data store (the default in the packages) and it was known to have concurrency issues (it would timeout attempting to lock the DB), so I wrapped all euca-* commands in a retry loop that would try everything up to 10 times. This was good enough to get me started.

So, pretty soon I would see instances failing to start. However, once Jenkins was done with them, it would terminate them, and I didn’t have anything left to use for debugging. I decided to add the console log to the Jenkins output, so I just added a call to euca-get-console-output. They revealed that every so often, they’d fail to get an IP from dnsmasq. The syslog had a lot of entries from dnsmasq refusing to hand out the IP that Nova asked it to, because it already belonged to someone else. Clearly, Nova was recycling IP’s too quickly. It read through the code that was supposed to handle this several times, and it looked great. I tried drawing it on my whiteboard to see where it would fall through the cracks. Nothing. Then I tried logging the SQL for that specific operation, and it looked just fine. It wasn’t until I actually copied the sql from the logs and ran it in sqlite3’s CLI that I realised it would recycle IP’s that had just been leased. It took me hours to realise that sqlite didn’t compare these as timestamps, but as strings. They were formatted slightly differently, so it would almost always match. An 11 character patch later, this problem was solved. 1½ days of work. -11 characters. That’s about -1 character an hour. Rackspace is clearly getting their money’s worth having me work for them. I could do this all day!

That got me a bit further. Instances would now reliably come up, one at a time. I expanded out a bit, trying to run two instances at a time. This quickly  blew up in my face. This time I made do with a 4 character patch. Awesome.

At this point, I’d had too many problems with sqlite locking that I got fed up. I was close to just replacing it with MySQL to get it over with, but then I decided that it just didn’t make sense. Sure, it’s a single file and we’re using it from different threads and different processes, but we’re not pounding on it. They really ought to be able to take turns. It took quite a bit of Googling and wondering, but eventually I came up with a (counting effectively changed lines of code) 4 line patch that would tell SQLAlchemy to don’t hold connections to sqlite open. Ever. That totally solved it. I was rather surprised, to be honest. I could now remove all the retry loops, and it’s worked perfectly ever since.

So far, so good. Then I decided to try to go even more agressive. I would let the three boxes all target a single one, so they’d all three run as clients against the same single-box “cloud”. I realised that because I used private addressing, I had to expand my tests and use floating ip’s to be able to reach VM’s from another box. Having done so, I realised that this didn’t work on the box itself. A 4 line patch (really only 2 lines, but I had to split them for pep8 compliance) later, and I was ready to rock and roll.

It quickly turned out that, as I had suspected, my 4 character patch earlier wasn’t broad enough, so I expanded a bit on that (4 lines modified).

Today, though, I found that surprising amount of VM’s were failing to boot, ending up with the dreaded:

General error mounting filesystems.
A maintenance shell will now be started.
CONTROL-D will terminate this shell and reboot the system.
Give root password for maintenance
(or type Control-D to continue):

I tried changing the block device type (we use virtio by default, so I tried ide and scsi), I tried not using copy-on-write images, I tried disabling any code that would touch the images. Nothing worked. I blamed the kernel, I blamed qemu, everything.  I replaced everything, piece by piece, and it still failed quite often. After a long day of debugging, I ended looking at mountall. It seems Plymouth often segfaults in these settings (where the only console is a serial port), and when it does, mountall dies, killed by SIGPIPE. A  5 line (plus a bunch of comments) patch to mountall, that is still pending review, and I can now run hundreds of VM’s in a row and (5-10-ish) in parallel with no failures at all.

So, in the future, Jenkins will provide me with a great way to test drive and validate my changes, making sure that I don’t break anything, but right now, I’m extending the tests, discovering bugs and fixing them as I extend the test suite, very test-driven-development-y. It’s quite nice. At this rate, I should have pretty good test coverage pretty soon and be able to stay confident that things keep working.

It also think it’s kind of cool how much of a difference this week has made in terms of stability of the whole stack and only 19 lines of code have been touched. :)

It only took me 20 years..

tl;dr: I now have daily backups of my laptop, powered by Rackspace Cloud Files (powered by Openstack), Deja-Dup, and Duplicity.

I’ve been using computers for a long time. If memory serves, I got my first PC when I was 9, so that’s 20 years ago now. At various times, I’ve set up some sort of backup system, but I always ended up

  • annoyed that I couldn’t acutally *use* the biggest drive I had, because it was reserved for backups,
  • annoyed because I had to go and connect the drive and do something active to get backups running, because having the disk always plugged into my system might mean the backup got toasted along with my active data when disaster struck,
  • and annoyed at a bunch of other things.

Cloud storage solves the hardest part of this. With Rackspace Cloud Files, I have access to an infinite[1] amount of storage. I can just keep pushing data, Rackspace keep them safe, and I pay for exactly how much space I’m using. Awesome.

All I need is something that can actually make backups for me and upload them to Cloud Files. I’ve known about Duplicity for a long time, and I also knew that it’s been able to talk to Cloud Files for a while, but I never got into the habit of running it at regular intervals, and running it from cron was annoying, because maybe I didn’t have my laptop on when it wanted to run, and if I wasn’t logged in, by homedir would be encrypted anyway, etc. etc. Lots of chances for failure.

Enter Deja-Dup! Deja-dup is a project spearheaded by my awesome, former colleague at Canonical, Mike Terry. It uses Duplicity on the backend, and gives me a nice, really simple frontend to get it set up. It has its own timing mechanism that runs in my GNOME desktop session. This means it only runs when my laptop is on and I’m logged in. Every once in a while, it checks how long it’s been since my last backup. If it’s more than a day, an icon pops up in the notification area that offers to run a backup. I’ve only been using this for a day, so it’s only asked me once. I’m not sure if it starts on its own if I give it long enough.

A couple of caveats:

  • Deja-dup needs a very fresh version of libnotify, which means you need to either be running Ubuntu Natty, use backported libraries, or patch Deja-dup to work with the version of libnotify in Maverick. I opted for the latter approach.
  • I have a lot of data. Around 100GB worth. Some of it is VM’s, some of it is code, some of it is various media files. Duplicity doesn’t support resuming a backup if it breaks halfway, and I “only” have 8 Mbit/s upstream bandwidth.. That meant I had to stay connected to the Internet for 28 hours straight (in a perfect world) and not have anything unexpected happen along the way. I wasn’t really interested in that, so I made my initial backup to an external drive and I’m now copying the contents of that to Rackspace at my own pace. I can stop and resume at will. The tricky part here was to get Deja-Dup to understand that the backup it thinks is on an external drive really is on Cloud Files. I’ll save that for a separate post.

[1]: Maybe not actually infinite, but infinite enough.