Category Archives: Work

BasicDB – API completeness

I’m really excited to share that BasicDB now supports all the same queries as Amazon’s SimpleDB. SimpleDB’s API failure responses are much, much richer than BasicDB’s, but all succesful queries should yield responses indistinguishable from SimpleDB’s.

Now that it’s all implemented and tests are in place to ensure it doesn’t break, I can start cleaning up various hacks that I’ve made along the way and I can start optimizing my use of Riak. This should be fun :)

Nova scheduling

I was very happy to notice this summit session proposal by Mike Wilson for the OpenStack summit in Hong Kong. Its title is “Rethinking scheduler design” and the body reads:

Currently the scheduler exhaustively searches for an optimal solution to requirements given in a provisioning request. I would like to explore breaking down the scheduler problem in to less-than-optimal, but “good enough” answers being given. I believe this approach could deal with a couple of current problems that I see with the scheduler and also move us towards a generic scheduler framework that all of OpenStack can take advantage of:

-Scheduling requests for a deployment with hundreds of nodes take seconds to fulfill. For deployments with thousands of nodes this can be minutes.

-The global nature of the current method does not lend itself to scale and parallelism.

-There are still features that we need in the scheduler such as affinity that are difficult to express and add more complexity to the problem.

Finally. Someone gets it.

My take on this is the same as it was a couple of years ago. Yes, the scheduler is “horizontally scalable” in the sense that you can spin up N of them and have the load automatically spread evenly across them, but — as Mike points out — the problem each of them is solving grows significantly as your deployment grows. Hundreds of nodes isn’t a lot. At all. Spending seconds on a simple scheduling decision is not near good enough.

Get rid of the scheduler and replace it with a number of work queues that distribute resource requests to nodes with spare capacity. I don’t care about optimal placement. I care about placement that will suffice. Even if I did, the metrics that the current scheduler takes into account aren’t sufficient to identify “optimal placement” anyway.

Someone is inevitably going to complain that some of the advanced scheduling options don’t lend themselves well to this “scheduling” mechanism. Well.. Tough cookies. If “advanced scheduling options” prevents us from scaling beyond a few hundred nodes, the problem is “advanced scheduling options”, not the scheduling mechanism. If you never expect to grow beyond a few hundred nodes and you’re happy with scheduling decisions taking a couple of seconds, that’s great. The rest of us who are building big, big deployments need something that’ll scale.

Introducing BasicDB

Somewhat related to my two recent blog posts about the OpenStack design tenets, I’ve spent a couple of days hacking on BasicDB.

BasicDB is a new project which aims to be feature and API compatible with AWS SimpleDB. I wouldn’t mind at all for it to become an OpenStack project, but I have a hard time finding the motivation to come up with a OpenStacky API when there’s already a perfectly functional one that happens to match AWS SimpleDB. If someone wants to contribute that, that’d be great.

Anyway, it seems I need to clarify a few things with regards to BasicDB and how it relates to Trove.

If you’re familiar with AWS’s services (which you should be… They’re awesome), Trove is equivalent to RDS. It’s a service that simplifies the provisioning and management of a relational data store, typically MySQL (in AWS’s case, it can be MySQL, MS SQL Server or Oracle). So each user can utilize Trove to spin up and manage their own MySQL (or whatever) server instance.

BasicDB, on the other hand, is equivalent to SimpleDB. It exposes a basic API that lets you store data and issue queries for said data. Every user interacts with the same instance of BasicDB and it’s up to the cloud provider to set up and maintain a decent backend store for it. At the moment, there are three: A fake one (stores everything in dicts and sets), a filesystem based one (which might not be an entirely horrible solution if you have cephfs or GlusterFS backing said filesystem) or a Riak based one. The Riak based one is still somewhat naïve in that it doesn’t handle sibling reconciliation *at all* yet. More are likely to come, since they’re pretty simple to add.

OpenStack design tenets – Part 2

In my last post, I talked about how we’ve deviated from our original design tenets. I’d like to talk a bit about how we can move forward.

I guess first of all I should point out that I think the design tenets are sound and we’re doing it wrong by not following them, so the solution isn’t to just throw away the design tenets or replace them with new, shitty ones.

I should also point out that my criticism does not apply to Swift. Swift mostly gets it right.

If we want OpenStack to scale, to be resilient in the face of network failures, etc., we need to start going through the various components and see how they violate the design tenets. It’s no secret that I think our central data store is our biggest embarrassment. I cannot talk about our “distributed architecture” and then go on to talk about our central datastore while keeping a straight face.

I don’t believe there’s anything you can do to MySQL to make it acceptable for our use case. That goes for any flavour of MySQL, including Galera. You can make it “HA” in various ways, but at best you’ll be compartmentalising failures. A network partition will inevitably render a good chunk of your cloud unusable, since it won’t be able to interact completely with its datastore.

What we need is a truly distributed, fault tolerant, eventually consistent data store. Things like Riak and Cassandra spring to mind. And, while we’re at it, I think it’s time we stop dealing with the data store directly from the individual projects and instead abstract it away as a separate service that we can expose publically as well as consume internally. I know this community enjoys defining our own API’s with our own semantics, but I think we’d be doing ourselves a horrible disservice by not taking a good, hard look at AWS’s database services and working out how we can rework our datastore usage to function under the constraints these services impose.

I’m delighted to learn that we also have a queueing service in the works. As awesome as RabbitMQ is(!), it’s still a centralised component. ZeroMQ would probably solve a lot of this as well, but having an actual queueing service that we can expose publically as well as consume internall makes a ton of sense to me.

If we make these changes, that’ll take us a long way. What else do you think we should do?

OpenStack design tenets

Before OpenStack even had a name, it had its basic design tenets. The wiki history reveals that Rick wrote these down as early as May 2010, two months before OpenStack was officially launched. Let’s take a look at them:

  1. Scalability and elasticity are our main goals
  2. Any feature that limits our main goals must be optional
  3. Everything should be asynchronous
    • a) If you can’t do something asynchronously, see #2
  4. All required components must be horizontally scalable
  5. Always use shared nothing architecture (SN) or sharding
    • a) If you can’t Share nothing/shard, see #2
  6. Distribute everything
    • a) Especially logic. Move logic to where state naturally exists.
  7. Accept eventual consistency and use it where it is appropriate.
  8. Test everything.
    • a) We require tests with submitted code. (We will help you if you need it)

Now go and look at every single OpenStack diagram of Nova ever presented. Either they look something like this:

Nova diagram

or they’re lying.

Let’s focus our attention for a minute on the little thing in the middle labeled “nova database”. It’s immediately obvious that this is a shared component. That means tenet 5 (“Always use shared nothing architecture (SN) or sharding“) is out the window.

Back in 2010, the shared database was Redis, but since the redisectomy, it’s been MySQL or PostgreSQL (through SQLAlchemy). MySQL and PostgreSQL are ACID compliant, the very opposite of eventually consistent (bye bye, tenet 7). They’re wicked fast and scale very, very well. Vertically. Adios, tenet 4.

Ok, so what’s the score?

Tenet 1: Scalability and elasticity are our main goals.

Tenet 2: Any feature that limits our main goals must be optional

Tenet 3: Everything should be asynchronous

Tenet 4: All required components must be horizontally scalable

Tenet 5: Always use shared nothing architecture or sharding

Tenet 6: Distribute everything (Especially logic. Move logic to where state naturally exists).

Tenet 7: Accept eventual consistency and use it where it is appropriate.

Tenet 8: Test everything.

Is everything synchronous? Hardly. I see 258 instances of RPC call (synchronous RPC methods) vs. 133 instances of RPC cast (asynchronous RPC methods). How often each are called is anybody’s guess, but clearly there’s a fair amount of synchronous stuff going on. Sayonara, tenet 3.

Is everything distributed? No. No, it’s not. Where does the knowledge of individual compute nodes’s capacity for accepting new instances naturally exist? On the compute node itself. Where is the decision made about which compute node should run a new instance? In nova-scheduler. Sure, the scheduler is actually a scale-out internal service in the sense that there could be any number of them, but it’s making decisions on other components’s behalf. Tschüß, tenet 6.

Are we testing everything? Barely. Nova’s most recent test coverage percentage at the time of this writing is 83%. It’s much better than it once was, but there’s still a ways to go up to 100%. Adieu, tenet 8.

We can’t really live without a database, nor a scheduler, so auf wiedersehen tenet 2.

We’re left with:

Tenet 1: Scalability and elasticity are our main goals.

Tenet 2: Any feature that limits our main goals must be optional

Tenet 3: Everything should be asynchronous

Tenet 4: All required components must be horizontally scalable

Tenet 5: Always use shared nothing architecture or sharding

Tenet 6: Distribute everything (Especially logic. Move logic to where state naturally exists).

Tenet 7: Accept eventual consistency and use it where it is appropriate.

Tenet 8: Test everything.

So, the question the remains: With all the above in mind, is scalability and elasticity *really* still our main goals?

On productivity – Part I

I’ve been trying for literally years to really get Getting Things Done under my skin. I’ve read the book several times, each time gaining new insights and for a while inching towards actually using it. For some reason, I always fail at it. I’ve never really worked out why. It all makes perfect sense. I believe it’s a fantastic system, but I just can’t seem to internalise the process.

This weekend, I stumbled upon a post on Milo Casagrande’s blog where he mentioned the Pomodoro Technique. I’ve always enjoyed reading articles etc. on productivity, but somehow never heard about the Pomodoro Technique.

I read the paper and it’s a delightfully straightforward system. It explains clearly, with examples and everything, how you can use the Pomodoro Technique.

The Getting Things Done book talks a lot about concepts and process in very generic terms and (intentionally) avoids imposing tools on the readers. I understand the motivation, but I think it’s misguided. Whenever I meet someone who practices GTD, I always try to get them to explain as much as possible about the practical implementation, their choice of tools, etc, because without this, it’s hard to really get it started. I’ve spent a *lot* of time trying to find good tools, writing tools, etc., but I always wind up with something that doesn’t really work for me, so I was really excited to learn about another, popular productivity system.

I’m going to try some techniques out this week to see if I can combine GTD and the Pomodoro Technique somehow. GTD outlines some excellent concepts for organising your action items, reference material, keeping track of things you’re waiting for others to complete, as well as some very useful ways to review how well the stuff you’re doing hour by hour aligns with your short, medium, and long term goals, but — for me at least — falls short with respect to helping me actually getting started with something. Ironic, really, for a  system called “Getting Things Done”, but that’s a different story.

Moving on..

Moving on..

Seeing as the election for the OpenStack Project Policy Board is going on, it seems only fair to announce that I quite soon no longer will be working for Rackspace. Instead, I will be working (still on OpenStack) for Nebula. If this is material to your vote, I apologise for not disclosing this earlier, but it simply wasn’t finalised until a bit earlier this week.

All clear!

Testing of OpenStack

I’d like to take a couple of minutes of your time to talk about testing of OpenStack. Swift has always had very good test coverage, and Glance also does pretty well, so I’ll mostly be focused on Nova.

(Psst… If you can’t be bothered to read the whole thing, just skip down to the how you can help section.)

Unit tests

Unit tests are by far the easiest to run. They’re right there in the development tree, a simple ./ away. You don’t need a complicated hardware setup, just a source code checkout.

They each exercise a small portion of the code in isolation to verify that they live up to their “contract”. More often than not, this contract is implicit. There’s no documentation of its input, output, or side effects, and maybe there doesn’t have to be. In many cases things get split up simply for readability reasons (larger routines that have grown out of control get split into smaller chunks) or to ease testing, so they’re not actually written expecting to be called from anywhere else. Documentation for all these things would be *awesome*, but a unit test should be the minimum required.

Functional tests

Unit tests are great. However, verifying that each piece of the puzzle does what it says on the tin is of little use if putting them all together doesn’t actually do what you set out to achieve. This is where we use functional tests. An example might be verifying that when you invoke a frontend API method that is supposed to start a virtual machine, a virtual machine actually ends up getting started in a mock hypervisor with all the correct things having been put in place along the way.

In my experience, almost every time an issue is caught by this type of test, it’s an indication that the unit tests are either wrong (e.g. when X goes into a particular routine, it checks that Y comes out, but for everything else to work, Z was actually supposed to come out)  or don’t test all the edge cases. So, while a failure at this level should probably involve fixing up (or adding new) unit tests, these tests are indispensable. They verify the cooperation between the various internals, which is easy to miss when staring at each tiny little part in isolation (particularly in a piece of software like Nova that is full of side effects).

(In Nova, functional and unit tests all live in the same test suite)

Integration tests

Unit and black box tests are great, but what end users see is what really matters. If someone deploys all the various OpenStack components and put them together and something ultimately doesn’t work, we’ve failed. It’s all been futile.

Integration tests are often the easiest to write. When dealing with internals, it’s easy to punt on a lot of things like “should this method take this or that as an argument?,” “ideally, this db call shouldn’t live here, but it’ll have to do for now,” etc., but when it comes to what the end users sees, everything must have an answer. We can’t not have firm, concrete, simple, long-lived answers to questions like: “If I want to start a virtual machine, what do I do?,” “which argument comes first for this API call?,” etc. Hence, writing tests that start a virtual machine and then later makes sure that it started properly is rather forgiving. It’s also reassuring to end-users to know that their exact use cases are verified to work.

Again, ideally nothing should ever be caught here. If it does, it means that something slipped through a crack left by both the unit tests and black-box tests, or maybe the real KVM doesn’t act like we expected when we wrote its mock counterpart. Everything caught here should end up in a unit test somewhere once the culprit has been found.

Where do we stand today?

Unit and functional tests

As mentioned, nova’s source tree includes a test suite, comprised of both unit and functional tests. We have a Jenkins job that tracks how much of Nova is being exercised by the test suite. At the time of this writing, we have around 74% coverage. Bear in mind that if a particular line is exercised by either a unit test or a functional test (or both, of course). At our last design summit, we agreed that we’d work on improving this coverage, but clearly there’s a long way to go (that number should be in the (very) high nineties).

Integration tests

As for integration tests, there are a number of separate efforts:

Where are we going? (a.k.a. how you can help)

Unit and functional tests

I think this is easily where we have the most work to do. Jenkins keeps track of what is covered and what isn’t:

There’s clearly lots of room for improvement. I’d like to encourage anyone who cares about QA to grab a random bit of code that isn’t yet covered by tests and add a test for it. Feel free to start with anything small and seemingly insignificant. We need to get the ball rolling. Small changes also makes the review easier.

I’ve started going through our coverage report and filing bugs about missing unit tests. Some are just a few simple statements that need tests, others are entire modules that are almost testless. Take a look and feel free to get in touch if you need help getting started.

Integration tests

Over the next month or so, we’re hoping to collect all these efforts (and any others out there, so please let me know!) into one. The goal is to have a common set of tests that we can run against an OpenStack intallation (i.e. all the various components that make up an actual deployment) to get early warning if something should break in a particular configuration. So, if you have anything set up to automatically test OpenStack, please get in touch. If there’s a particular configuration you care about, we want to make sure we don’t break it, so we need your help finding a good way to deploy bleeding edge OpenStack code onto your test installation and run a bunch of tests against it.

PPA management tools

We use PPA’s quite heavily in OpenStack. Each of the core projects have a trunk PPA and a milestone-proposed PPA. Every commit to our bzr trunk branch results in an upload to the trunk PPA, and every commit to our milestone-proposed bzr branch results in an upload to (you guessed it) the milestone-proposed PPA. Additionally, we have a common openstack-release PPA for each of our major releases, where we combine all the projects into one PPA, for simpler distribution.

This poses a number of challenges.

We support every Ubuntu release since Lucid, but most of them lack new enough versions of various stuff (and in some cases, the packages are missing altogether). This means we backport a bunch of things to the various trunk PPA’s, and at the right moments we need to copy all these dependencies either from the trunk PPA to the milestone-proposed PPA (when we branch off for a new milestone) or from the milestone-proposed PPA to the common release PPA (at final release time).

This used to involve a lot of mucking around with Launchpad’s web UI which is not only boring and tedious (checking half a bajillion boxes is even less fun than it sounds), but also error prone, since it’s all manual.

I decided to write a number of tools to help make this simpler. So far, these tools are:


    Simply copies a package from one PPA to another.


    This one takes a number of PPA’s as arguments, and finds packages that exist in more than one of them, but at different versions. During the development cycle, this is not much of a problem since most people only run the trunk version of a single project, but when we shove them all together in one great, big PPA, it could mean that one of the projects suddenly is being run against another version of one of its dependencies than during the dev cycle.


    This one takes all the packages from one PPA and copies them to another and removes stuff from the destination PPA that’s been removed from the source PPA. It’s handy if have a PPA with all your stuff in it, it’s all been QA’ed together and is in good shape, and you want to sync it all over into a “stable” PPA in one fell swoop.


    Lists the contents of a PPA. Simple as that.

I’ve branched lp:ubuntu-archive-tools and added these tools to lp:~openstack-release/ubuntu-archive-tools/openstack. I can’t really decide if I think they belong inlp:ubuntu-archive-tools, but if someone else wants them I can look into getting them merged back.

Another week of Openstack stabilisation

I got good feedback on last week’s post about the stuff I’d achieved in Openstack, so I figured I’d do the same this week.

We left the hero of our tale (that would be me (it’s my blog, I can entitle myself however I please)) last Friday somewhat bleary eyed, hacking on a mountall patch that would more gracefully handle SIGPIPE caused by Plymouth going the way of the SIGSEGV. I got the ever awesome Scott James Remnant to review it and he (rightfully) told me to fix it in Plymouth instead. My suggested patch was much more of a workaround than a fix, but I wasn’t really in the mood to deal with Plymouth. Somehow, I had just gotten it into my head that fixing it in Plymouth would be extremely complicated. That probably had to do with the fact that I’d forgotten about MSG_NOSIGNAL for a little bit, and I imagined fixing this problem without MSG_NOSIGNAL would probably mean rewriting a bunch of I/O routines which I certainly didn’t have the brain power for at the time. Nevertheless,  a few attempts later, I got it worked out. I sent it upstream, but it seems to be stuck in the moderation queue for now.

I spent almost a day and a half wondering why some of our unit tests were failing “randomly”. It only happened every once in a while, and every time I tried running it under e.g. strace, it worked. It had “race condition” written all over it. After a lot of swearing, rude gestures and attempts to nail down the race condition, I finally noticed that it only failed if a randomly generated security group name in the test case sorted earlier than “default”, which it would do about 20% of the time. We had recently fixed DescribeSecurityGroups to return an ordered resultset which broke an assumption in this test case. Extremely annoying. My initial proposed fix was a mere 10 characters, but it ended up slightly larger, but the resulting code was easier on the eyes.

Log file handling has been a bit of an eye sore in Nova since The Big Eventlet Merge™. Since then, the Ubuntu packages have simply piped stdout and stderr to a log file and restartet the workers when the log files needed rotating. I finally got fed up with this and resurrected the logdir option and after one futile attempt, I got the log files to rotate without even reloading the workers. Sanity restored.

With all this done, I could now realiably run all the instances I wanted. However, I’d noticed that they’d all be run sequentially. Our workers, while built on top of eventlet, were single-threaded. They could only handle one RPC call at a time. This meant that if the compute worker was handling a long request (e.g. one that involved downloading a large image, and postprocessing it with copy-on-write disabled), another user just wanting to look at their instance’s console output might have to wait minutes for that request to be served. This was causing my tests to take forever to run, so a’fixin’ I went. This means that each worker can now (theoretically) handle 1024 (or any other number you choose) requests at a time.

To test this, I cranked up the concurrency of my tests so that up to 6 instances could started at the same time on each host. This worked about 80% of the time. The remaining 20% instances would entirely fail to be spawned. As could have been predicted, this was a new race condition that was uncovered because we suddenly had actual concurrency in the RPC workers. This time, iptables-restore would fail when trying to run multiple instances at the exact same time. I’ve been wanting to rework our iptables handling for a looong time anyway, so this was a great reason to get to work on that. By 2 AM between Friday and Saturday, I still wasn’t quite happy with it, so you’ll have to read the next post in this series to know how it all worked out.