168,000 instances in *how many hours*?

Sponsored links

The ever awesome James Page posted an article on an experiment conducted by the Ubuntu Server team:

It’s a fine post and I’m happy that they did this experiment. However, I find myself shaking my head in disbelief at these findings. I don’t question their accuracy, I’m just disappointed with the results.

First of all, 4 years into a project whose original intent was a service provider focused platform for running public clouds, a 640 node cluster (1536 cores) shouldn’t be a rarity. If anything, it should be an unusually *small* OpenStack cloud. It’s thought provoking to me that they had to disable Neutron, tweak RabbitMQ settings, etc. for it to work at all.

Let’s look at the numbers, though. The 168,000 instances were spawned on a cluster that grew halfway through the experiment, so I’m going to ignore that particular experiment. I’m going to guess the numbers aren’t going to be prettier at that larger scale anyway.

So, apparently, they got 75,000 instances running on 374 compute nodes in 6 hours and 33 minutes (= 393 minutes). That’s an average of 191 instances launched per minute.

They got 100,000 instance running in 10 hours and 49 minutes (= 649 minutes). That’s an average of 154 instances launched per minute. That’s a rather significant drop from 191. From 6 hours and 33 minutes to 10 hours and 49 minutes is 4 hours, 16 minutes = 256 minutes. Those last 25,000 instances were launched at an average rate of 98 per minute. That’s almost half the launch rate of the first 75,000. Wow.

Looking at it another way, 374 nodes each with 4 cores gives us a total of 1496 cores. 649 minutes, 1496 cores = 970,904 core minutes. With 100,000 instances launched, that’s an average of 9.7 core minutes per instance launch.

9.7 minutes. That’s embarassing. 30 seconds would be acceptable, but something like 10 seconds should be perfectly possible measured from when the launch request is sent until its state is RUNNING, and then another 20 seconds to finish the boot sequence and start listening for SSH connections.

Depressing meeting calculations

Sponsored links

I just did some really rather depressing calculations on meeting time.

10 people, 7 projects, weekly meeting of one hour.

Let’s pretend that 25 minutes are spent on general announcements that are genuinely useful to everyone.

The remaining 35 minutes are spent on the 7 different projects. That’s 5 minutes each. The project you’re on is obviously important to you, so that’s 5 minutes more of useful stuff.

The remaining 30 minutes are spent on 6 projects that you’re not working on. Sure, it may be interesting, but on average proably not very useful. Let’s be generous and say one minute of each of the other projects’s time is useful to you. That gives us 36 minutes (or 60%) of useful time. That’s 40% of the hour that is wasted.

Multiplied by 10 people, that’s 4 hours that’ll never come back.

Ok, let’s say the team grows: Five more people, two more projects and half an hour.

We keep the 25 minutes of general announcements.

Then there’ll be some introductory stuff. Let’s say 11 minutes. This is useful to the 5 new people and not at all to the 10 old people.

So now we have 54 minutes left to be divided across 9 projects. That’s 6 minutes each. I.e. 6 useful minutes from your own project, 8*1 useful minutes for other people’s projects and 8*5 useless minutes from other people’s projects.

Useful time:
10 old people * 39 minutes of useful time = 6:30 (43%)
5 new people * 50 minutes of useful time = 4:10 (56%)

That’s a total of 10:40 (10 hours, 40 minutes) of useful time, but 22 and a half hours spent. That’s translates into an efficiency of 48% and it’ll only get worse as the team grows, the project list grows and the meeting gets longer.

Why do we keep doing this?

Launching CloudProviderComparison.com

Sponsored links

I had some stuff I needed to run somewhere and had a bit of a hard time working out where I should put it. I figured I’m probably not alone with these doubts, so I decided to put up this new thing and add a bit of clarity to this jungle.

It’s still brand new, only includes information from AWS, Rackspace and HP Cloud, only looks at a small amount of features and doesn’t include any benchmarks yet, but I figured I’d share it sooner rather than later.

BasicDB – API completeness

I’m really excited to share that BasicDB now supports all the same queries as Amazon’s SimpleDB. SimpleDB’s API failure responses are much, much richer than BasicDB’s, but all succesful queries should yield responses indistinguishable from SimpleDB’s.

Now that it’s all implemented and tests are in place to ensure it doesn’t break, I can start cleaning up various hacks that I’ve made along the way and I can start optimizing my use of Riak. This should be fun :)

Nova scheduling

I was very happy to notice this summit session proposal by Mike Wilson for the OpenStack summit in Hong Kong. Its title is “Rethinking scheduler design” and the body reads:

Currently the scheduler exhaustively searches for an optimal solution to requirements given in a provisioning request. I would like to explore breaking down the scheduler problem in to less-than-optimal, but “good enough” answers being given. I believe this approach could deal with a couple of current problems that I see with the scheduler and also move us towards a generic scheduler framework that all of OpenStack can take advantage of:

-Scheduling requests for a deployment with hundreds of nodes take seconds to fulfill. For deployments with thousands of nodes this can be minutes.

-The global nature of the current method does not lend itself to scale and parallelism.

-There are still features that we need in the scheduler such as affinity that are difficult to express and add more complexity to the problem.

Finally. Someone gets it.

My take on this is the same as it was a couple of years ago. Yes, the scheduler is “horizontally scalable” in the sense that you can spin up N of them and have the load automatically spread evenly across them, but — as Mike points out — the problem each of them is solving grows significantly as your deployment grows. Hundreds of nodes isn’t a lot. At all. Spending seconds on a simple scheduling decision is not near good enough.

Get rid of the scheduler and replace it with a number of work queues that distribute resource requests to nodes with spare capacity. I don’t care about optimal placement. I care about placement that will suffice. Even if I did, the metrics that the current scheduler takes into account aren’t sufficient to identify “optimal placement” anyway.

Someone is inevitably going to complain that some of the advanced scheduling options don’t lend themselves well to this “scheduling” mechanism. Well.. Tough cookies. If “advanced scheduling options” prevents us from scaling beyond a few hundred nodes, the problem is “advanced scheduling options”, not the scheduling mechanism. If you never expect to grow beyond a few hundred nodes and you’re happy with scheduling decisions taking a couple of seconds, that’s great. The rest of us who are building big, big deployments need something that’ll scale.

BasicDB – An update

It’s been a few weeks since I’ve talked about BasicDB. I’ve been on the road, so I haven’t had much time to hack on it, but this evening I managed finish a pretty nice replacement for the previous SQL parsing and subsequent data filtering code. The old code would simply parse (and validate) the semi-SQL provided through the API and return the parsed query as a list of strings. At that point, I had to re-analyze those strings to make sense of them and apply filtering.

The new SQL parser matches the SimpleDB API much more closely in terms of what’s allowed and what isn’t, and turns the WHERE expression into essentially a tree of expressions that can be easily applied to filter items from a domain. Additionally, constructing nice Javascript code for use in the Riak database turned out to be almost as easy.

As an example, an expression like:

colour == 'blue' AND size > '5' OR shape = 'triangular'

becomes something like this:


I can simply call a .match(item) method on the top level object to check if a given item matches. If you’ve written parsers and such before, this may be very basic stuff, but I thought it was really neat :)

The Javascript code generator follows a similar pattern where I call a method on the top level object and it ends up spitting out a javascript expression that checks whether a given item matches the WHERE expression:

((vals['colour'] == 'blue') && ((vals['size'] > '5') || (vals['shape'] == 'triangular')))

Again, this is probably beginner’s stuff for someone who has written parsers and/or code generators before, but I was pretty happy with myself when all the tests all of a sudden passed :)

Introducing BasicDB

Somewhat related to my two recent blog posts about the OpenStack design tenets, I’ve spent a couple of days hacking on BasicDB.

BasicDB is a new project which aims to be feature and API compatible with AWS SimpleDB. I wouldn’t mind at all for it to become an OpenStack project, but I have a hard time finding the motivation to come up with a OpenStacky API when there’s already a perfectly functional one that happens to match AWS SimpleDB. If someone wants to contribute that, that’d be great.

Anyway, it seems I need to clarify a few things with regards to BasicDB and how it relates to Trove.

If you’re familiar with AWS’s services (which you should be… They’re awesome), Trove is equivalent to RDS. It’s a service that simplifies the provisioning and management of a relational data store, typically MySQL (in AWS’s case, it can be MySQL, MS SQL Server or Oracle). So each user can utilize Trove to spin up and manage their own MySQL (or whatever) server instance.

BasicDB, on the other hand, is equivalent to SimpleDB. It exposes a basic API that lets you store data and issue queries for said data. Every user interacts with the same instance of BasicDB and it’s up to the cloud provider to set up and maintain a decent backend store for it. At the moment, there are three: A fake one (stores everything in dicts and sets), a filesystem based one (which might not be an entirely horrible solution if you have cephfs or GlusterFS backing said filesystem) or a Riak based one. The Riak based one is still somewhat naïve in that it doesn’t handle sibling reconciliation *at all* yet. More are likely to come, since they’re pretty simple to add.

OpenStack design tenets – Part 2

In my last post, I talked about how we’ve deviated from our original design tenets. I’d like to talk a bit about how we can move forward.

I guess first of all I should point out that I think the design tenets are sound and we’re doing it wrong by not following them, so the solution isn’t to just throw away the design tenets or replace them with new, shitty ones.

I should also point out that my criticism does not apply to Swift. Swift mostly gets it right.

If we want OpenStack to scale, to be resilient in the face of network failures, etc., we need to start going through the various components and see how they violate the design tenets. It’s no secret that I think our central data store is our biggest embarrassment. I cannot talk about our “distributed architecture” and then go on to talk about our central datastore while keeping a straight face.

I don’t believe there’s anything you can do to MySQL to make it acceptable for our use case. That goes for any flavour of MySQL, including Galera. You can make it “HA” in various ways, but at best you’ll be compartmentalising failures. A network partition will inevitably render a good chunk of your cloud unusable, since it won’t be able to interact completely with its datastore.

What we need is a truly distributed, fault tolerant, eventually consistent data store. Things like Riak and Cassandra spring to mind. And, while we’re at it, I think it’s time we stop dealing with the data store directly from the individual projects and instead abstract it away as a separate service that we can expose publically as well as consume internally. I know this community enjoys defining our own API’s with our own semantics, but I think we’d be doing ourselves a horrible disservice by not taking a good, hard look at AWS’s database services and working out how we can rework our datastore usage to function under the constraints these services impose.

I’m delighted to learn that we also have a queueing service in the works. As awesome as RabbitMQ is(!), it’s still a centralised component. ZeroMQ would probably solve a lot of this as well, but having an actual queueing service that we can expose publically as well as consume internall makes a ton of sense to me.

If we make these changes, that’ll take us a long way. What else do you think we should do?

OpenStack design tenets

Before OpenStack even had a name, it had its basic design tenets. The wiki history reveals that Rick wrote these down as early as May 2010, two months before OpenStack was officially launched. Let’s take a look at them:

  1. Scalability and elasticity are our main goals
  2. Any feature that limits our main goals must be optional
  3. Everything should be asynchronous
    • a) If you can’t do something asynchronously, see #2
  4. All required components must be horizontally scalable
  5. Always use shared nothing architecture (SN) or sharding
    • a) If you can’t Share nothing/shard, see #2
  6. Distribute everything
    • a) Especially logic. Move logic to where state naturally exists.
  7. Accept eventual consistency and use it where it is appropriate.
  8. Test everything.
    • a) We require tests with submitted code. (We will help you if you need it)

Now go and look at every single OpenStack diagram of Nova ever presented. Either they look something like this:

Nova diagram

or they’re lying.

Let’s focus our attention for a minute on the little thing in the middle labeled “nova database”. It’s immediately obvious that this is a shared component. That means tenet 5 (“Always use shared nothing architecture (SN) or sharding“) is out the window.

Back in 2010, the shared database was Redis, but since the redisectomy, it’s been MySQL or PostgreSQL (through SQLAlchemy). MySQL and PostgreSQL are ACID compliant, the very opposite of eventually consistent (bye bye, tenet 7). They’re wicked fast and scale very, very well. Vertically. Adios, tenet 4.

Ok, so what’s the score?

Tenet 1: Scalability and elasticity are our main goals.

Tenet 2: Any feature that limits our main goals must be optional

Tenet 3: Everything should be asynchronous

Tenet 4: All required components must be horizontally scalable

Tenet 5: Always use shared nothing architecture or sharding

Tenet 6: Distribute everything (Especially logic. Move logic to where state naturally exists).

Tenet 7: Accept eventual consistency and use it where it is appropriate.

Tenet 8: Test everything.

Is everything synchronous? Hardly. I see 258 instances of RPC call (synchronous RPC methods) vs. 133 instances of RPC cast (asynchronous RPC methods). How often each are called is anybody’s guess, but clearly there’s a fair amount of synchronous stuff going on. Sayonara, tenet 3.

Is everything distributed? No. No, it’s not. Where does the knowledge of individual compute nodes’s capacity for accepting new instances naturally exist? On the compute node itself. Where is the decision made about which compute node should run a new instance? In nova-scheduler. Sure, the scheduler is actually a scale-out internal service in the sense that there could be any number of them, but it’s making decisions on other components’s behalf. Tschüß, tenet 6.

Are we testing everything? Barely. Nova’s most recent test coverage percentage at the time of this writing is 83%. It’s much better than it once was, but there’s still a ways to go up to 100%. Adieu, tenet 8.

We can’t really live without a database, nor a scheduler, so auf wiedersehen tenet 2.

We’re left with:

Tenet 1: Scalability and elasticity are our main goals.

Tenet 2: Any feature that limits our main goals must be optional

Tenet 3: Everything should be asynchronous

Tenet 4: All required components must be horizontally scalable

Tenet 5: Always use shared nothing architecture or sharding

Tenet 6: Distribute everything (Especially logic. Move logic to where state naturally exists).

Tenet 7: Accept eventual consistency and use it where it is appropriate.

Tenet 8: Test everything.

So, the question the remains: With all the above in mind, is scalability and elasticity *really* still our main goals?

On productivity – Part I

I’ve been trying for literally years to really get Getting Things Done under my skin. I’ve read the book several times, each time gaining new insights and for a while inching towards actually using it. For some reason, I always fail at it. I’ve never really worked out why. It all makes perfect sense. I believe it’s a fantastic system, but I just can’t seem to internalise the process.

This weekend, I stumbled upon a post on Milo Casagrande’s blog where he mentioned the Pomodoro Technique. I’ve always enjoyed reading articles etc. on productivity, but somehow never heard about the Pomodoro Technique.

I read the paper and it’s a delightfully straightforward system. It explains clearly, with examples and everything, how you can use the Pomodoro Technique.

The Getting Things Done book talks a lot about concepts and process in very generic terms and (intentionally) avoids imposing tools on the readers. I understand the motivation, but I think it’s misguided. Whenever I meet someone who practices GTD, I always try to get them to explain as much as possible about the practical implementation, their choice of tools, etc, because without this, it’s hard to really get it started. I’ve spent a *lot* of time trying to find good tools, writing tools, etc., but I always wind up with something that doesn’t really work for me, so I was really excited to learn about another, popular productivity system.

I’m going to try some techniques out this week to see if I can combine GTD and the Pomodoro Technique somehow. GTD outlines some excellent concepts for organising your action items, reference material, keeping track of things you’re waiting for others to complete, as well as some very useful ways to review how well the stuff you’re doing hour by hour aligns with your short, medium, and long term goals, but — for me at least — falls short with respect to helping me actually getting started with something. Ironic, really, for a  system called “Getting Things Done”, but that’s a different story.