Category Archives: Code

BasicDB – API completeness

I’m really excited to share that BasicDB now supports all the same queries as Amazon’s SimpleDB. SimpleDB’s API failure responses are much, much richer than BasicDB’s, but all successful queries should yield responses indistinguishable from SimpleDB’s.

Now that it’s all implemented and tests are in place to ensure it doesn’t break, I can start cleaning up various hacks that I’ve made along the way and I can start optimizing my use of Riak. This should be fun :)

Nova scheduling

I was very happy to notice this summit session proposal by Mike Wilson for the OpenStack summit in Hong Kong. Its title is “Rethinking scheduler design” and the body reads:

Currently the scheduler exhaustively searches for an optimal solution to requirements given in a provisioning request. I would like to explore breaking down the scheduler problem in to less-than-optimal, but “good enough” answers being given. I believe this approach could deal with a couple of current problems that I see with the scheduler and also move us towards a generic scheduler framework that all of OpenStack can take advantage of:

-Scheduling requests for a deployment with hundreds of nodes take seconds to fulfill. For deployments with thousands of nodes this can be minutes.

-The global nature of the current method does not lend itself to scale and parallelism.

-There are still features that we need in the scheduler such as affinity that are difficult to express and add more complexity to the problem.

Finally. Someone gets it.

My take on this is the same as it was a couple of years ago. Yes, the scheduler is “horizontally scalable” in the sense that you can spin up N of them and have the load automatically spread evenly across them, but — as Mike points out — the problem each of them is solving grows significantly as your deployment grows. Hundreds of nodes isn’t a lot. At all. Spending seconds on a simple scheduling decision is nowhere near good enough.

Get rid of the scheduler and replace it with a number of work queues that distribute resource requests to nodes with spare capacity. I don’t care about optimal placement. I care about placement that will suffice. Even if I did, the metrics that the current scheduler takes into account aren’t sufficient to identify “optimal placement” anyway.
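A pull-based sketch of that idea (a toy for illustration, not Nova code; node names and capacities here are invented) might look like this:

```python
import queue

# Toy sketch of pull-based placement: compute nodes claim resource requests
# from a shared work queue whenever they have spare capacity, so no central
# scheduler has to track global state.
requests = queue.Queue()
for i in range(5):
    requests.put({'instance': 'vm-%d' % i, 'vcpus': 2})

class ComputeNode:
    def __init__(self, name, capacity):
        self.name, self.free = name, capacity
        self.placed = []

    def pull_work(self, q):
        # Keep claiming requests while we have room for another one.
        while self.free >= 2:
            try:
                req = q.get_nowait()
            except queue.Empty:
                return
            self.free -= req['vcpus']
            self.placed.append(req['instance'])

nodes = [ComputeNode('node-a', 4), ComputeNode('node-b', 8)]
for node in nodes:
    node.pull_work(requests)
```

Placement ends up “good enough” rather than optimal — node-a takes what it can fit and node-b absorbs the rest — and nothing in the loop requires knowledge of any other node.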

Someone is inevitably going to complain that some of the advanced scheduling options don’t lend themselves well to this “scheduling” mechanism. Well.. Tough cookies. If “advanced scheduling options” prevents us from scaling beyond a few hundred nodes, the problem is “advanced scheduling options”, not the scheduling mechanism. If you never expect to grow beyond a few hundred nodes and you’re happy with scheduling decisions taking a couple of seconds, that’s great. The rest of us who are building big, big deployments need something that’ll scale.

BasicDB – An update

It’s been a few weeks since I’ve talked about BasicDB. I’ve been on the road, so I haven’t had much time to hack on it, but this evening I managed to finish a pretty nice replacement for the previous SQL parsing and subsequent data filtering code. The old code would simply parse (and validate) the semi-SQL provided through the API and return the parsed query as a list of strings. At that point, I had to re-analyze those strings to make sense of them and apply filtering.

The new SQL parser matches the SimpleDB API much more closely in terms of what’s allowed and what isn’t, and turns the WHERE expression into essentially a tree of expressions that can be easily applied to filter items from a domain. Additionally, constructing nice Javascript code for use in the Riak database turned out to be almost as easy.

As an example, an expression like:

colour = 'blue' AND size > '5' OR shape = 'triangular'

becomes something like this:


I can simply call a .match(item) method on the top level object to check if a given item matches. If you’ve written parsers and such before, this may be very basic stuff, but I thought it was really neat :)
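To make that concrete, here’s a minimal Python sketch of what such a tree of expression objects could look like. The class names are invented for illustration (BasicDB’s actual ones may differ); SimpleDB attributes are multi-valued, so an item maps attribute names to lists of values:

```python
# Hypothetical sketch of a parsed WHERE expression tree.
class Comparison:
    def __init__(self, attr, op, value):
        self.attr, self.op, self.value = attr, op, value

    def match(self, item):
        # An item matches if any of the attribute's values satisfies the test.
        ops = {'=': lambda a, b: a == b, '>': lambda a, b: a > b}
        return any(ops[self.op](v, self.value)
                   for v in item.get(self.attr, []))

class And:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def match(self, item):
        return self.left.match(item) and self.right.match(item)

class Or:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def match(self, item):
        return self.left.match(item) or self.right.match(item)

# The example expression as one possible grouping:
tree = And(Comparison('colour', '=', 'blue'),
           Or(Comparison('size', '>', '5'),
              Comparison('shape', '=', 'triangular')))
```

Filtering a domain is then just calling `tree.match(item)` on each item.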

The Javascript code generator follows a similar pattern where I call a method on the top level object and it ends up spitting out a Javascript expression that checks whether a given item matches the WHERE expression:

((vals['colour'] == 'blue') && ((vals['size'] > '5') || (vals['shape'] == 'triangular')))

Again, this is probably beginner’s stuff for someone who has written parsers and/or code generators before, but I was pretty happy with myself when all the tests all of a sudden passed :)
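Sketching that generator pattern with the same kind of hypothetical classes (again, names invented for illustration), each node simply renders itself as a Javascript expression over a `vals` object:

```python
# Hypothetical sketch of the Javascript code generator: every node in the
# expression tree knows how to render itself as a JS expression.
class Comparison:
    def __init__(self, attr, op, value):
        self.attr, self.op, self.value = attr, op, value

    def to_js(self):
        # SimpleDB's '=' becomes Javascript's '=='.
        js_op = '==' if self.op == '=' else self.op
        return "(vals[%r] %s %r)" % (self.attr, js_op, self.value)

class BinaryOp:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def to_js(self):
        return "(%s %s %s)" % (self.left.to_js(), self.js_op,
                               self.right.to_js())

class And(BinaryOp):
    js_op = '&&'

class Or(BinaryOp):
    js_op = '||'

expr = And(Comparison('colour', '=', 'blue'),
           Or(Comparison('size', '>', '5'),
              Comparison('shape', '=', 'triangular')))
```

Calling `expr.to_js()` on that tree yields exactly the kind of expression shown above.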

Introducing BasicDB

Somewhat related to my two recent blog posts about the OpenStack design tenets, I’ve spent a couple of days hacking on BasicDB.

BasicDB is a new project which aims to be feature and API compatible with AWS SimpleDB. I wouldn’t mind at all for it to become an OpenStack project, but I have a hard time finding the motivation to come up with an OpenStacky API when there’s already a perfectly functional one that happens to match AWS SimpleDB. If someone wants to contribute that, that’d be great.

Anyway, it seems I need to clarify a few things with regards to BasicDB and how it relates to Trove.

If you’re familiar with AWS’s services (which you should be… They’re awesome), Trove is equivalent to RDS. It’s a service that simplifies the provisioning and management of a relational data store, typically MySQL (in AWS’s case, it can be MySQL, MS SQL Server or Oracle). So each user can utilize Trove to spin up and manage their own MySQL (or whatever) server instance.

BasicDB, on the other hand, is equivalent to SimpleDB. It exposes a basic API that lets you store data and issue queries for said data. Every user interacts with the same instance of BasicDB and it’s up to the cloud provider to set up and maintain a decent backend store for it. At the moment, there are three: a fake one (stores everything in dicts and sets), a filesystem based one (which might not be an entirely horrible solution if you have cephfs or GlusterFS backing said filesystem), and a Riak based one. The Riak based one is still somewhat naïve in that it doesn’t handle sibling reconciliation *at all* yet. More are likely to come, since they’re pretty simple to add.
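For a rough idea of the shape of a backend, the fake one is essentially nested dicts and sets. This is an illustrative sketch with made-up method names, not BasicDB’s actual interface; the sets reflect SimpleDB’s multi-valued attributes:

```python
# Sketch of an in-memory "fake" backend: domains map item names to
# attribute dicts, and each attribute holds a set of values.
class FakeBackend:
    def __init__(self):
        self.domains = {}

    def create_domain(self, name):
        self.domains.setdefault(name, {})

    def put_attributes(self, domain, item_name, attributes):
        item = self.domains[domain].setdefault(item_name, {})
        for attr, value in attributes.items():
            item.setdefault(attr, set()).add(value)

    def get_attributes(self, domain, item_name):
        return self.domains[domain].get(item_name, {})

store = FakeBackend()
store.create_domain('shapes')
store.put_attributes('shapes', 'item1', {'colour': 'blue', 'size': '5'})
store.put_attributes('shapes', 'item1', {'colour': 'red'})
```

Swapping in the filesystem or Riak backend is then just a matter of implementing the same handful of methods against a different store.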

OpenStack design tenets – Part 2

In my last post, I talked about how we’ve deviated from our original design tenets. I’d like to talk a bit about how we can move forward.

I guess first of all I should point out that I think the design tenets are sound and we’re doing it wrong by not following them, so the solution isn’t to just throw away the design tenets or replace them with new, shitty ones.

I should also point out that my criticism does not apply to Swift. Swift mostly gets it right.

If we want OpenStack to scale, to be resilient in the face of network failures, etc., we need to start going through the various components and see how they violate the design tenets. It’s no secret that I think our central data store is our biggest embarrassment. I cannot talk about our “distributed architecture” and then go on to talk about our central datastore while keeping a straight face.

I don’t believe there’s anything you can do to MySQL to make it acceptable for our use case. That goes for any flavour of MySQL, including Galera. You can make it “HA” in various ways, but at best you’ll be compartmentalising failures. A network partition will inevitably render a good chunk of your cloud unusable, since it won’t be able to interact completely with its datastore.

What we need is a truly distributed, fault tolerant, eventually consistent data store. Things like Riak and Cassandra spring to mind. And, while we’re at it, I think it’s time we stop dealing with the data store directly from the individual projects and instead abstract it away as a separate service that we can expose publically as well as consume internally. I know this community enjoys defining our own API’s with our own semantics, but I think we’d be doing ourselves a horrible disservice by not taking a good, hard look at AWS’s database services and working out how we can rework our datastore usage to function under the constraints these services impose.

I’m delighted to learn that we also have a queueing service in the works. As awesome as RabbitMQ is(!), it’s still a centralised component. ZeroMQ would probably solve a lot of this as well, but having an actual queueing service that we can expose publically as well as consume internally makes a ton of sense to me.

If we make these changes, that’ll take us a long way. What else do you think we should do?

OpenStack design tenets

Before OpenStack even had a name, it had its basic design tenets. The wiki history reveals that Rick wrote these down as early as May 2010, two months before OpenStack was officially launched. Let’s take a look at them:

  1. Scalability and elasticity are our main goals
  2. Any feature that limits our main goals must be optional
  3. Everything should be asynchronous
    • a) If you can’t do something asynchronously, see #2
  4. All required components must be horizontally scalable
  5. Always use shared nothing architecture (SN) or sharding
    • a) If you can’t Share nothing/shard, see #2
  6. Distribute everything
    • a) Especially logic. Move logic to where state naturally exists.
  7. Accept eventual consistency and use it where it is appropriate.
  8. Test everything.
    • a) We require tests with submitted code. (We will help you if you need it)

Now go and look at every single OpenStack diagram of Nova ever presented. Either they look something like this:

Nova diagram

or they’re lying.

Let’s focus our attention for a minute on the little thing in the middle labeled “nova database”. It’s immediately obvious that this is a shared component. That means tenet 5 (“Always use shared nothing architecture (SN) or sharding”) is out the window.

Back in 2010, the shared database was Redis, but since the redisectomy, it’s been MySQL or PostgreSQL (through SQLAlchemy). MySQL and PostgreSQL are ACID compliant, the very opposite of eventually consistent (bye bye, tenet 7). They’re wicked fast and scale very, very well. Vertically. Adios, tenet 4.

Ok, so what’s the score?

Tenet 1: Scalability and elasticity are our main goals.

Tenet 2: Any feature that limits our main goals must be optional

Tenet 3: Everything should be asynchronous

~~Tenet 4: All required components must be horizontally scalable~~

~~Tenet 5: Always use shared nothing architecture or sharding~~

Tenet 6: Distribute everything (Especially logic. Move logic to where state naturally exists).

~~Tenet 7: Accept eventual consistency and use it where it is appropriate.~~

Tenet 8: Test everything.

Is everything synchronous? Hardly. I see 258 instances of RPC call (synchronous RPC methods) vs. 133 instances of RPC cast (asynchronous RPC methods). How often each are called is anybody’s guess, but clearly there’s a fair amount of synchronous stuff going on. Sayonara, tenet 3.

Is everything distributed? No. No, it’s not. Where does the knowledge of individual compute nodes’ capacity for accepting new instances naturally exist? On the compute node itself. Where is the decision made about which compute node should run a new instance? In nova-scheduler. Sure, the scheduler is actually a scale-out internal service in the sense that there could be any number of them, but it’s making decisions on other components’ behalf. Tschüß, tenet 6.

Are we testing everything? Barely. Nova’s most recent test coverage percentage at the time of this writing is 83%. It’s much better than it once was, but there’s still a ways to go up to 100%. Adieu, tenet 8.

We can’t really live without a database, nor a scheduler, so auf Wiedersehen, tenet 2.

We’re left with:

Tenet 1: Scalability and elasticity are our main goals.

~~Tenet 2: Any feature that limits our main goals must be optional~~

~~Tenet 3: Everything should be asynchronous~~

~~Tenet 4: All required components must be horizontally scalable~~

~~Tenet 5: Always use shared nothing architecture or sharding~~

~~Tenet 6: Distribute everything (Especially logic. Move logic to where state naturally exists).~~

~~Tenet 7: Accept eventual consistency and use it where it is appropriate.~~

~~Tenet 8: Test everything.~~

So, the question that remains: With all the above in mind, are scalability and elasticity *really* still our main goals?


All the things wrong with monitoring today – Part 3

Erk. I found this sitting around as a draft:

Today’s normality is tomorrow’s abnormality

Last time, we looked at a disk usage graph. This week, we’ll look at CPU usage or something else that goes up and down instead of just up, up, up.

What’s the problem here? The problem is that it’s very hard to set up an alert for this. Some things are simply spiky by nature. Sometimes, that’s perfectly fine. Perhaps the load on this particular application is evenly distributed throughout the day, but at night it runs a bunch of batch processing jobs that peg the CPU for a couple of hours. For this sort of thing, you have a couple of options in terms of monitoring/alerting.

  • Don’t monitor CPU load.
  • Accept being alerted about this every single night.
  • Ignore CPU load during the time of day when this job runs.

All of these options suck.

  • You can’t just not monitor the CPU load. If you’re suddenly at 100% for an hour during the day, something’s wrong!
  • You don’t want to be alerted by something that is normal. That’s silly. You want your monitoring system to only alert you about stuff that’s worth waking up over.
  • Ignoring the CPU load based on the time of day is a step in the right direction, but this is not an isolated case. You probably have many different services, all with different usage patterns. I also don’t really want to think about what it will do to your configuration files if you had to specify different thresholds for every hour of the day (and every day of the week, etc).

Think about that last option a bit.. What would you use to define expected/acceptable levels? Pure guesswork? Of course not. You’ll use the data you already have. Maybe you’ve run this for a while and have cute graphs that can tell you what is expected. But seriously… Looking at graphs from your monitoring system and using them to type configuration back into your monitoring system? That’s the most ridiculous thing I’ve ever heard (yes, I should probably get out more).

Why can’t the monitoring system just tell me when something is out of the ordinary? It has all the data in the world to make that call. If a metric is unusual for that time of day, on that day of week, at that time of year, let me know. If it’s very unusual, send me a text message. Otherwise, I probably don’t care.
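As a toy illustration of the idea (not any particular monitoring product’s algorithm): keep the historical samples for the same time slot, and flag readings that fall far outside them.

```python
import statistics

def is_unusual(history, value, k=3.0):
    """Return True if value deviates more than k standard deviations
    from the historical samples for the same time slot."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) > k * stdev

# CPU load (%) seen at 02:00 on previous nights, while the batch jobs run:
night_history = [92.0, 95.0, 90.0, 94.0, 93.0]
# ...and at 14:00 on previous afternoons:
day_history = [12.0, 15.0, 11.0, 13.0, 14.0]
```

The same 93% reading that is perfectly normal at 02:00 screams for attention at 14:00, with no per-hour thresholds in any configuration file. A real system would want more history and a more robust statistic, but the principle stands.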


All the things wrong with monitoring today – Part 2

I took much longer to post this than intended. Not that I expect anyone to have been sitting on the edge of their chair waiting for it, but still…

It’s been more than a month since my last post, and not a darned thing has changed. Monitoring today still sucks. In the last installment I ranted and moaned about “active” monitoring and how there’s all this information you’re not collecting that is being lost. This time I’ll decry the sorry handling of the data we actually do collect.

Temporal tunnelvision

Let’s for the sake of this argument pretend that “pinging” a web service is actually a useful thing to do. A typical scenario is this: A monitoring server tries to fetch some URL. If it takes less than a second to respond, it’s considered UP (typically resulting in a calming green row in the monitoring system). Kinda like this:

If it’s more than a second, but less than say 5 seconds, it’s considered WARNING (typically indicated by a yellow row in the monitoring system), or if it hasn’t responded within 5 seconds, it’s considered DOWN (resulting in a red row).

Transitions between states often result in an alert being sent out. These alerts typically contain the actual data point that triggered the transition:

"Oh, noes! HTTP changed state to WARNING. It took 1.455 seconds to respond."

It’s sad really, but the data point mentioned in the alert and the most recent you can see in the monitoring system’s web UI are often the only “record” of these data points. “Sad? Who cares? It’s all in the past!”.. *sigh* No. A wise man once said “those who ignore history are doomed to get bitten in the arse by it at some point” (paraphrasing ever so slightly). Here’s why:

Let’s look at a typical disk usage graph:

Sure, your graphs may be slightly bumpier, but this is basically how these things look. It doesn’t take a Ph.D. in statistics to figure out where that blue line is headed (towards the red area, if you hadn’t worked it out).

Say that that’s a graph for the last week. If you imagine you’re extending the line, you can see that the disks will be full in about another week and within the red area just a couple of days from now. Yikes.

The point here is that if you were limited by the temporal tunnelvision of today’s monitoring systems, all you’d have seen was a green row all along. You’d think everything was fine until it suddenly wasn’t. Sadly, lots of people happily ignore this information on a daily basis. Even if they actually do collect this information and make pretty graphs out of it, it’s not something you go and look at very often to see these trends. It’s used mostly as a debugging tool after the fact (“Oh, I just got an alert that the disk on server X is running full… Yup, the graph confirms it.”).

I’m not advocating spending all your precious time sifting through graphs, looking for metrics on a collision course with disaster. Sure, if you only have a few servers, it’s not that big of a deal to look at the disk usage graphs every couple of days and see where they’re headed. If you have a thousand servers, though, it’s a pretty big deal.

So what am I advocating? I want a monitoring system that doesn’t just tell me when a disk has entered the yellow area on the shit-is-about-to-hit-the-fan-o-meter. I want a monitoring system that tells me when the next filesystem is likely to enter the yellow area on said meter. See? Instead of a “current problems list”, I want a “These are the next problems you’re likely to have to deal with” list. I want it to feed into my calendar, so that I don’t accidentally schedule a day off on the same day /var on my db server is going to run full. I want it to automatically add a TODO item to my Remember the Milk account telling me to buy more/bigger drives for my file server.
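Even a naïve sketch of that prediction gets surprisingly far: fit a straight line through recent samples and extrapolate to the threshold. This is illustrative Python, not any monitoring system’s actual code:

```python
def days_until_threshold(samples, threshold=90.0):
    """Least-squares fit a line through (day, percent_used) samples and
    estimate how many days from the last sample until usage crosses the
    threshold. Returns None if usage is flat or falling."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    denom = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / denom
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    return (threshold - intercept) / slope - samples[-1][0]

# One disk-usage sample per day over the last week: 65% -> 77% full.
usage = [(0, 65.0), (1, 67.0), (2, 69.0), (3, 71.0),
         (4, 73.0), (5, 75.0), (6, 77.0)]
```

Run over every filesystem you monitor, sort ascending, and you have exactly that “these are the next problems you’re likely to have to deal with” list.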

It shouldn’t be that hard!


All the things wrong with monitoring today – Part 1

Monitoring today sucks. Big time. It sucks so bad, it’s not even funny. The amount of time spent configuring stuff, dealing with problems when it’s already too late, and the amount of things your monitoring system could be monitoring, but isn’t, are all staggering. I’ll be spending a couple of posts whining about this. Who knows? Maybe I have a solution up my sleeve by the end of it.

Active vs. passive monitoring

Active monitoring.. It sounds cool. Way cooler than passive. Most of the time, if you have a choice between an active and a passive something, you go with the active one, right? Well, not this time.

The number of times I’ve seen people set up their monitoring system to access an HTTP URL especially crafted to be useless, to simply respond to the probe as quickly as possible, is ridiculous. It’s surely active, but it’s almost entirely useless. Sure, if this is a service no one uses, it’s probably fine, but if this is a service that has almost any sort of real world use, in the customary 5 minutes between each of these “pings”, there will have been dozens, scores, if not hundreds or thousands of actual requests. Requests that actually did something. Exercised your service at least to some extent. Sadly, this information is almost universally ignored.

Telling Apache to log the amount of time it took to serve a request is trivial. Collecting this information is trivial. Feeding that data to your monitoring system (if not on a per-request basis, just a maximum request time over the last 10 seconds would be a vast improvement) really shouldn’t be too hard. So why don’t you?
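For instance, Apache’s %D log directive records the time taken to serve each request, in microseconds. Given that, pulling the slowest request out of a batch of recent log lines is a few lines of Python (the exact log format here is a made-up example; only the trailing %D field matters):

```python
# Assuming an Apache configuration along the lines of:
#   LogFormat "%h %t %D" timing
# where %D is the serve time in microseconds, as the last field.
def max_request_time(log_lines):
    """Return the slowest serve time in seconds, e.g. over the log
    lines collected in the last 10 seconds."""
    times = [int(line.rsplit(None, 1)[-1]) for line in log_lines]
    return max(times) / 1000000 if times else 0.0

lines = [
    '10.0.0.1 [21/Apr/2013:10:00:01 +0000] 230000',
    '10.0.0.2 [21/Apr/2013:10:00:04 +0000] 1455000',
    '10.0.0.1 [21/Apr/2013:10:00:09 +0000] 87000',
]
```

Feed that number to your monitoring system every 10 seconds and you’re measuring what real users actually experienced, not what a deliberately useless probe URL experienced.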