Reflective monitoring

We’re using Consul for our cloud deployment. Consul makes it very easy and cheap to run health checks on every node, validating its own services. It does all the hard work of monitoring things that you used to rely on e.g. Nagios to do. The obvious drawback is that it runs locally, so you’re not validating that your service actually works across the network.

I spent a bit of time wondering how I could get node X to monitor node Y in a way that

  • fits neatly with Consul’s data model (service checks running on node A can’t easily report failures that are really on node B), and
  • doesn’t require a lot of orchestration when you have tens or hundreds of thousands of nodes that all need to keep an eye on themselves and (some subset of) each other.

I realised that I didn’t actually need another node to perform the monitoring. I just needed to have the network be a factor in the check. After learning that the MIRROR target in iptables went the way of the dodo over a decade ago, I hacked together a little script called reflector.

You simply install it on a host on your network, run e.g. “reflector –return-port 22 10022″ and any connection to port 10022 will be reflected back to the connecting node.

In other words, any node connecting to 10022 on the reflector node will actually connect to itself on port 22, except it will have traversed the network, thus ensuring that the service functions remotely.

It is availabe on pypi and Github.

The Seven-Day Weekend

Declan Meaney, a coworker at Reliance, recommended “The Seven-Day Weekend: Changing the Way Work Works” by Ricardo Semler. I’ve haven’t quite finished it yet, but it’s been incredibly inspirational so far.

Ricardo Semler is the CEO of Semco, a Brazilian company that has undergone a set of radical changes since he took over after his father in 1980.

It’s a bit hard to sum up in a few paragraphs, but I’ll give it shot anyway.

Semco is an amazingly democratic organization. Very, very few decisions are made at the top and the workers on the floor seem to have the power of veto on almost any issue.

In essence, every employee is treated like an adult. An adult whose opinions matter and who is able to make sound decisions if given enough information.

The quintessential example is how salaries are set. According to Semler, there are 5 things one needs to know in order to set one’s salary appropriately:

  1. What do people in similar positions get paid elsewhere?
  2. What do other people (with slightly different skills, more/less experience, etc.) in the organization make?
  3. How is the company doing in the marketplace? I.e. can it afford above or below average salaries?
  4. How much do I feel I should be making at this point in my carreer?
  5. How much do I feel I should be making compared to friends, family, former schoolmates, etc.?

Traditionally, 1 has been well known to everyone. 2 and 3 have been known only to the company. 4 and 5 have only been known to the individual.

Semco provides the workers with all this information. They distribute market surveys to show what people make at competing companies, they’re told what everyone (all the way from the CEO down to the janitor) at Semco makes, and share the company’s profit and forecasts openly. This gives the employees all the information needed to set their own salary appropriately.

It’s so crazy, it works.

The list of things they do differently from most companies is long and fascinating (salaries being set by employees themselves; meeting attendance being entirely optional; office hours being up to each employee on their own; travel expenses being automatically approved, but get posted on the intranet for all to see; etc.), but what interests me the most is the mindset of never just accepting everyone else’s way of doing things as the only way of doing it.

Taking the time to allow yourself and your organization to reconsider all these things as well as being able to actually think outside the box and solve the problems in a way that makes sense in the 21st century is exceptional and inspirational.

If you’re a manager in any capacity or if you’re otherwise interested in organizational and management theory, make sure you put this on your reading list.

168,000 instances in *how many hours*?

The ever awesome James Page posted an article on an experiment conducted by the Ubuntu Server team:

It’s a fine post and I’m happy that they did this experiment. However, I find myself shaking my head in disbelief at these findings. I don’t question their accuracy, I’m just disappointed with the results.

First of all, 4 years into a project whose original intent was a service provider focused platform for running public clouds, a 640 node cluster (1536 cores) shouldn’t be a rarity. If anything, it should be an unusually *small* OpenStack cloud. It’s thought provoking to me that they had to disable Neutron, tweak RabbitMQ settings, etc. for it to work at all.

Let’s look at the numbers, though. The 168,000 instances were spawned on a cluster that grew halfway through the experiment, so I’m going to ignore that particular experiment. I’m going to guess the numbers aren’t going to be prettier at that larger scale anyway.

So, apparently, they got 75,000 instances running on 374 compute nodes in 6 hours and 33 minutes (= 393 minutes). That’s an average of 191 instances launched per minute.

They got 100,000 instance running in 10 hours and 49 minutes (= 649 minutes). That’s an average of 154 instances launched per minute. That’s a rather significant drop from 191. From 6 hours and 33 minutes to 10 hours and 49 minutes is 4 hours, 16 minutes = 256 minutes. Those last 25,000 instances were launched at an average rate of 98 per minute. That’s almost half the launch rate of the first 75,000. Wow.

Looking at it another way, 374 nodes each with 4 cores gives us a total of 1496 cores. 649 minutes, 1496 cores = 970,904 core minutes. With 100,000 instances launched, that’s an average of 9.7 core minutes per instance launch.

9.7 minutes. That’s embarassing. 30 seconds would be acceptable, but something like 10 seconds should be perfectly possible measured from when the launch request is sent until its state is RUNNING, and then another 20 seconds to finish the boot sequence and start listening for SSH connections.

Depressing meeting calculations

I just did some really rather depressing calculations on meeting time.

10 people, 7 projects, weekly meeting of one hour.

Let’s pretend that 25 minutes are spent on general announcements that are genuinely useful to everyone.

The remaining 35 minutes are spent on the 7 different projects. That’s 5 minutes each. The project you’re on is obviously important to you, so that’s 5 minutes more of useful stuff.

The remaining 30 minutes are spent on 6 projects that you’re not working on. Sure, it may be interesting, but on average proably not very useful. Let’s be generous and say one minute of each of the other projects’s time is useful to you. That gives us 36 minutes (or 60%) of useful time. That’s 40% of the hour that is wasted.

Multiplied by 10 people, that’s 4 hours that’ll never come back.

Ok, let’s say the team grows: Five more people, two more projects and half an hour.

We keep the 25 minutes of general announcements.

Then there’ll be some introductory stuff. Let’s say 11 minutes. This is useful to the 5 new people and not at all to the 10 old people.

So now we have 54 minutes left to be divided across 9 projects. That’s 6 minutes each. I.e. 6 useful minutes from your own project, 8*1 useful minutes for other people’s projects and 8*5 useless minutes from other people’s projects.

Useful time:
10 old people * 39 minutes of useful time = 6:30 (43%)
5 new people * 50 minutes of useful time = 4:10 (56%)

That’s a total of 10:40 (10 hours, 40 minutes) of useful time, but 22 and a half hours spent. That’s translates into an efficiency of 48% and it’ll only get worse as the team grows, the project list grows and the meeting gets longer.

Why do we keep doing this?


I had some stuff I needed to run somewhere and had a bit of a hard time working out where I should put it. I figured I’m probably not alone with these doubts, so I decided to put up this new thing and add a bit of clarity to this jungle.

It’s still brand new, only includes information from AWS, Rackspace and HP Cloud, only looks at a small amount of features and doesn’t include any benchmarks yet, but I figured I’d share it sooner rather than later.

BasicDB – API completeness

I’m really excited to share that BasicDB now supports all the same queries as Amazon’s SimpleDB. SimpleDB’s API failure responses are much, much richer than BasicDB’s, but all succesful queries should yield responses indistinguishable from SimpleDB’s.

Now that it’s all implemented and tests are in place to ensure it doesn’t break, I can start cleaning up various hacks that I’ve made along the way and I can start optimizing my use of Riak. This should be fun :)

Nova scheduling

I was very happy to notice this summit session proposal by Mike Wilson for the OpenStack summit in Hong Kong. Its title is “Rethinking scheduler design” and the body reads:

Currently the scheduler exhaustively searches for an optimal solution to requirements given in a provisioning request. I would like to explore breaking down the scheduler problem in to less-than-optimal, but “good enough” answers being given. I believe this approach could deal with a couple of current problems that I see with the scheduler and also move us towards a generic scheduler framework that all of OpenStack can take advantage of:

-Scheduling requests for a deployment with hundreds of nodes take seconds to fulfill. For deployments with thousands of nodes this can be minutes.

-The global nature of the current method does not lend itself to scale and parallelism.

-There are still features that we need in the scheduler such as affinity that are difficult to express and add more complexity to the problem.

Finally. Someone gets it.

My take on this is the same as it was a couple of years ago. Yes, the scheduler is “horizontally scalable” in the sense that you can spin up N of them and have the load automatically spread evenly across them, but — as Mike points out — the problem each of them is solving grows significantly as your deployment grows. Hundreds of nodes isn’t a lot. At all. Spending seconds on a simple scheduling decision is not near good enough.

Get rid of the scheduler and replace it with a number of work queues that distribute resource requests to nodes with spare capacity. I don’t care about optimal placement. I care about placement that will suffice. Even if I did, the metrics that the current scheduler takes into account aren’t sufficient to identify “optimal placement” anyway.

Someone is inevitably going to complain that some of the advanced scheduling options don’t lend themselves well to this “scheduling” mechanism. Well.. Tough cookies. If “advanced scheduling options” prevents us from scaling beyond a few hundred nodes, the problem is “advanced scheduling options”, not the scheduling mechanism. If you never expect to grow beyond a few hundred nodes and you’re happy with scheduling decisions taking a couple of seconds, that’s great. The rest of us who are building big, big deployments need something that’ll scale.

BasicDB – An update

It’s been a few weeks since I’ve talked about BasicDB. I’ve been on the road, so I haven’t had much time to hack on it, but this evening I managed finish a pretty nice replacement for the previous SQL parsing and subsequent data filtering code. The old code would simply parse (and validate) the semi-SQL provided through the API and return the parsed query as a list of strings. At that point, I had to re-analyze those strings to make sense of them and apply filtering.

The new SQL parser matches the SimpleDB API much more closely in terms of what’s allowed and what isn’t, and turns the WHERE expression into essentially a tree of expressions that can be easily applied to filter items from a domain. Additionally, constructing nice Javascript code for use in the Riak database turned out to be almost as easy.

As an example, an expression like:

colour == 'blue' AND size > '5' OR shape = 'triangular'

becomes something like this:


I can simply call a .match(item) method on the top level object to check if a given item matches. If you’ve written parsers and such before, this may be very basic stuff, but I thought it was really neat :)

The Javascript code generator follows a similar pattern where I call a method on the top level object and it ends up spitting out a javascript expression that checks whether a given item matches the WHERE expression:

((vals['colour'] == 'blue') && ((vals['size'] > '5') || (vals['shape'] == 'triangular')))

Again, this is probably beginner’s stuff for someone who has written parsers and/or code generators before, but I was pretty happy with myself when all the tests all of a sudden passed :)

Introducing BasicDB

Somewhat related to my two recent blog posts about the OpenStack design tenets, I’ve spent a couple of days hacking on BasicDB.

BasicDB is a new project which aims to be feature and API compatible with AWS SimpleDB. I wouldn’t mind at all for it to become an OpenStack project, but I have a hard time finding the motivation to come up with a OpenStacky API when there’s already a perfectly functional one that happens to match AWS SimpleDB. If someone wants to contribute that, that’d be great.

Anyway, it seems I need to clarify a few things with regards to BasicDB and how it relates to Trove.

If you’re familiar with AWS’s services (which you should be… They’re awesome), Trove is equivalent to RDS. It’s a service that simplifies the provisioning and management of a relational data store, typically MySQL (in AWS’s case, it can be MySQL, MS SQL Server or Oracle). So each user can utilize Trove to spin up and manage their own MySQL (or whatever) server instance.

BasicDB, on the other hand, is equivalent to SimpleDB. It exposes a basic API that lets you store data and issue queries for said data. Every user interacts with the same instance of BasicDB and it’s up to the cloud provider to set up and maintain a decent backend store for it. At the moment, there are three: A fake one (stores everything in dicts and sets), a filesystem based one (which might not be an entirely horrible solution if you have cephfs or GlusterFS backing said filesystem) or a Riak based one. The Riak based one is still somewhat naïve in that it doesn’t handle sibling reconciliation *at all* yet. More are likely to come, since they’re pretty simple to add.

OpenStack design tenets – Part 2

In my last post, I talked about how we’ve deviated from our original design tenets. I’d like to talk a bit about how we can move forward.

I guess first of all I should point out that I think the design tenets are sound and we’re doing it wrong by not following them, so the solution isn’t to just throw away the design tenets or replace them with new, shitty ones.

I should also point out that my criticism does not apply to Swift. Swift mostly gets it right.

If we want OpenStack to scale, to be resilient in the face of network failures, etc., we need to start going through the various components and see how they violate the design tenets. It’s no secret that I think our central data store is our biggest embarrassment. I cannot talk about our “distributed architecture” and then go on to talk about our central datastore while keeping a straight face.

I don’t believe there’s anything you can do to MySQL to make it acceptable for our use case. That goes for any flavour of MySQL, including Galera. You can make it “HA” in various ways, but at best you’ll be compartmentalising failures. A network partition will inevitably render a good chunk of your cloud unusable, since it won’t be able to interact completely with its datastore.

What we need is a truly distributed, fault tolerant, eventually consistent data store. Things like Riak and Cassandra spring to mind. And, while we’re at it, I think it’s time we stop dealing with the data store directly from the individual projects and instead abstract it away as a separate service that we can expose publically as well as consume internally. I know this community enjoys defining our own API’s with our own semantics, but I think we’d be doing ourselves a horrible disservice by not taking a good, hard look at AWS’s database services and working out how we can rework our datastore usage to function under the constraints these services impose.

I’m delighted to learn that we also have a queueing service in the works. As awesome as RabbitMQ is(!), it’s still a centralised component. ZeroMQ would probably solve a lot of this as well, but having an actual queueing service that we can expose publically as well as consume internall makes a ton of sense to me.

If we make these changes, that’ll take us a long way. What else do you think we should do?