I was very happy to notice this summit session proposal by Mike Wilson for the OpenStack summit in Hong Kong. Its title is “Rethinking scheduler design” and the body reads:
Currently the scheduler exhaustively searches for an optimal solution to requirements given in a provisioning request. I would like to explore breaking down the scheduler problem in to less-than-optimal, but “good enough” answers being given. I believe this approach could deal with a couple of current problems that I see with the scheduler and also move us towards a generic scheduler framework that all of OpenStack can take advantage of:
-Scheduling requests for a deployment with hundreds of nodes take seconds to fulfill. For deployments with thousands of nodes this can be minutes.
-The global nature of the current method does not lend itself to scale and parallelism.
-There are still features that we need in the scheduler such as affinity that are difficult to express and add more complexity to the problem.
Finally. Someone gets it.
My take on this is the same as it was a couple of years ago. Yes, the scheduler is “horizontally scalable” in the sense that you can spin up N of them and have the load automatically spread evenly across them, but — as Mike points out — the problem each of them is solving grows significantly as your deployment grows. Hundreds of nodes isn’t a lot. At all. Spending seconds on a simple scheduling decision is not near good enough.
Get rid of the scheduler and replace it with a number of work queues that distribute resource requests to nodes with spare capacity. I don’t care about optimal placement. I care about placement that will suffice. Even if I did, the metrics that the current scheduler takes into account aren’t sufficient to identify “optimal placement” anyway.
Someone is inevitably going to complain that some of the advanced scheduling options don’t lend themselves well to this “scheduling” mechanism. Well.. Tough cookies. If “advanced scheduling options” prevents us from scaling beyond a few hundred nodes, the problem is “advanced scheduling options”, not the scheduling mechanism. If you never expect to grow beyond a few hundred nodes and you’re happy with scheduling decisions taking a couple of seconds, that’s great. The rest of us who are building big, big deployments need something that’ll scale.
It’s been a few weeks since I’ve talked about BasicDB. I’ve been on the road, so I haven’t had much time to hack on it, but this evening I managed finish a pretty nice replacement for the previous SQL parsing and subsequent data filtering code. The old code would simply parse (and validate) the semi-SQL provided through the API and return the parsed query as a list of strings. At that point, I had to re-analyze those strings to make sense of them and apply filtering.
As an example, an expression like:
colour == 'blue' AND size > '5' OR shape = 'triangular'
becomes something like this:
I can simply call a
.match(item) method on the top level object to check if a given item matches. If you’ve written parsers and such before, this may be very basic stuff, but I thought it was really neat
((vals['colour'] == 'blue') && ((vals['size'] > '5') || (vals['shape'] == 'triangular')))
Again, this is probably beginner’s stuff for someone who has written parsers and/or code generators before, but I was pretty happy with myself when all the tests all of a sudden passed
Somewhat related to my two recent blog posts about the OpenStack design tenets, I’ve spent a couple of days hacking on BasicDB.
BasicDB is a new project which aims to be feature and API compatible with AWS SimpleDB. I wouldn’t mind at all for it to become an OpenStack project, but I have a hard time finding the motivation to come up with a OpenStacky API when there’s already a perfectly functional one that happens to match AWS SimpleDB. If someone wants to contribute that, that’d be great.
Anyway, it seems I need to clarify a few things with regards to BasicDB and how it relates to Trove.
If you’re familiar with AWS’s services (which you should be… They’re awesome), Trove is equivalent to RDS. It’s a service that simplifies the provisioning and management of a relational data store, typically MySQL (in AWS’s case, it can be MySQL, MS SQL Server or Oracle). So each user can utilize Trove to spin up and manage their own MySQL (or whatever) server instance.
BasicDB, on the other hand, is equivalent to SimpleDB. It exposes a basic API that lets you store data and issue queries for said data. Every user interacts with the same instance of BasicDB and it’s up to the cloud provider to set up and maintain a decent backend store for it. At the moment, there are three: A fake one (stores everything in dicts and sets), a filesystem based one (which might not be an entirely horrible solution if you have cephfs or GlusterFS backing said filesystem) or a Riak based one. The Riak based one is still somewhat naïve in that it doesn’t handle sibling reconciliation *at all* yet. More are likely to come, since they’re pretty simple to add.