Tools for Counting Things Quickly

We’re using Elasticsearch as the primary – and only – datastore for Two Way Street. It’s more commonly used alongside a relational datastore for features like plain-text search, or as a high-velocity store for data like server logs. But it’s got a great deal of value for projects with a relatively fixed data-set, too. By Elasticsearch standards, Two Way Street doesn’t store anything approaching ‘big’ data: it’s just a few gigabytes of a catalogue that rarely changes. ‘Slow data’, if you like.

It’s also a dataset that doesn’t really need to be modelled relationally. The original source was a linked-data set, which isn’t traditionally relational. We’ve translated that into a more traditional key-value JSON structure, and directly inserted those JSON objects into Elasticsearch.

Two Way Street uses a single index, and each object it stores is a single thing; things have varying numbers of fields describing them.

We began the project by showing everything, getting lists of data up in our browsers, and slicing and dicing it to explore what it might tell us. On previous projects using relational databases, that necessitated breaking out every relationship into other tables, perhaps joined across join tables, just to get the necessary performance to count large numbers of relations quickly.

Elasticsearch turns that on its head: it’s incredibly fast at answering those questions. Its aggregations answer many of the questions we have when we first start exploring a dataset. Questions such as:

  • how many things were acquired between two years?
  • what are the most common materials things are made of?
  • where do most things acquired in a date range come from?
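As a rough illustration, all three questions above can be bundled into one search body. This is a minimal sketch rather than our actual query: the field names `acquisition_year`, `materials` and `place` are made up for the example, and it assumes those fields are indexed in a way that lets a terms aggregation bucket on them (for instance as keyword fields).

```python
import json


def build_exploration_query(start_year, end_year, top_n=10):
    """Build a search body that counts things acquired in a year range and
    buckets the matches by material and by place of origin.

    All field names here are hypothetical.
    """
    return {
        # We only want counts and buckets, not the documents themselves.
        "size": 0,
        "query": {
            "range": {
                "acquisition_year": {"gte": start_year, "lte": end_year}
            }
        },
        "aggs": {
            # Most common materials among the matching things.
            "top_materials": {"terms": {"field": "materials", "size": top_n}},
            # Most common places of origin among the matching things.
            "top_places": {"terms": {"field": "place", "size": top_n}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_exploration_query(1850, 1900), indent=2))
```

The total number of matches comes back alongside the two sets of buckets, so a single request covers the "how many", the "what" and the "where".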

So often, what we’re doing is counting and listing – usually both at once. Those counts are often predicated on complex criteria, but Elasticsearch’s aggregations make them very straightforward, and allow us to bundle many into a single query. For instance, the page that shows which objects were made in Japan also lists the decades that matching objects came from, visualises which other facets are most popular, and then enumerates the objects themselves. That page takes just two queries in total – one of which is only used to construct the row of boxes for the decades.
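The decades query might look something like the sketch below. It is an assumption-laden example, not the query the site runs: `country_made` and `year_made` are hypothetical field names, and `decade_boxes` is a helper invented here to turn the histogram buckets in a response into labels for the row of boxes.

```python
def build_decades_query(country):
    """Build a search body that buckets matching things into decades.

    `country_made` and `year_made` are hypothetical field names.
    """
    return {
        "size": 0,  # counts only; the listing comes from the other query
        "query": {"term": {"country_made": country}},
        "aggs": {
            # A histogram with an interval of 10 groups years into decades.
            "decades": {"histogram": {"field": "year_made", "interval": 10}}
        },
    }


def decade_boxes(response):
    """Turn the `decades` buckets of a search response into (label, count)
    pairs, one per box."""
    buckets = response["aggregations"]["decades"]["buckets"]
    # Bucket keys may arrive as floats, so normalise them before labelling.
    return [(f"{int(b['key'])}s", b["doc_count"]) for b in buckets]
```

Keeping this as its own small query means the row of decades can be built without touching the query that lists the objects.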

(The list of available aggregations invariably gives me ideas for new queries or features – exploring ranges, or counting popularities, or categorising counts into percentiles are all available in single queries – and all without having to make any join models or complex compound indices!)

Elasticsearch is also very quick to populate. Our starting point was a set of JSON files we’d generated from the original linked data, and these could just be thrown straight into the Bulk API, which stored and indexed our data very, very quickly. One big advantage of the Bulk API is that it can be used entirely separately from our web application. To ingest a lot of data without having to leave our personal computers running for long periods, we can just store the JSON on a remote server, and write a small stand-alone script to throw it at the Bulk endpoint; as we work, we watch the data available to us grow.
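A stand-alone script of that kind can be very small. The sketch below is an illustration, not our actual script: the index name `things` and the `localhost` URL are placeholders, and it assumes a recent Elasticsearch version that doesn’t require a document type in the action line. The Bulk API expects newline-delimited JSON, with an action line before each document and a trailing newline at the end.

```python
import json
import urllib.request


def to_bulk_payload(docs, index="things"):
    """Serialise documents into the newline-delimited body the Bulk API
    expects: an action line, then the document itself, for each document.

    The index name is a placeholder.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


def post_bulk(payload, base_url="http://localhost:9200"):
    """POST a bulk payload to the cluster and return the parsed response.

    `base_url` is a placeholder for wherever the cluster lives.
    """
    request = urllib.request.Request(
        base_url + "/_bulk",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Because the payload is just text, a large set of files can be read in batches and each batch posted in turn, with no involvement from the web application.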

We haven’t even had to do any configuration of our own Elasticsearch instance: hosted services like qbox take care of that for us, meaning we can focus on design and functionality.

Projects shaped like Two Way Street tend, in their early stages, not to be about constructing beautiful, rigorous data-models. Rather, they’re about exploring existing data and structures quickly; counting things in lots of different ways, according to lots of different constraints; counting large numbers of things quickly. It’s easy to reach for a relational datastore without thinking – especially if, as it is for me, that’s the type of datastore you’re most familiar with. But there are lots of other shapes of datastore out there whose functionality is perhaps more suited to the shape of data you’re working with, and their advantages can be huge.

