DateRanger – a new tool to share

This is a post by Nat Buckley, of Buckley Williams, our recent collaborators on the new Postal Museum Touch Table project.

As humans we find it easy to quickly read a variety of date formats and almost instantly understand what they represent. We can glance at dates written as “23rd Jan 1894”, “Jan-Mar 1856”, “c. 1960” or “1945-1949” and we know how big the date range is and how accurate it might be. These are just a few examples of the date formats we found in the Postal Museum collections.

To work with those dates effectively in our software we needed to find a way to parse them and represent them in a single format. There are many software libraries designed for translating between standardised date formats (e.g. ones used by different countries), but parsing formats commonly used in archives is a slightly less popular problem to solve. Archives tend not to have an enforced, fixed way of writing down dates, so there can be a surprising variety of notations. This isn’t a bad thing — it gives the archivists the flexibility they need to represent their knowledge about the objects under their care. Each collection might have its own quirks.

Some smaller software libraries do take on the challenge of parsing dates from more natural, human-readable formats, but we decided to devise our own way. We had a very specific set of formats and couldn’t find an existing solution that could deal with all of them easily.

I wrote DateRanger, a Ruby library which takes in those formats and translates them into a data structure which represents the start and end of the date range. It makes it straightforward to understand the accuracy of the date — the wider the range, the less accurate or specific the date was to begin with. We’d love to see contributions from anyone interested in expanding how many formats DateRanger can work with.
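To give a flavour of the idea, here is a minimal sketch of that kind of parsing in Ruby. It is not DateRanger’s actual API — the method names, the handful of supported formats, and the choice to treat “c.” as plus-or-minus five years are all illustrative assumptions:

```ruby
require "date"

# A sketch of the idea behind DateRanger: turn archive-style date
# strings into a [start, end] pair of Dates. The formats handled and
# the "circa" width are assumptions for illustration only.
module DateRangeSketch
  # Map "jan" => 1 ... "dec" => 12.
  MONTHS = Date::ABBR_MONTHNAMES.compact
             .each_with_index.to_h { |name, i| [name.downcase, i + 1] }

  def self.parse(text)
    case text.strip
    when /\A(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]{3})\w*\s+(\d{4})\z/ # "23rd Jan 1894"
      d = Date.new($3.to_i, MONTHS.fetch($2.downcase), $1.to_i)
      [d, d]
    when /\A([A-Za-z]{3})\w*\s*-\s*([A-Za-z]{3})\w*\s+(\d{4})\z/      # "Jan-Mar 1856"
      year = $3.to_i
      first, last = MONTHS.fetch($1.downcase), MONTHS.fetch($2.downcase)
      [Date.new(year, first, 1), Date.new(year, last, -1)]            # -1 = last day
    when /\Ac\.\s*(\d{4})\z/                                          # "c. 1960"
      year = $1.to_i                                                  # assume +/- 5 years
      [Date.new(year - 5, 1, 1), Date.new(year + 5, 12, 31)]
    when /\A(\d{4})\s*-\s*(\d{4})\z/                                  # "1945-1949"
      [Date.new($1.to_i, 1, 1), Date.new($2.to_i, 12, 31)]
    else
      raise ArgumentError, "unrecognised date format: #{text.inspect}"
    end
  end
end
```

The width of the returned range then carries the accuracy: an exact day collapses to a single date, while “c. 1960” spans a decade.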

I used automated tests to build up the code in stages, starting from parsing really simple dates, and culminating in testing even obscure format combinations that we didn’t quite encounter in our data sample. The tests-first approach meant I managed to notice and catch some pretty confusing bugs really early on.

We used DateRanger on the Postal Museum touch table, to help us determine where on the timeline to place the collection records. We did, however, use the original date formats from the archive to display to the viewer. After all, those are already perfectly human-readable.

How big is that?

As part of the work we’ve done on the new Waddesdon Bequest Explorer with the British Museum, we made a widget which depicts the volume of an object (as a cuboid), next to a tennis ball for scale, and thought that other people might be able to use it. So, we’ve extracted it into a public GitHub repository called dimension-drawer.

This seems like a small thing, but it can be hard to get a sense of how big a museum object is, when looking at its photo online – especially when that photo is artfully shot on a black background.

Luckily for us, ‘dimensions’ of an object are one bit of metadata that museums routinely store across their entire collection. I suspect this is mainly for practical purposes (“how big a box will I need to transport this object in?”), as there would be little point in displaying it on a label in the gallery, when you can look at the real thing.

The tool outputs the drawing as SVG (Scalable Vector Graphics), an XML based format which works in most modern browsers (even Internet Explorer). You can even style it using regular ol’ CSS.

The cuboid in the diagram is drawn using the ‘Cabinet Projection’, which is a sort of fake 3D with parallel lines (instead of a vanishing point). This made the maths easier (I dusted off my school-age memory of trigonometry), and also seemed like a pleasing throwback to the age of the collection. (Cabinet Projection was traditionally used in technical drawing by furniture makers.)
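The maths really is modest. In cabinet projection the front face keeps its true width and height, while the depth axis recedes at 45 degrees and is foreshortened to half length. A sketch of that construction in Ruby (this is the textbook formula, not dimension-drawer’s actual code):

```ruby
# Cabinet projection sketch: front face at true scale, depth axis at
# 45 degrees and halved. Textbook construction, for illustration.
ANGLE = Math::PI / 4 # 45 degrees

# Project a 3D point (x across, y up, z into the page) to 2D.
def project(x, y, z)
  [x + (z / 2.0) * Math.cos(ANGLE),
   y + (z / 2.0) * Math.sin(ANGLE)]
end

# The eight projected corners of a w x h x d cuboid, front face then
# back face, ready to join up into an SVG polygon or path.
def cuboid_corners(w, h, d)
  [0, d].flat_map do |z|
    [[0, 0], [w, 0], [w, h], [0, h]].map { |x, y| project(x, y, z) }
  end
end
```

Because the projection is parallel rather than perspective, the same offset shifts every back-face corner, which keeps both the trigonometry and the SVG path generation simple.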

Why a tennis ball? The size was right for our purposes (some of the objects are smaller, some bigger), we thought a tennis ball would be universally recognisable, and it’s simple to draw. We’ve already had a request for the option of displaying a rugby ball instead though!

You can see how it looks for a collection of almost 300 objects in the new Waddesdon Bequest collection explorer we made for the British Museum.

We’ve published the Ruby code behind the tool as a gem, so that other people can use it. There are instructions on the GitHub page.

I’m sure there’s lots of ways the code could be improved, or new features added, so if you have any ideas, get in touch.

Sketching and Engineering

This is a guest post from Tom Armitage, our collaborator on the V&A Spelunker. It’s our second internal R&D project, and we released it last week.

Early on in the process of making the V&A Spelunker – only a few hours in – I said to George something along the lines of “I’m really trying to focus on sketching and not engineering right now”. We ended up discussing that comment at some length, and it’s sat with me throughout the project. And it’s what I wanted to think about a little now that the Spelunker is live.

For me, the first phase of any data-related project is material exploration: exploring the dataset, finding out what’s inside it, what it affords, and what it hints at. That exploration isn’t just analytical, though: we also explore the material by sketching with it, and seeing what it can do.

The V&A Spelunker is an exploration of a dataset, but it’s also very much a sketch – or a set of sketches – to see what playing with it feels like: not just an analytical understanding of the data, but also a playful exploration of what interacting with it might be like.

Sketching is about flexibility and a lack of friction. The goal is to get thoughts into the world, to explore them, to see what ideas your hand throws up autonomously. Everything that impedes that makes the sketching less effective. Similarly, everything that makes it hard to change your mind also makes it less effective. It’s why, on paper, we so often sketch with a pencil: it’s easy to rub out and change our mind with, and it also (ideally) glides easily, giving us a range of expression and tone. On paper, we move towards ink or computer-based design as our ideas become more permanent, more locked. Those techniques are often slower to change our minds about, but they’re more robust – they can be reproduced, tweaked, published.

Code is a little different: with code, we sketch in the final medium. The sketch is code, and what we eventually produce – a final iteration, or a production product – will also be code.

As such, it’s hard to balance two ways of working with the same material. Working on the Spelunker, I had to work hard to fight the battle against premature optimisation. Donald Knuth famously described premature optimisation as ‘the root of all evil’. I’m not sure I’d go that far, but it’s definitely an easy pit to fall into when sketching in code.

The question I end up having to answer a lot is: “when is the right time to optimise?” Some days, even in a sketch, optimisation is the right way to go; other days it isn’t. If we want to find out how many jumpers there are in the collection – well, that’s just a single COUNT query, and it doesn’t matter if it’s a little slow.

I have to be doubly careful of premature optimisation when collaborating, and particularly when sketching, and remember that not every question or comment is a feature request. My brain often runs off of its own accord, wondering whether I should write a large chunk of code, when really the designer in me should just be thinking about answering that question. The software-developer part of my brain ought to kick in later, when the same question has come up a few times, or when it turns out the page to answer that question is going to see regular use.

The Date Graph, for instance, is where the performance trade-offs of the Spelunker are most obvious. By which I mean: it’s quite slow.

Why is it slow?

I’d ingested the database we’d been supplied as-is, and just built code on top of it. I stored it in a MySQL database simply because we’d been given a MySQL dump. I made absolutely no decisions: I just wanted to get to data we could explore as fast as possible.

All the V&A’s catalogue data – the exact dataset we had in the MySQL dump – is also available through their excellent public API. The API returns nicely structured JSON, too, making an object’s relationships to attributes like what it’s made of really clear. A lot of this information wasn’t readily available in the MySQL database. The materials relations, for instance, had been reduced to a single comma-separated field – rather than the one-to-many relationship to another table that would perhaps have made more sense.
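To make the flattening concrete, here is roughly what working with such a field looks like – the field contents and method name here are made up for illustration, not the real column:

```ruby
# Sketch: recovering a usable list from a flattened, comma-separated
# materials field like the one in the dump. Contents illustrative.
def split_materials(flat_field)
  return [] if flat_field.nil?
  flat_field.split(",").map(&:strip).reject(&:empty?)
end

split_materials("gold, enamel,rock crystal")
# => ["gold", "enamel", "rock crystal"]
```

Splitting a string back apart is easy enough, but the damage is real: a flat field can’t tell you that “rock crystal” is one material rather than two, which is exactly the ambiguity a proper one-to-many relation avoids.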

I could have queried the API to get the shape of the relationships – and if we were building a product focused around looking up a limited number of objects at a time, the API would have been a great way to build on it. But to begin with, we were interested in the shape of the entire catalogue, the bird’s-eye view. The bottleneck in using the API for this would be the 1.1 million HTTP requests – one for each item; we’d be limited by the speed of our network connection, and perhaps even get throttled by the API endpoint. Having a list of the items already, in a single place – even if it was a little less rich – was going to be the easiest way to explore the whole dataset.

The MySQL database would be fine to start sketching with, even if it wasn’t as rich as the structured JSON. It was also a little slow due to the size of some of the fields – because the materials and other facets were serialised into single fields, those fields often used large column types such as LONGTEXT, which were slow to query against. Fine for sketching, but not necessarily very good for production in the long term – and were I to work further on this dataset, I think I’d buckle and either use the API data, or request a richer dump from the original source.

I ended up doing just enough denormalising to speed up some of the facets, but that was about it in terms of performance optimisation. It didn’t seem worthwhile to optimise the database further until I knew the sort of questions we wanted answered.
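In principle, that “just enough” step looks something like the following: explode each record’s flattened materials string into one row per object/material pair, the shape you would bulk-load into a small join table with an index on the material column. Field and record names here are invented, not the Spelunker’s actual schema:

```ruby
# Sketch of "just enough" denormalisation: turn flattened records
# into (object_id, material) rows for an indexed join table, so
# facet queries stop scanning big LONGTEXT columns. Names invented.
def denormalise_materials(records)
  records.flat_map do |rec|
    rec[:materials].to_s.split(",").map(&:strip).reject(&:empty?).map do |m|
      { object_id: rec[:id], material: m }
    end
  end
end

rows = denormalise_materials([
  { id: 1, materials: "silver, enamel" },
  { id: 2, materials: "gold" },
])
# rows holds one entry per object/material pair - three in this case.
```

A facet count then becomes an indexed lookup on the new table instead of a string match over every record.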

That last sentence, really, is a better answer to the question of why it is slow.

Yes, technically, it’s because the database schema isn’t quite right yet, or because there’s a better storage platform for that shape of data.

But really, the Spelunker’s mainly slow because it began as a tool to think with, a tool to generate questions. Speed wasn’t our focus on day one of this tiny project. I focused on getting to something that’d lead to more interesting questions rather than something that was quick. We had to speed it up both for our own sanity, and so that it wouldn’t croak when we showed anybody else – both of which are good reasons to optimise.

The point the Spelunker is right now turns out to be where those two things were in fine balance. We’ve got a great tool for thinking and exploring the catalogue, and it’s thrown up exactly the sort of questions we hoped it would. We’ve also begun to hit the limits of what the sketch can do without a bit more ground work: a bit more of the engineering mindset, moving to code that resembles ink rather than pencil.

“Spelunker” suggests a caving metaphor: exploring naturally occurring holes. Perhaps mining is a better metaphor, with the balance that needs to be struck when digging your own hole in the ground. The exploration, the digging, is exciting, and for a while, you can get away without supporting the hole. And then, not too early, and ideally not too late, you need to swap into the other mode: propping up the hole you’ve dug. Doing the engineering necessary to make the hole robust – and to enable future exploration. It’s a challenge to do both, but by the end, I think we struck a reasonable balance in the process of making the V&A Spelunker.

If you’re an institution thinking about making your catalogue available publicly:

  • API access and data dumps are both useful to developers, depending on the type of work they’re doing. Data dumps are great for getting a big picture, and they can vastly reduce traffic against your API. But a rich API is useful for integrating into existing platforms, especially ones that make relatively few queries per page (and if you have a suitable caching strategy in place). For instance, an API is the ideal interface for supplying data about a few objects to a single page somewhere else on the internet (such as a newspaper article, or an encyclopedia page).
  • If you are going to supply flat dumps, do make sure those files are as rich as the API. Try not to flatten structure or relationships contained in the catalogue. That’s not just to help developers write performant software faster; it’s also to help them come to an understanding of the catalogue’s shape.
  • Also, do use the formats of your flat dump files appropriately. Make sure JSON values are of the right type, rather than just lots of strings; use XML attributes as well as element text. If you’re going to supply raw data dumps from, say, an SQL database, make sure that table relations are preserved and suitable indexes already supplied – this might not be what your cataloguing tool automatically generates!
  • Make sure to use as many non-proprietary formats as possible. A particular database’s form of SQL is nice for developers who use that software, but most developers will be at least as happy slurping JSON/CSV/XML into their own data store of choice. You might not be saving them time by supplying a more complex format, and you’ll reach a wider potential audience with more generic formats.
  • Don’t assume that CSV is irrelevant. Although it’s not as rich or immediately useful as structured data, it’s easily manipulable by non-technical members of a team in tools such as Excel or OpenRefine. It’s also a really good first port of call for just seeing what’s supplied. If you are going to supply CSV, splitting your catalogue into many smaller files is much preferable to a single, hundreds-of-megabytes file.
  • “Explorer” type interfaces are also a helpful way for a developer to learn more about the dataset before downloading it and spinning up their own code. The V&A Query Builder, for instance, already gives a developer a feel for the shape of the data and what building queries looks like, and lets them click through to the full data for a single object.
  • Documentation is always welcome, however good your data and API! In particular, explaining domain-specific terms – be they specific to your own institution, or to your cataloguing platform – is incredibly helpful; not all developers have expert knowledge of the cultural sector.
  • Have a way for developers to contact somebody who knows about the public catalogue data. This isn’t just for troubleshooting; it’s also so they can show you what they’re up to. Making your catalogue available should be a net benefit to you, and making sure you have ways to capitalise on that is important!