This is a guest post from Tom Armitage, our collaborator on the V&A Spelunker. It’s our second internal R&D project, and we released it last week.
Early on in the process of making the V&A Spelunker – almost a few hours in – I said to George something along the lines of “I’m really trying to focus on sketching and not engineering right now“. We ended up discussing that comment at some length, and it’s sat with me throughout the project. And it’s what I wanted to think about a little now that the Spelunker is live.
For me, the first phase of any data-related project is material exploration: exploring the dataset, finding out what’s inside it, what it affords, and what it hints at. That exploration isn’t just analytical, though: we also explore the material by sketching with it, and seeing what it can do.
The V&A Spelunker is an exploration of a dataset, but it’s also very much a sketch – or a set of sketches – to see what playing with it feels like: not just an analytical understanding of the data, but also a playful exploration of what interacting with it might be like.
Sketching is about flexibility and a lack of friction. The goal is to get thoughts into the world, to explore them, to see what ideas your hand throws up autonomously. Everything that impedes that makes the sketching less effective. Similarly, everything that makes it hard to change your mind also makes it less effective. It’s why, on paper, we so often sketch with a pencil: it’s easy to rub out and change our mind with, and it also (ideally) glides easily, giving us a range of expression and tone. On paper, we move towards ink or computer-based design as our ideas become more permanent, more locked. Those techniques are often slower to change our minds about, but they’re more robust – they can be reproduced, tweaked, published.
Code is a little different: with code, we sketch in the final medium. The sketch is code, and what we eventually produce – a final iteration, or a production product – will also be code.
As such, it’s hard to balance two ways of working with the same material. Working on the Spelunker, I had to work hard to fight the battle against premature optimisation. Donald Knuth famously described premature optimisation as ‘the root of all evil‘. I’m not sure I’d go that far, but it’s definitely an easy put to fall into when sketching in code.
The question I end up having to answer a lot is: “when is the right time to optimise?” Some days, even in a sketch, optimisation is the right way to go. If we want to find out how many jumpers there are in the collection – well, that’s just a single COUNT query; it doesn’t matter if it’s a little slow.
I have to be doubly careful of premature optimisation when collaborating, and particularly sketching, and remember that not every question or comment is a feature request. My brain often runs off of its own accord, wondering whether I should write a large chunk of code, when really, the designer in me should be just thinking about answering that question. The software-developer part of my brain ought to kick in later, when the same question has come up a few times, or when it turns out the page to answer that question is going to see regular use.
For instance, the Date Graph is also where the performance trade-offs of the Spelunker are most obvious. By which I mean: it’s quite slow.
Why is it slow?
I’d ingested the database we’d been supplied as-is, and just built code on top of it. I stored it in a MySQL database simply because we’d been given a MySQL dump. I made absolutely no decisions: I just wanted to get to data we could explore as fast as possible.
All the V&A’s catalogue data – the exact dataset we had in the MySQL dump – is also available through their excellent public API. The API returns nicely structured JSON, too, making an object’s relationships to attributes like what it’s made of really clear. A lot of this information wasn’t readily available in the MySQL database. The materials relations, for instance, had been reduced to a single comma-separated field – rather than the one-to-many relationship to another table that would have perhaps made more sense.
I could have queried the API to get the shape of the relationships – and if we were building a product focused around looking up a limited number of objects at a time, the API would have been a great way to build on it. But to begin with, we were interested in the shape of the entire catalogue, the birds’ eye view. The bottleneck in using the API for this would be the 1.1 million HTTP requests – one for each item; we’d be limited by the speed of our network connection, and perhaps even get throttled by the API endpoint. Having a list of the items already, in a single place – even if it was a little less rich – was going to be the easiest way to explore the whole dataset.
The MySQL database would be fine to start sketching with, even if it wasn’t as rich as the structured JSON. It was also a little slow due to the size of some of the fields – because the materials and other facets were serialized into single fields, they were often quite large field types such as LONGTEXT, which were slow to query against. Fine for sketching, but it’s not necessarily very good for production in the long-term – and were I to work further on this dataset, I think I’d buckle and either use the API data, or request a richer dump from the original source.
I ended up doing just enough denormalizing to speed up some of the facets, but that was about it in terms of performance optimisation. It hadn’t seemed worthwhile to optimize the database at that point until I knew the sort of questions we want answered.
That last sentence, really, is a better answer to the question of why it is slow.
Yes, technically, it’s because the database schema isn’t quite right yet, or because there’s a better storage platform for that shape of data.
But really, the Spelunker’s mainly slow because it began as a tool to think with, a tool to generate questions. Speed wasn’t our focus on day one of this tiny project. I focused on getting to something that’d lead to more interesting questions rather than something that was quick. We had to speed it up both for our own sanity, and so that it wouldn’t croak when we showed anybody else – both of which are good reasons to optimise.
The point the Spelunker is right now turns out to be where those two things were in fine balance. We’ve got a great tool for thinking and exploring the catalogue, and it’s thrown up exactly the sort of questions we hoped it would. We’ve also begun to hit the limits of what the sketch can do without a bit more ground work: a bit more of the engineering mindset, moving to code that resembles ink rather than pencil.
“Spelunker” suggest a caving metaphor: exploring naturally occurring holes. Perhaps mining is a better metaphor, and the balance that needs to be struck digging your own hole in the ground. The exploration, the digging, is exciting, and for a while, you can get away without supporting the hole. And then, not too early, and ideally not too late, you need to swap into another other mode: propping up the hole you’ve dug. Doing the engineering necessary to make the hole robust – and to enable future exploration. It’s a challenge to do both, but by the end, I think we struck a reasonable balance in the process of making the V&A Spelunker.
If you’re an institution thinking about making your catalogue available publicly:
- API access and data dumps are both useful to developers depending on the type of work they’re doing. Data dumps are great for getting a big picture. They can vastly reduce traffic against your API. But a rich API is useful for integrating into existing platforms, especially if they make relatively few queries per page against your API (and if you have a suitable caching strategy in place). For instance, an API is the ideal interface great for supplying data about a few objects to a single page somewhere else on the internet (such as a newspaper article, or an encyclopedia page).
- If you are going to supply flat dumps, do make sure those files are as rich as the API. Try not to flatten structure or relationships that’s contained in the catalogue. That’s not just to help developers write performant software faster; it’s also to help them come to an understanding of the catalogue’s shape.
- Also, do use the formats of your flat dump files appropriately. Make sure JSON objects are of the right type, rather than just lots of string; use XML attributes as well as element text. If you’re going to supply raw data dumps from, say, an SQL database, make sure that table relations are preserved and suitable indexes already supplied – this might not be what your cataloguing tool automatically generates!
- Make sure to use as many non-proprietary formats as possible. A particular database’s form of SQL is nice for developers who use that software, but most developers will be at least as happy slurping JSON/CSV/XML into their own data store of choice. You might not be saving them time by supplying a more complex format, and you’ll reach a wider potential audience with more generic formats.
- Don’t assume that CSV is irrelevant. Although it’s not as rich or immediately useful as structured data, it’s easily manipulable by non-technical members of a team in tools such as Excel or OpenRefine. It’s also a really good first port of call for just seeing what’s supplied. If you are going to supply CSV, splitting your catalogue into many smaller files is much preferable to a single, hundreds-of-megabytes file.
- “Explorer” type interfaces are also a helpful way for a developer to learn more about the dataset before downloading it and spinning up their own code. The V&A Query Builder, for instance, already gives a developer a feel for the shape of the data, what building queries looks like, and clicking through to the full data for a single object.
- Documentation is always welcome, however good your data and API! In particular, explaining domain-specific terms – be they specific to your own institution, or to your cataloguing platform – is incredibly helpful; not all developers have expert knowledge of the cultural sector.
- Have a way for developers to contact somebody who knows about the public catalogue data. This isn’t just for troubleshooting; it’s also so they can show you what they’re up to. Making your catalogue available should be a net benefit to you, and making sure you have ways to capitalize on that is important!