As part of building the Wellcome Library Alpha, one issue we had to grapple with was ‘subjects’. We knew we wanted these to be a core part of the browsing and discovery experience, as they are crucial to understanding what the collection is about.
However, ‘subjects’ have a long and varied history within the world of libraries. Fundamentally this is because, unlike the data about a book’s title, authors, page count and so on, all of which are actually printed in the book, a book’s subjects are subjective.
You could imagine a system whereby a librarian who is cataloging a book gets to write a whole paragraph of carefully considered prose about what a book is about. Actually you don’t have to imagine: this is pretty much what art curators do.
But whilst a paragraph of prose accurately describing what a specific book is about would be super useful once you’d got to an individual catalogue record, it’s less useful for searching (not to mention that librarians probably don’t have the time).
So instead, libraries use lists of subject terms (which are called ‘headings’ – because they were once headings on actual pieces of card).
These terms can be ‘controlled’ – i.e. only a limited set can be used, with control over adding/removing terms held by some group, or ‘uncontrolled’, in which case new terms can be made up on the spot at the point of cataloguing.
Wellcome Library uses a mixture of these. Some subjects are entered as free text, with any consistency down to the individual cataloger. Others are referenced against an external controlled vocabulary.
And there isn’t just one external list of subjects in place – there are many. The main two are MeSH, which is medicine-specific and controlled by the U.S. National Library of Medicine, and LCSH, controlled by the U.S. Library of Congress. Other minor vocabularies in use include one designed for use in children’s libraries.
Some of the differences between these subject vocabularies are pretty minor: things like capitalisation, pluralisation, or the presence of an extra full stop at the end of the phrase. These don’t matter too much if your main interface is search (so long as your search engine can support fuzzy matches), but we wanted to be able to show things like the top subjects across the collection.
So we spent quite a bit of time merging these subjects together. It’s a big job though – whilst we could handle the minor differences automatically, others require manual intervention (such as knowing that “World War II” and “1939-1945 World War” refer to the same event).
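A minimal sketch of the automatic part, assuming the minor differences are limited to case, surrounding whitespace, a trailing full stop and a simple plural ‘s’. The function names are hypothetical and this is not the Alpha’s actual code; it only illustrates the idea of reducing each label to a matching key.

```python
import re

def normalise_subject(label):
    """Return a crude matching key for a subject label (illustrative only)."""
    key = label.strip().lower()
    key = key.rstrip(".")            # drop any trailing full stop
    key = re.sub(r"\s+", " ", key)   # collapse runs of whitespace
    # Very naive singularisation: strip a final "s", but leave "ss" alone.
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key

def group_subjects(labels):
    """Group labels that share the same matching key."""
    groups = {}
    for label in labels:
        groups.setdefault(normalise_subject(label), []).append(label)
    return groups

groups = group_subjects(["Beards.", "beard", "World War II", "1939-1945 World War"])
```

Here ‘Beards.’ and ‘beard’ land in the same group, but the two World War labels stay apart, which is exactly the kind of case that still needs a human.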
Both MeSH subjects and LCSH subjects have IDs within those schemes. Because we’re merging them together though – and because there are also plenty of free-text subjects within the Wellcome Library catalogue – we minted a new Wellcome-specific identifier for subjects, the ‘S-number’ (visible in the URL). However we retain the IDs within other schemes as concordances, and they’re listed at the bottom of each subject page.
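One way to picture a merged subject with its concordances is sketched below. The ‘S’ plus counter format, the class names and the external IDs are all placeholders invented for illustration; the real S-number scheme may look quite different.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Subject:
    s_number: str                      # locally minted identifier (format assumed)
    label: str
    concordances: dict = field(default_factory=dict)  # scheme name -> external ID

class SubjectRegistry:
    """Mints one local identifier per merged subject (hypothetical helper)."""

    def __init__(self):
        self._counter = count(1)
        self._by_key = {}

    def get_or_mint(self, key, label):
        # Reuse the existing subject if this key has been seen before.
        if key not in self._by_key:
            self._by_key[key] = Subject(f"S{next(self._counter)}", label)
        return self._by_key[key]

registry = SubjectRegistry()
beards = registry.get_or_mint("beard", "Beards")
beards.concordances["mesh"] = "placeholder-mesh-id"   # not a real MeSH ID
beards.concordances["lcsh"] = "placeholder-lcsh-id"   # not a real LCSH ID
same = registry.get_or_mint("beard", "beard")
```

Free-text subjects simply end up with an empty concordance list.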
Finally, the controlled subject vocabularies aren’t always just flat lists of terms. In the case of MeSH, the terms are organised into hierarchies, and each term also has a list of synonyms and a ‘scope note’, which is a sort-of description of the subject (albeit probably written more to aid catalogers than library users). We imported all of this extra metadata, making use of the synonyms within search, the hierarchies for browse, and the scope notes for context. They’re all a bit weird. Within the MeSH hierarchy, the subject ‘Thumb’ is buried deep within ‘Body Regions’, within ‘Fingers’ within ‘Hand’ within ‘Upper Extremity’ within ‘Extremities’, but ‘Breast’ and ‘Perineum’ are immediate children. And ‘milk’ is described in the scope notes as ‘The white liquid secreted by the mammary glands. It contains proteins, sugar, lipids, vitamins, and minerals.’
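To make the browse hierarchy concrete, here is a small sketch that turns the example above into a breadcrumb trail. It assumes each term has a single parent, which is a simplification (a term could sit in more than one place), and the `PARENT` table and `breadcrumb` function are hypothetical names.

```python
# Child -> parent links, taken from the 'Thumb' example in the text.
PARENT = {
    "Thumb": "Fingers",
    "Fingers": "Hand",
    "Hand": "Upper Extremity",
    "Upper Extremity": "Extremities",
    "Extremities": "Body Regions",
    "Breast": "Body Regions",
    "Perineum": "Body Regions",
}

def breadcrumb(subject, parents=PARENT):
    """Walk up the parent links and return the path from the root down."""
    path = [subject]
    seen = {subject}
    while path[-1] in parents:
        parent = parents[path[-1]]
        if parent in seen:               # guard against accidental cycles
            raise ValueError("cycle in hierarchy at " + parent)
        path.append(parent)
        seen.add(parent)
    return list(reversed(path))

print(" > ".join(breadcrumb("Thumb")))
```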
Treating subjects as top-level objects within our design and database structure, rather than just attributes of a record, gives us a place to add extra editorial content too. So editors can add Wellcome-specific introductions, including links to relevant blog posts. For an example see ‘Beards’. We also added a way to flag subjects as ‘interesting’, so that we could show a selection of those on the homepage and subject pages.
The aggregated set of things about a given subject also helps to describe it. For each subject page, we calculate and show the people who’ve written most about that subject, the types of things that the subject mostly contains, and the other subjects that the subject is often seen with (‘co-occurrences’).
The list of co-occurring subjects presents an interesting design challenge: do we just link to those subjects, letting you go ‘sideways’, or link to a page showing things tagged with both subjects? For the time being, we do both – the latter link signified with a ‘+’.
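The co-occurrence count can be sketched in a few lines, assuming each item is represented simply as a list of its subject labels. The function name and the sample data are made up for illustration.

```python
from collections import Counter

def co_occurring_subjects(items, subject, top_n=5):
    """Count which other subjects appear on items tagged with `subject`."""
    counts = Counter()
    for subjects in items:
        tagged = set(subjects)           # ignore duplicate tags on one item
        if subject in tagged:
            counts.update(tagged - {subject})
    return counts.most_common(top_n)

# Made-up sample data: each inner list is one catalogue item's subjects.
items = [
    ["Beards", "Hair"],
    ["Beards", "Hair", "Fashion"],
    ["Thumb", "Hand"],
]
top = co_occurring_subjects(items, "Beards")
```

The same pattern, counting authors or item types instead of subjects, would give the other two lists on the page.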
What next? We’d like to link subjects to Wikipedia pages, and any other sources that might be interesting or useful (news topics? international disease classifications? geonames?). Some of these mappings may even already exist (the US National Library of Medicine has a metathesaurus which looks promising).
There’s also some extra structure in the MeSH headings that we deliberately flattened: ‘qualifiers’. These allow subjects to be further narrowed by the addition of phrases like ‘prevention of’ or ‘adverse effects’. Our process of flattening means that where we encountered these qualified subjects, we tagged the item with both the qualified and un-qualified subject. This feels to us like the right trade-off of simplicity to expressiveness – but we could decide to retain some relationship between the terms, so that we can at least link between them.
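The flattening rule can be sketched as follows, assuming a qualified heading arrives as a single string with the qualifier after a ‘/’ separator. That separator, the function name and the sample heading are assumptions for illustration; the source records may encode qualifiers differently.

```python
def flatten_heading(heading, separator="/"):
    """Return the subjects to tag an item with for one (possibly qualified) heading."""
    base, sep, qualifier = heading.partition(separator)
    base = base.strip()
    qualifier = qualifier.strip()
    if not sep or not qualifier:
        return [base]                    # unqualified: tag with the heading itself
    # Qualified: tag with both the qualified and the un-qualified subject.
    return [base + separator + qualifier, base]

tags = flatten_heading("Vaccines/adverse effects")
```

Keeping the pair together in one return value is also a natural place to record the relationship between the two terms, should we decide to link them later.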
Finally, the next logical step in improving the usefulness of the subjects metadata is to add an interface to allow editors – and perhaps even any external researcher? – to easily add subjects to an item.