Author Archives: frankieroberto

In part one of this two-part series about how we went about building an Alpha website for Wellcome Library, we looked at how we turned ‘subject headings’ into webpages.

This post looks at the second major type of aggregation pages we settled on: people.

At first we were tempted to refer to these as ‘authors’, using the language of books, but of course the library isn’t just books, and so sometimes the people might be editors, collaborators, artists (the library has an art collection too), scientists credited on academic papers, and so on.

Within the MARC metadata we were given, people are referenced mostly in the 100 field (‘Main Entry–Personal Name’), but also in the 700 field (‘Added Entry-Personal Name’). As far as we could make out, there’s only ever one person in the 100 field (with only a couple of exceptions), but there could be many in the 700. It wasn’t clear to us what the semantic difference was, so we took the decision to merge them all together.

Each person field contains a bunch of sub-fields for the person’s name, title (Mr, Mrs, Sir, etc) and dates (normally just birth and/or death), as well as some other lesser-used sub-fields like ‘numeration’ (e.g. the ‘II’ in Pope John Paul II) and ‘attribution qualifier’ (used for describing someone as the ‘pupil of’ an artist, when the actual artist is unknown).
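As a rough illustration of merging the 100 and 700 fields, here is a minimal sketch. The sub-field codes (a, b, c, d, j and 0) are the standard MARC 21 ones for personal names, but the shape of `record` (a dict of tag to a list of sub-field dicts) and the `Person` class are assumptions made for this example, not the structure we actually used.

```python
from dataclasses import dataclass
from typing import Optional

# MARC 21 sub-field codes for personal-name fields (100 and 700).
SUBFIELDS = {
    "a": "name",
    "b": "numeration",
    "c": "title",
    "d": "dates",
    "j": "attribution_qualifier",
    "0": "authority_id",
}


@dataclass(frozen=True)
class Person:
    name: str
    numeration: Optional[str] = None
    title: Optional[str] = None
    dates: Optional[str] = None
    attribution_qualifier: Optional[str] = None
    authority_id: Optional[str] = None


def people_in_record(record):
    """Collect people from the 100 and 700 fields, treating both alike.

    `record` is a hypothetical structure: a dict mapping a MARC tag to a
    list of fields, each field being a dict of sub-field code -> value.
    """
    people = []
    for tag in ("100", "700"):
        for field in record.get(tag, []):
            if "a" not in field:
                continue  # no name sub-field, nothing to show
            known = {
                SUBFIELDS[code]: value
                for code, value in field.items()
                if code in SUBFIELDS
            }
            people.append(Person(**known))
    return people
```

Any sub-field not listed in `SUBFIELDS` is simply ignored here.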

One awkward stumbling block was that the name of the person followed the library tradition of being in ‘surname-comma-firstnames’ format. This convention makes it easy for computer systems to sort by surname, which historically has probably been the most useful order for readers. But we felt strongly that it is the least user-friendly way of actually reading people’s names, as it inverts the natural order of the way we pronounce people’s full names (no-one talks about ‘Hawking Stephen’, but ‘Stephen Hawking’ is a household name). Switching the order back sounds like a simple task (split the string at the last comma, then reverse the order), and mostly is, but there are always exceptions – and where we encounter strings like “Peter, of Celle, Bishop of Chartres,ca”, it’s a bit harder to turn these back into more readable names.
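The simple version of that switch can be sketched in a few lines. This is only an illustration of the ‘split at the last comma’ idea (the function name is made up for the example), and it shows why the awkward cases need more than this.

```python
def invert_name(catalogue_name: str) -> str:
    """Turn 'Surname, Firstnames' into 'Firstnames Surname'.

    Splits at the last comma only. Names without a comma are returned
    unchanged (apart from surrounding whitespace).
    """
    surname, comma, forenames = catalogue_name.rpartition(",")
    if not comma:
        return catalogue_name.strip()
    return f"{forenames.strip()} {surname.strip()}".strip()


print(invert_name("Hawking, Stephen"))
# The naive rule falls over on the harder strings:
print(invert_name("Peter, of Celle, Bishop of Chartres,ca"))
```

The first call gives ‘Stephen Hawking’; the second gives ‘ca Peter, of Celle, Bishop of Chartres’, which is clearly not a readable name.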

With our goal being to make the library catalogue browsable (rather than just searchable), our next task was to find ways to enrich the information about the people in the database, helping readers to find out more about them (which may in turn shed some light on what the content of the book is likely to be).

Like with subjects, many of the 100 and 700 people fields contain an ID linking the person to an external authority file. Unlike with subjects though, we only encountered a single authority file in use: the Library of Congress Name Authority.

Where they existed, we could use these IDs to make sure that multiple books by the same person would appear on the same single person page, even if their name was spelt out or punctuated differently on the different records.

It would have been tempting to use these Library of Congress IDs within the URL structure of the Alpha site. But because they weren’t always present (either because that person isn’t in the LOC authority file, or just because the record hasn’t been matched up), we couldn’t do that, and so minted our own IDs instead. For simplicity’s sake, these are simple numbers, but preceded by the letter ‘P’ (for person).
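One way to picture the minting of P-numbers is the sketch below. It assumes that records sharing an authority ID are the same person, and that records without one fall back to matching on a tidied-up name string; the class, the fallback rule and the example ID are all made up for illustration rather than taken from our code.

```python
import itertools


class PersonRegistry:
    """Hands out 'P' identifiers, re-using one when the person is already known."""

    def __init__(self):
        self._next_number = itertools.count(1)
        self._ids = {}

    def id_for(self, name, authority_id=None):
        if authority_id:
            # Same authority ID means same person, however the name is written.
            key = ("authority", authority_id)
        else:
            # Fallback (an assumption): match on a lower-cased,
            # whitespace-collapsed version of the name.
            key = ("name", " ".join(name.lower().split()))
        if key not in self._ids:
            self._ids[key] = f"P{next(self._next_number)}"
        return self._ids[key]


registry = PersonRegistry()
# 'n00000001' is a placeholder, not a real Library of Congress ID.
print(registry.id_for("Example, Ada, 1900-1980", "n00000001"))
print(registry.id_for("Example, Ada", "n00000001"))
```

Both calls return the same P-number, because the authority ID matches even though the name strings differ.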

We discovered an existing project called VIAF, which aims to link together name authority files from many different institutions across the globe. By querying this database with the Library of Congress IDs, we collected up all the other IDs that were available. This means we can construct links from the people pages on the Wellcome Library website to the equivalent pages on other catalogues, such as the national libraries of France, Germany, Spain, Canada, and many more.

Pleasingly, VIAF has also collected IDs referencing Wikipedia pages. As Wikipedia allows others to use its content under a Creative Commons licence, we could query the site (using its API) and display the content on our person pages. We decided to display the first two sentences (with a link to Wikipedia to read the full biography), on the basis that that’s usually enough information to get a sense of what the person is mostly known for. We also removed any text from Wikipedia in parentheses, as these are normally dates (which we show elsewhere), a pronunciation guide to their name, or other minor details that weren’t needed for a quick read.
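The trimming step might look something like this. It is a sketch that works on plain text already fetched from Wikipedia; the function names are invented for the example, and the sentence splitting is deliberately naive (it would be fooled by abbreviations such as ‘Dr. Smith’).

```python
import re

# Matches one innermost "(...)" group plus any whitespace before it.
_PARENTHESES = re.compile(r"\s*\([^()]*\)")


def strip_parentheses(text):
    """Remove parenthesised text, repeating so nested groups go too."""
    previous = None
    while previous != text:
        previous = text
        text = _PARENTHESES.sub("", text)
    return text


def first_sentences(text, count=2):
    """Keep the first `count` sentences, using a crude split rule:
    end punctuation, then whitespace, then a capital letter."""
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return " ".join(sentences[:count])


def short_intro(extract):
    return first_sentences(strip_parentheses(extract))


# A made-up extract, not real Wikipedia content.
extract = (
    "Ada Example (born 1 May 1900; pronounced /ex/) was a chemist. "
    "She studied dyes (mostly blue ones). She retired in 1960."
)
print(short_intro(extract))
```

For the made-up extract this prints ‘Ada Example was a chemist. She studied dyes.’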

As well as text, we also collected the images from the Wikipedia page, and use the first one (if there are any) within a circle to illustrate the person on both their person page and aggregation pages. This mostly works – where it’s a photo or drawing of the person, or even if it’s a scan of one of their works – but does sometimes show a slightly misleading image.

There was a small amount of concern over using Wikipedia as a source of content (although most were positive). One issue is what might happen if we pull the content from Wikipedia at a point in time when that page has been vandalised. We could mitigate that to some extent by regularly updating our content on a rolling schedule (and relying on the community to resolve) – but to allow for any major issues to be resolved more quickly than that, we added an admin feature to immediately refresh the content from Wikipedia. So if someone at Wellcome spots a page where the Wikipedia introduction is inaccurate or contains vandalised content, they can fix it on Wikipedia itself, and then have those changes reflected on the Wellcome Library page.

As well as the Wikipedia intro, we added a feature allowing Wellcome staff to add a separate intro to be displayed alongside it. Our rule of thumb here was that this intro should be specific to the Wellcome institution, rather than repeating the sort of general information that might be on a Wikipedia biography. So things like that person’s relationship to Wellcome (e.g. if it’s Henry Wellcome himself) or noting what sort of material from that person was available at the Wellcome Library (which could be quite a lot, if it’s one of the people whose personal archives are held there).

After these context-setting introductions and photo, we display some data about that person collected from the catalogue itself: things like the subjects their works are mostly about, a timeline of when their works were created/published and what format their works are mostly in. More experimentally, we tried displaying some links to other people who are the “contemporaries” of that person. This query changed a few times as we tried to refine it, and ended up being something along the lines of “people who have produced works about some of the same subjects and who were born within 10 years”. It sometimes works well, sometimes doesn’t.
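To make the “contemporaries” rule concrete, here is a small in-memory sketch. The data structure, the function name and the ranking by number of shared subjects are assumptions for illustration; the real query ran against the database and changed several times.

```python
def contemporaries(person_id, people, max_gap_years=10):
    """Find people sharing at least one subject and born within
    `max_gap_years` of the given person.

    `people` is a hypothetical dict: id -> {"born": int or None,
    "subjects": set of subject IDs}.
    """
    target = people[person_id]
    if target["born"] is None:
        return []
    matches = []
    for other_id, other in people.items():
        if other_id == person_id or other["born"] is None:
            continue
        shared = target["subjects"] & other["subjects"]
        if shared and abs(other["born"] - target["born"]) <= max_gap_years:
            matches.append((len(shared), other_id))
    # Most shared subjects first; ties broken by ID for a stable order.
    matches.sort(key=lambda match: (-match[0], match[1]))
    return [other_id for _, other_id in matches]
```

People with no known birth year are skipped, which is one reason the results can be patchy.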

Finally, we added the ability to highlight ‘interesting’ people to appear on the homepage.

Our last and most recent step was to go back and use an additional type of metadata that we originally missed: field 600, which contains people who are the subject of a work rather than its creator. Pleasingly, for these ‘person-as-subject’ pages we could re-use the simple URL structure for subject pages (/subjects/S1234), replacing the S-number with the person’s P-number. (One key benefit of differentiating your IDs for different types of things.)
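Because the letter prefix says what kind of thing an ID refers to, one URL pattern can serve both cases. A minimal sketch, assuming a hypothetical routing helper (the function and the labels it returns are not from the real site):

```python
import re

# One pattern for both kinds of page: /subjects/S1234 or /subjects/P1234.
SUBJECT_PATH = re.compile(r"^/subjects/([SP])(\d+)$")


def parse_subject_path(path):
    """Return (kind, number) for a subject URL path, or None if it doesn't match."""
    match = SUBJECT_PATH.match(path)
    if match is None:
        return None
    prefix, number = match.groups()
    kind = "subject" if prefix == "S" else "person_as_subject"
    return kind, int(number)


print(parse_subject_path("/subjects/S1234"))
print(parse_subject_path("/subjects/P1234"))
```

If the IDs were bare numbers, the same path could not tell a subject from a person without a second URL scheme.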

As part of building the Wellcome Library Alpha, one issue we had to grapple with was ‘subjects’. We knew we wanted these to be a core part of the browsing and discovery experience, as these are crucial to understanding what the collection is about.

However, ‘subjects’ have a long and varied history within the world of libraries. Fundamentally this is because, unlike the data about a book’s title, authors, page count and so on, all of which are actually printed in the book, a book’s subjects are subjective.

You could imagine a system whereby a librarian who is cataloging a book gets to write a whole paragraph of carefully considered prose about what a book is about. Actually you don’t have to imagine: this is pretty much what art curators do.

But whilst a paragraph of prose accurately describing what a specific book is about would be super useful once you’d got to an individual catalogue record, it’s less useful for searching (not to mention that librarians probably don’t have the time).

So instead, libraries use lists of subject terms (which are called ‘headings’ – because they were once headings on actual pieces of card).

These terms can be ‘controlled’ – i.e. only a limited set can be used, with control over adding/removing terms held by some group, or ‘uncontrolled’, in which case new terms can be made up on the spot at the point of cataloguing.

Wellcome Library uses a mixture of these. Some subjects are entered as free text, with any consistency down to the individual cataloger. Others are referenced against an external controlled vocabulary.

And there isn’t just one external list of subjects in place – there are many. The main two are MeSH, which is Medicine-specific and controlled by the U.S. National Library of Medicine, and LCSH, controlled by the U.S. Library of Congress. Other minor vocabularies in use include one designed for use in Children’s Libraries.

Some of the differences between these subject vocabularies are pretty minor: things like capitalisation, pluralisation, or the presence of an extra full stop at the end of the phrase. These don’t matter too much if your main interface is search (so long as your search engine can support fuzzy matches), but we wanted to be able to show things like the top subjects across the collection.

So we spent quite a bit of time merging these subjects together. It’s a big job though – whilst we could handle the minor differences automatically, others require manual intervention (such as knowing that “World War II” and “1939-1945 World War” refer to the same event).
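The split between automatic and manual merging could be sketched like this. The automatic part below only handles capitalisation, stray whitespace and trailing full stops; pluralisation is left out because a naive rule would merge things it shouldn’t. The function names and the manual table (containing just the example mentioned above) are illustrative assumptions.

```python
# Hand-maintained equivalences that no automatic rule would spot.
MANUAL_MERGES = {
    "1939-1945 world war": "world war ii",
}


def merge_key(heading):
    """Reduce a subject heading to a key; headings with equal keys get merged."""
    key = " ".join(heading.split())   # collapse runs of whitespace
    key = key.rstrip(". ").lower()    # drop trailing full stops, ignore case
    return MANUAL_MERGES.get(key, key)


def group_headings(headings):
    """Group the original heading strings by their merge key."""
    groups = {}
    for heading in headings:
        groups.setdefault(merge_key(heading), []).append(heading)
    return groups


print(group_headings(["World War II.", "1939-1945 World War", "world war II"]))
```

All three headings in the example end up in a single group under the key ‘world war ii’.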

Both MeSH subjects and LOC subjects have IDs within those schemes. Because we’re merging them together though – and because there are also plenty of free-text subjects within the Wellcome Library catalogue – we minted a new Wellcome-specific identifier for subjects, the ‘S-number’ (visible in the URL). However we retain the IDs within other schemes as concordances, and they’re listed at the bottom of each subject page.

Finally, the controlled subject vocabularies aren’t always just flat lists of terms. In the case of MeSH, the terms are organised into hierarchies, and each term also has a list of synonyms and a ‘scope note’, which is a sort-of description of the subject (albeit probably written more to aid catalogers than library users). We imported all of this extra metadata, making use of the synonyms within search, the hierarchies for browse, and the scope notes for context. They’re all a bit weird. Within the MeSH hierarchy, the subject ‘Thumb’ is buried deep within ‘Body Regions’ (inside ‘Fingers’, inside ‘Hand’, inside ‘Upper Extremity’, inside ‘Extremities’), but ‘Breast’ and ‘Perineum’ are immediate children. And ‘milk’ is described in the scope notes as ‘The white liquid secreted by the mammary glands. It contains proteins, sugar, lipids, vitamins, and minerals.’

Treating subjects as top-level objects within our design and database structure, rather than just attributes of a record, gives us a place to add extra editorial content too. So editors can add Wellcome-specific introductions, including links to relevant blog posts. For an example see ‘Beards’. We also added a way to flag subjects as ‘interesting’, so that we could show a selection of those on the homepage and subject pages.

The aggregated set of things about a given subject also helps to describe it. For each subject page, we calculate and show the people who’ve written most about that subject, the types of things that the subject mostly contains, and the other subjects that the subject is often seen with (‘co-occurrences’).
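Counting co-occurrences is straightforward once each item is reduced to its set of subjects. A minimal in-memory sketch (the function name and input shape are assumptions; in practice this sort of count would more likely be a database query):

```python
from collections import Counter


def co_occurring_subjects(subject, items, limit=5):
    """Count which other subjects appear on the same items as `subject`.

    `items` is an iterable of sets of subject IDs, one set per catalogue item.
    Returns (subject, count) pairs, most frequent first.
    """
    counts = Counter()
    for subjects in items:
        if subject in subjects:
            counts.update(s for s in subjects if s != subject)
    return counts.most_common(limit)


items = [{"S1", "S2", "S3"}, {"S1", "S2"}, {"S2", "S3"}, {"S1"}]
print(co_occurring_subjects("S1", items))
```

Here S2 shares two items with S1 and S3 shares one, so S2 is listed first.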

The list of co-occurring subjects presents an interesting design challenge: do we just link to those subjects, letting you go ‘sideways’, or link to a page showing things tagged with both subjects? For the time being, we do both – the latter link signified with a ‘+’.

What next? We’d like to link subjects to Wikipedia pages, and any other sources that might be interesting or useful (news topics? international disease classifications? geonames?). Some of these mappings may even already exist (the US National Library of Medicine has a metathesaurus which looks promising).

There’s also some extra structure in the MeSH headings that we deliberately flattened: ‘qualifiers’. These allow subjects to be further narrowed by the addition of phrases like ‘prevention of’ or ‘adverse effects’. Our process of flattening means that where we encountered these qualified subjects, we tagged the item with both the qualified and un-qualified subject. This feels to us like the right trade-off of simplicity to expressiveness – but we could decide to retain some relationship between the terms, so that we can at least link between them.
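The flattening rule can be sketched as follows. The (heading, qualifier) pairs, the ‘ / ’ separator used to build the qualified tag and the function names are assumptions for illustration only.

```python
def flatten_qualified(heading, qualifier=None):
    """Return the flat tags for one heading: the plain heading, plus the
    qualified form when a qualifier is present."""
    if qualifier is None:
        return [heading]
    return [heading, f"{heading} / {qualifier}"]


def tags_for_item(qualified_subjects):
    """Flatten every (heading, qualifier) pair on an item, without duplicates."""
    tags = []
    for heading, qualifier in qualified_subjects:
        for tag in flatten_qualified(heading, qualifier):
            if tag not in tags:
                tags.append(tag)
    return tags


print(tags_for_item([("Vaccines", "adverse effects"), ("Vaccines", None)]))
```

The item ends up tagged with both ‘Vaccines’ and ‘Vaccines / adverse effects’, but nothing records that the second is a narrowing of the first, which is the relationship we might later want to keep.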

Finally, the next logical step in improving the usefulness of the subjects metadata is to add an interface to allow editors – and perhaps even any external researcher? – to easily add subjects to an item.

As part of the work we’ve done on the new Waddesdon Bequest Explorer with the British Museum, we made a widget which depicts the volume of an object (as a cuboid), next to a tennis ball for scale, and thought that other people might be able to use it. So, we’ve extracted it into a public GitHub repository called dimension-drawer.

This seems like a small thing, but it can be hard to get a sense of how big a museum object is, when looking at its photo online – especially when that photo is artfully shot on a black background.

Luckily for us, ‘dimensions’ of an object are one bit of metadata that museums routinely store across their entire collection. I suspect this is mainly for practical purposes (“how big a box will I need to transport this object in?”), as there would be little point in displaying it on a label in a gallery, when you can look at the real thing.

The tool outputs the drawing as SVG (Scalable Vector Graphics), an XML-based format which works in most modern browsers (even Internet Explorer). You can even style it using regular ol’ CSS.

The cuboid in the diagram is drawn using the ‘Cabinet Projection’, which is a sort of fake 3D with parallel lines (instead of a vanishing point). This made the Maths easier (I dusted off my school-age memory of Trigonometry), and also seemed like a pleasing throwback to the age of the collection. (Cabinet Projection was traditionally used in technical drawing by furniture makers).
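For anyone curious about the Maths, here is a sketch of the projection itself. It is written in Python for illustration and is not the gem’s Ruby code. Cabinet projection conventionally draws the depth axis at half length and at a fixed angle; the 45 degree default below is an assumption, not necessarily the angle dimension-drawer uses.

```python
import math


def cabinet_project(x, y, z, angle_degrees=45.0, depth_scale=0.5):
    """Project a 3D point onto the 2D drawing plane.

    Width (x) and height (y) are drawn at true size; depth (z) is drawn
    along a slanted line at `depth_scale` of its real length.
    """
    angle = math.radians(angle_degrees)
    return (
        x + depth_scale * z * math.cos(angle),
        y + depth_scale * z * math.sin(angle),
    )


def cuboid_corners(width, height, depth):
    """Return the 8 projected corners: the front face, then the back face."""
    corners = []
    for z in (0, depth):
        for x, y in ((0, 0), (width, 0), (width, height), (0, height)):
            corners.append(cabinet_project(x, y, z))
    return corners


for corner in cuboid_corners(30, 20, 10):
    print(corner)
```

These coordinates have y pointing upwards; SVG’s y axis points down, so the y values would need flipping before being written out as a drawing.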

Why a tennis ball? The size was right for our purposes (some of the objects are smaller, some bigger), we thought a tennis ball would be universally recognisable, and it’s simple to draw. We’ve already had a request for the option of displaying a rugby ball instead though!

You can see how it looks for a collection of almost 300 objects in the new Waddesdon Bequest collection explorer we made for the British Museum.

We’ve published the Ruby code behind the tool as a gem, so that other people can use it. There are instructions on the GitHub page.

I’m sure there are lots of ways the code could be improved, or new features added, so if you have any ideas, get in touch.