A debate seems to come up whenever folks in charge of organizing digital collections get together. Standardized schemas such as the Library of Congress Subject Headings are annoying to read, outdated, and rigidly hierarchical, but they’re better than tags in one way: they’re organized.
And while that might be okay for library books, museum archives, and other lovingly-put-together collections, what do we do with other, more organic collections like Flickr images, all the Twitter posts you’re trying to follow, or the thousands of recipes on allrecipes.com? One could imagine sitting down and creating a classification schema that would encompass each of these domains, but that ignores the impossible task of then assigning each blog, image, or recipe a place in the organization.
Enter Castanet (research papers here), a tool that automatically creates browsing structures from whatever metadata or data happens to be in a collection. Of course, as with the topic-modeling tool LDA (an emerging favorite for humanities researchers wishing to exploit natural language processing technologies, examples here and here), the results aren’t perfect, but they’re actually not a bad place to start. Castanet automatically carves a sub-structure out of the hierarchical concept dictionary WordNet (http://wordnet.princeton.edu) and matches each item in the collection to one or more appropriate places within that hierarchy. Then, after some automated trimming and flattening, the result is a hierarchical browsing system.
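To make that pipeline a bit more concrete, here is a minimal sketch of the general idea in Python. It is not Castanet's actual code: the tiny `TOY_HYPERNYMS` dictionary is a made-up stand-in for WordNet, the function names are hypothetical, and the "flattening" step shown (collapsing chains of single-child categories that hold no items) is just one plausible simplification, assumed for illustration.

```python
from collections import defaultdict

# Hypothetical toy stand-in for WordNet: each term maps to its parent (hypernym).
TOY_HYPERNYMS = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "mammal": "animal", "animal": "entity",
    "oak": "tree", "tree": "plant", "plant": "entity",
}

def hypernym_path(term):
    """Return the chain from the root down to `term` (entity, ..., term)."""
    path = [term]
    while path[-1] in TOY_HYPERNYMS:
        path.append(TOY_HYPERNYMS[path[-1]])
    return list(reversed(path))

def build_tree(items):
    """Merge the hypernym paths of every known tag into one tree.

    `items` maps an item id to its list of tags. Returns (children, members):
    the set of child nodes per node, and the set of items filed under each node.
    """
    children = defaultdict(set)
    members = defaultdict(set)
    for item, tags in items.items():
        for tag in tags:
            if tag not in TOY_HYPERNYMS:
                continue  # tag is unknown to the toy dictionary, so skip it
            path = hypernym_path(tag)
            for parent, child in zip(path, path[1:]):
                children[parent].add(child)
            members[tag].add(item)
    return children, members

def compress(node, children, members):
    """Flatten the tree: skip any node that has one child and no items."""
    kids = children.get(node, set())
    while len(kids) == 1 and not members.get(node):
        node = next(iter(kids))
        kids = children.get(node, set())
    return {
        "name": node,
        "items": sorted(members.get(node, set())),
        "children": [compress(k, children, members) for k in sorted(kids)],
    }

# Example: three tagged photos; "park" is not in the toy dictionary.
photos = {"photo1": ["dog", "park"], "photo2": ["cat"], "photo3": ["oak"]}
children, members = build_tree(photos)
tree = compress("entity", children, members)
```

In this toy run the long chain entity, animal, mammal, carnivore collapses so that "carnivore" sits directly under the root with "dog" and "cat" beneath it, while the plant branch collapses all the way down to "oak". A real system would also have to deal with words that have several senses, which this sketch ignores entirely.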
It can be used with any kind of metadata. Last summer, for example, I used the algorithm to create this category system for the Flickr Commons images, just using the image tags (sometimes the link doesn’t work; if so, check out Castanet on other collections). The category system isn’t ideal: the names of the categories are a little weird at times, and perhaps a curator would want to organize the items differently. But these operations (renaming, moving, reclassifying) are much easier than manually creating the hierarchy in the first place.
While browsing structures are a lot less sexy than automated topic models, I think that Castanet could reduce the time it takes to create nice browsing interfaces for otherwise hard-to-navigate digital collections, a step towards making them more navigable and easy to use for the humanities researchers who depend on them.