Michael Pidd oversees the work of HRI Digital at the University of Sheffield’s Humanities Research Institute. He has been working on the Digital Panopticon project, which involves bringing together a large body of third-party data, and believes there is still a considerable gap between the aspirations of open data policy and the reality of data access, particularly in the humanities where there is a trend for publicly-created data to become commercially-curated.

The aim of the Digital Panopticon is to explore the impact of different types of punishments on people sentenced at the Old Bailey between 1780 and 1875, the period of transportation to penal colonies in Australia. It is funded by the Arts & Humanities Research Council (AHRC), and involves the universities of Liverpool, Sheffield, Tasmania, Oxford and Sussex.

It brings together genealogical, biometric and criminal justice datasets held in the UK and Australia in order to trace the life courses of 90,000 convicts using nominal record linkage and data visualisation techniques. The data comes from two types of source: publicly available datasets that have been created by academics as part of previously funded research projects, and datasets that can only be accessed via commercial content providers such as Ancestry.com and Find My Past.
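
Nominal record linkage means matching records across datasets by personal name, usually alongside other attributes. As a rough illustration only, and not the project's actual method, the idea could be sketched in Python as follows. The field names, the similarity threshold, the birth-year tolerance and the sample records are all assumptions invented for this example.

```python
from difflib import SequenceMatcher

# Invented sample records; field names are assumptions for illustration.
TRIAL_RECORDS = [
    {"id": "t1", "name": "John Smith", "birth_year": 1801},
]
PRISON_RECORDS = [
    {"id": "p1", "name": "Jon Smith", "birth_year": 1802},
    {"id": "p2", "name": "Mary Jones", "birth_year": 1801},
]


def name_similarity(a, b):
    """Return a 0..1 similarity score between two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_records(left, right, name_threshold=0.85, year_tolerance=2):
    """Pair records whose names are similar and whose birth years are close.

    Both thresholds are arbitrary assumptions, not values from the project.
    """
    links = []
    for l in left:
        for r in right:
            # Skip pairs whose birth years are too far apart.
            if abs(l["birth_year"] - r["birth_year"]) > year_tolerance:
                continue
            score = name_similarity(l["name"], r["name"])
            if score >= name_threshold:
                links.append((l["id"], r["id"], score))
    return links


LINKS = link_records(TRIAL_RECORDS, PRISON_RECORDS)
```

Comparing every record with every other, as this sketch does, would not scale to millions of items; real linkage work typically narrows the candidate pairs first.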

The quantity of Ancestry and Find My Past data the project was interested in is considerable – four million items, compared with 1.2 million items from open access sources. We wanted permission to process all of their data as part of our record linkage work and then make the data freely available on our website for other people to use. However, the business models of Find My Past and Ancestry.com are predicated on subscription-based access, so they were not going to let us do this.

Negotiations took at least 18 months, and we eventually agreed that each company will permit us to process the datasets in their entirety, but only make parts of the data publicly accessible. In other words, users will be able to search and visualise the datasets in their entirety, but the text of each individual data record will be restricted to specific categories of metadata, such as displaying the prisoner’s name, offence and year of conviction only. Anyone wishing to view a data record in its entirety will be directed to the subscription-only Ancestry.com or Find My Past website. In effect, we will become a point of sale for both companies, potentially introducing new customers to their products.
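
As a rough illustration only, and not the project's actual code, the arrangement described above (search everything, display only a few fields, link out for the rest) could be sketched in Python like this. The field names, the record layout and the example URL are assumptions invented for this sketch.

```python
# Fields cleared for public display. The article names name, offence and
# year of conviction; the exact keys used here are assumptions.
PUBLIC_FIELDS = ("name", "offence", "conviction_year")

# Invented sample record with a placeholder provider URL.
SAMPLE_RECORDS = [
    {
        "data": {
            "name": "Example Person",
            "offence": "theft",
            "conviction_year": 1820,
            "ship": "Example Ship",
        },
        "provider_url": "https://provider.example/record/1",
    },
]


def public_view(data, provider_url):
    """Keep only the permitted fields and add a link to the full record."""
    view = {field: data[field] for field in PUBLIC_FIELDS if field in data}
    view["full_record_url"] = provider_url
    return view


def search(records, term):
    """Match against every field, but return restricted views only."""
    term = term.lower()
    hits = []
    for record in records:
        values = record["data"].values()
        if any(term in str(value).lower() for value in values):
            hits.append(public_view(record["data"], record["provider_url"]))
    return hits


HITS = search(SAMPLE_RECORDS, "example ship")
```

Here a search on the ship name finds the record, but the returned view omits the ship and points the user to the provider's site for the complete entry.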

Cost recovery models present problems for any open access agenda

This was a good result for us. But it’s worth contemplating why we had to approach them in the first place, given all these datasets are digitised versions of collections in The National Archives (TNA). TNA digitised these collections at its own cost, generated summary metadata, and then agreed exclusive licences with Ancestry and Find My Past. These licences permit both companies to give users paid-for access to the images. Find My Past and Ancestry have subsequently produced a re-keyed version of the images at their own cost, creating searchable metadata that assists users who are trying to locate specific records.

TNA, Ancestry and Find My Past are not alone in this type of licensing arrangement. Many online resources essential to humanities research have been digitised by our public libraries and archives and then licensed to commercial content providers. For example, digitisation of the British Library Newspapers was funded by Jisc and then the images were licensed to Gale Cengage. Gale generated searchable OCR text from the images and now charges institutional subscriptions for access to the combined product. It’s a similar story for historical UK census data. I am not necessarily critical of these types of relationships or attempts to recover cost, but they do present problems for any open access agenda.

The problem for humanities researchers is that we rely on access to physical collections in libraries, archives and repositories. This means that before we can interrogate them digitally somebody has to pay for them to be digitised. The digitisation model often adopted by some of our largest public libraries and archives involves cost recovery through subsequent licensing arrangements.

This effectively locks huge swathes of source material behind commercial paywalls. This is not too problematic for the family historian or traditional scholar interested in locating a few, specific documents. But for a historian who might want to identify trends and anomalies across large bodies of data, this is a big problem (the big problem with big data in the humanities).

Complex value chain created by digitisation and licensing ‘not being addressed’

Such licences are not perpetual, and so one could argue that there is a point when this data could be returned to the public domain or accessed through alternative channels. However, the IP that companies such as Ancestry and Find My Past have added to the images through re-keying makes their products far more valuable than the images alone. Could TNA ever provide a comparable product? If the images were released into the public domain, could accompanying public domain transcriptions emerge?

My involvement in open data activities over the past few years has left me surprised that the complex value chain created by digitisation and licensing is not being addressed. Current open data policy initiatives seem to have an overwhelming focus on STEM research, which can have quite different challenges (this focus is understandable, given the economic potential of open science).

Even the EU’s FP7 RECODE project, which sought to include a humanities case study, opted for archaeology, which one could argue is unrepresentative of the humanities disciplines in terms of its data. Digital data in archaeology is often captured on-site by the archaeologists themselves, and research funders seem comfortable with funding this type of digitisation. This means that archaeologists usually own their data. The impact on the discipline has been a long, admirable tradition of data re-use and sharing through established open data repositories such as the Archaeology Data Service and the International Tree-Ring Data Bank.

The money elephant in the room

This is not the case for other types of humanities research which rely on collections of primary sources that are held in our public libraries and archives. These collections can be on such a scale that research funders can be uncomfortable with meeting the cost of their digitisation. So when libraries and archives digitise them at their own cost, they are compelled to seek to recover those costs through licensing arrangements with commercial content providers.

It always comes down to the money. Unless the underlying problem of digitisation of our public heritage is addressed (who pays for it, and should public institutions recover their costs by selling it?), open data policy will have little impact on the research data that really matters for humanists.

For the humanities, this is the elephant in the room during most open data policy events. Fortunately, my experience of working with companies such as Find My Past, Ancestry, ProQuest and Gale Cengage has been rewarding. They have always been willing to facilitate the type of access and use of data that we desire while not compromising their business model. But I wonder if they would remain this accommodating if our kinds of requests were to become commonplace from academics.

Michael Pidd is the HRI Digital Director at the Humanities Research Institute, University of Sheffield. His role includes providing strategic direction in developing its research base through partnerships with the research community in public and commercial organisations, initiating knowledge exchange opportunities on behalf of the faculty, and overseeing product/system design, build and delivery by the technical team.