Michael Pidd oversees the work of HRI Digital at the University of Sheffield’s Humanities Research Institute. He says ‘bad data’ is fine while it serves the researcher’s current purpose, but it presents big problems for anyone wishing to re-use it.

In the humanities we use digitisation and digital techniques to understand complex cultural artefacts such as books, manuscripts, documents and visual culture. However, during this process we often add a further layer of complexity, because the digital representations that we create can turn out to be unfaithful and even idiosyncratic. I am referring to ‘bad data’, whether it be uncorrected OCR, inaccurate re-keying or strange data models. Bad data is fine for as long as it serves the researcher’s current purpose, but it presents big problems for anyone wishing to re-use it.

Our new project has a ‘bad data problem’. The Linguistic DNA Project is funded by the AHRC and involves the universities of Sheffield, Glasgow and Sussex. The aim of the project is to understand the evolution of early modern thought by modelling the semantic and conceptual changes which occurred in English printed discourse between c.1500 and c.1800. Our datasets comprise Early English Books Online and Eighteenth Century Collections Online, using the following machine-readable versions: EEBO-TCP and ECCO-TCP (created by the Text Creation Partnership) and the OCR version of ECCO (created by Gale Cengage).

The total size of the dataset is estimated at 250,000 texts (37 million pages). Our approach will be to compare each word in the dataset with every other word, using a process of query expansion that will be informed by lexical measures such as frequency, proximity, density, syntax, and semantic relatedness using the Historical Thesaurus of the Oxford English Dictionary. We aim to identify semantic structures formed from multiple related keywords – concepts – and to track their evolution over time with the aid of macroscopic data visualisations. The project is entirely data driven and therefore entirely reliant on the quality of its data. Therein lies the rub.
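To give a rough flavour of the kind of computation involved (and not to describe the project’s actual pipeline), the sketch below gathers candidate associates for a seed word by counting co-occurrences within a fixed window and ranking them by pointwise mutual information. The function names, the window size and the PMI scoring are assumptions made for the sketch; the project itself will draw on the richer measures of proximity, density, syntax and semantic relatedness listed above.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=10):
    """Count word frequencies and within-window word-pair frequencies."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, w in enumerate(tokens):
        # Pair each token with the tokens that follow it inside the window.
        for v in tokens[i + 1:i + 1 + window]:
            pair_freq[tuple(sorted((w, v)))] += 1
    return word_freq, pair_freq

def expand_query(seed, word_freq, pair_freq, total_tokens, top_n=20):
    """Rank words co-occurring with a seed word by pointwise mutual information."""
    scores = {}
    for (a, b), n_ab in pair_freq.items():
        if seed not in (a, b) or a == b:
            continue
        other = b if a == seed else a
        pmi = math.log((n_ab * total_tokens) / (word_freq[seed] * word_freq[other]))
        scores[other] = pmi
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Called as `expand_query('liberty', *cooccurrence_counts(tokens), len(tokens))`, this would return candidate members of a concept cluster around ‘liberty’; the project combines several such measures, together with the Historical Thesaurus, rather than relying on a single statistic.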

Historical linguistics hampered by the quality of digitised collections

Identifying concepts and conceptual change across a reliable dataset such as EEBO-TCP will be challenging enough (although EEBO-TCP is not without re-keying issues). For example, spelling variation in this period means that we have to normalise the dataset using spelling regularisation tools such as VARD and MorphAdorner before we can even begin gathering simple statistics such as word frequency. Then there is genre, metaphor and foreign vocabulary to contend with.
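To illustrate why regularisation has to come before even simple counting, the sketch below applies a variant-to-standard lookup table of the kind a tool such as VARD or MorphAdorner might produce, then counts frequencies over the normalised tokens. The table entries and function names are illustrative assumptions, not output from either tool.

```python
from collections import Counter

# A hypothetical variant-to-standard lookup of the kind a spelling
# regularisation tool would produce; these entries are illustrative only.
NORMALISATION_TABLE = {
    "vertue": "virtue",
    "soveraigne": "sovereign",
    "parlament": "parliament",
}

def normalise(tokens, table=NORMALISATION_TABLE):
    """Replace early modern spelling variants with regularised forms."""
    return [table.get(t.lower(), t.lower()) for t in tokens]

def word_frequencies(tokens):
    """Count frequencies over the normalised token stream, so that variant
    spellings of the same word are not counted separately."""
    return Counter(normalise(tokens))
```

Without this step, ‘vertue’ and ‘virtue’ would be counted as two different words and every downstream statistic would be skewed.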

Yet this will be trivial when compared with the task of identifying concepts and conceptual change across approximately 32 million pages of uncorrected OCR in ECCO. The Early Modern OCR Project (eMOP) has estimated ECCO’s character accuracy at 89 percent, equivalent to roughly 3.5 million pages of unreadable text (around 11 percent of the collection). This means that the accuracy of our algorithms will be seriously disadvantaged by the unreliability of the data itself. In an era when corpus linguistics ought to be superseded by data-driven approaches, historical linguistics remains hampered by the quality of digitised collections.

So how will we overcome the problem of uncorrected OCR in ECCO? First, we hope that eMOP will produce a better-quality OCR version from the original ECCO images. Second, we need to establish the extent to which OCR errors cluster (e.g. due to poor-quality pages), in the hope that we can exclude ‘bad pages’ from the dataset. Further, it may be that more accurate versions of some of these texts already exist thanks to other re-keying initiatives such as Project Gutenberg, or that we can substitute the 18th-century texts where digitised versions of 19th-century reprints exist. We can always live in hope.
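A minimal sketch of that second idea, assuming each page arrives as a list of tokens and that the share of tokens found in a reference lexicon is an acceptable proxy for OCR quality; the 0.8 threshold is purely illustrative and would need calibrating against pages of known quality.

```python
def dictionary_ratio(page_tokens, lexicon):
    """Share of a page's tokens found in a reference lexicon: a crude proxy
    for OCR quality."""
    if not page_tokens:
        return 0.0
    hits = sum(1 for t in page_tokens if t.lower() in lexicon)
    return hits / len(page_tokens)

def filter_pages(pages, lexicon, threshold=0.8):
    """Keep only pages whose estimated OCR quality clears the threshold."""
    return [page for page in pages if dictionary_ratio(page, lexicon) >= threshold]
```

The appeal of excluding pages, rather than correcting them, is that it discards unreliable evidence without inventing any new text.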

At the very least we will need to avoid developing algorithms that set out to correct ECCO or compensate for its problems, because this will simply add further unreliability to the data. Ultimately, re-keying or advancements in image recognition will be the only way to produce an accurate machine-readable ECCO, but I suspect that the Linguistic DNA project will be finished long before then.

Most bad data in the digital humanities is created by ourselves

For Gale Cengage and many of ECCO’s users, this uncorrected OCR serves its primary purpose as a finding aid, because most users are interested in locating and viewing the images of specific printed pages, even though 11 percent of the collection cannot be searched reliably. Similarly, the British Library’s British Newspapers 1600-1950 has a very low accuracy rate, yet it remains one of the most frequently used online resources according to Jisc. This type of bad data only becomes really problematic when we want to move beyond simple keyword searching and do the sorts of research that were never envisaged by the original digitisation programmes, such as computational language analysis.

However, it is not just large-scale digitisation programmes that produce data that is unfit for re-use in new types of research. Let us be honest. A great deal of bad data in the digital humanities is created by ourselves, whether we be scholars, researchers, developers or technicians.

This became very apparent to us during Jisc’s Connected Histories and Manuscripts Online projects. The aim was to create a federated search service using multiple online datasets. Most of our time was devoted to trying to understand the scope and structure of the datasets in order to harmonise them for consistent search. Few medievalists transcribe Middle English texts using the same protocols, and few early modern historians develop data models that can be understood without copious amounts of documentation.

Taken separately, the datasets were all intelligible, but as soon as we tried bringing them together they became a ragbag of methods, formats and standards. In some instances, a database would be so ‘relational’ that it was almost impossible to determine what constituted an individual ‘record’ from the user’s perspective.

Even our own Old Bailey Online and London Lives websites are not innocent when it comes to bad data structures inhibiting re-use. The use of record linkage algorithms on Old Bailey data as part of the Digital Panopticon project has revealed errors in the original named entity tagging which we have recently had to fix.

Meanwhile, the London Lives website includes a transcription of the Criminal Registers, historical documents that are predominantly tabular in layout. Originally we decided against capturing the structure of these documents because we were only interested in facilitating keyword searching of the text. Further, London Lives provides access to images of the originals, so it is easy for the user to understand how these documents are structured. However, five years later, the inclusion of this dataset within the Digital Panopticon project has required us to computationally insert the tabular structure because these structures are valuable aids for record linkage algorithms.
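To illustrate why that tabular structure matters (and without describing the Digital Panopticon’s actual algorithms), the sketch below links records only when a name field and a year field can be compared directly, which is impossible if a register survives as an undifferentiated stream of keywords. The field names and thresholds are invented for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in the range 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(trials, registers, name_threshold=0.85, year_tolerance=1):
    """Pair trial records with register entries when names are close and
    years roughly agree. The fields 'name', 'year' and 'id' are invented
    for this illustration."""
    links = []
    for trial in trials:
        for entry in registers:
            if abs(trial["year"] - entry["year"]) > year_tolerance:
                continue
            if similarity(trial["name"], entry["name"]) >= name_threshold:
                links.append((trial["id"], entry["id"]))
    return links
```

Restoring the columns is what turns a searchable transcription into fields that a linkage algorithm of this general kind can actually compare.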

So why is our data so bad when it comes to re-use?

In the UK at least, funders have generally favoured digitisation that aids specific lines of enquiry rather than digitisation for discovery purposes. So although funding application guidelines such as the AHRC’s Technical Plan include requirements for applicants to make their data re-usable, the reality is that our datasets end up being structured so that we can ask very specific types of questions.

In theory this problem should be resolved through good data modelling and the inclusion of Web APIs that enable data to be transformed into a choice of re-usable formats. However, as all my examples illustrate, the majority of datasets that we find ourselves wanting to re-use were created some time ago, and they are not necessarily resourced in a manner that enables them to be continuously upgraded to meet our current research needs. Even commercially maintained datasets such as ECCO and the British Newspapers would probably need to see a significant dip in library subscriptions before their owners contemplated investing in better machine-readable data.

However, if we want our research to be data-driven, using large and broad datasets, so that we can ask the types of research questions that were unthinkable ten years ago, we need to address the disjuncture between our expectations and the reality of our data. Whether they be large-scale digitised resources or collections of bespoke databases, we must accept that the majority of the data in the digital humanities is not fit for these types of re-use. We must accept that the majority of time spent on any data-driven research project that involves digitised surrogates of historical sources will involve fixing the data to make it usable.

Michael Pidd is the HRI Digital Director at the Humanities Research Institute, University of Sheffield. His role includes providing strategic direction in developing its research base through partnerships with the research community in public and commercial organisations, initiating knowledge exchange opportunities on behalf of the faculty, and overseeing product/system design, build and delivery by the technical team.