Can a computer create a scholarly digital edition? Can a computer compare multiple texts, identify the order in which they were perhaps written, critically judge whether one text is ‘better’ than another, and then produce an authoritative variorum edition? The answer is increasingly: yes, the computer can do all of these things. In this post Michael J Pidd, who oversees the work of HRI Digital at the University of Sheffield’s Humanities Research Institute, explores how computational processes have become central to achieving the kinds of editions that we require today. It is based on a talk given in March at Cologne’s DIXIT Convention.

It is generally agreed that a scholarly edition is the selection, annotation and presentation of primary source evidence in ways that are reliable for scholarly use. A digital edition is one that, at a minimum, is delivered to its readers through digital rather than printed media. A digital edition can be entirely hand-crafted by humans, whereas a computer-aided edition uses computational techniques to perform some of the selection and annotation of primary source evidence: the computer takes on some of the critical thinking tasks that we traditionally associate with scholarly editing. A computer-aided edition does not necessarily result in a digital edition; the end result could be a printed book.

In Sheffield we have been developing scholarly editions for more than 20 years. During that time we have found ourselves asking computers to undertake more and more critical thinking tasks, as our desire to create editions from large collections of primary sources has increased while the funds to create them have decreased. On some projects the scholarly editor is now reduced to performing a critical review of work that is the product of algorithms.

For example, the Digital Panopticon project, led by the University of Liverpool, aims to trace the lives of 90,000 people who were convicted at the Old Bailey and sentenced to death, imprisoned or transported to Australia between 1780 and 1875. The end result will be an online edition of 45 historical sources and almost 90,000 biographies.

The convicts are incredibly well documented thanks to the obsessive and meticulous record-keeping of the British colonial prison system. We know everything about their lives, including health and biometrics, during and after their incarceration. However, what we do not know is whether ‘John Smith’ in Document A and ‘John Smith’ in Document B were the same person. This is the project’s chief technical challenge: nominal record linkage. We need to trace the lives of 90,000 individuals across 45 historical sources, and many of them have common names.

The task is a form of family history on speed. The scale and complexity of the decision-making process are such that we can only undertake the record linkage computationally, using methods developed by the project’s technical lead, Jamie McLaughlin. We have to train the computer to identify contextual information that tells us something about each person’s name, and then to compare this with information that appears in every other document. During the comparison the computer has to apply rules that are perhaps obvious to us, such as: a person cannot die twice; a person cannot be in multiple places at the same time; and a person is unlikely to live beyond the age of 120.
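To make this concrete, the short sketch below shows the general shape that such consistency rules might take in code. It is an illustration only: the field names, the sample records and the Python itself are invented for this post, and are not the Digital Panopticon’s actual data model or linkage method.

```python
from datetime import date

MAX_LIFESPAN = 120  # a person is unlikely to live beyond the age of 120


def consistent(record_a: dict, record_b: dict) -> bool:
    """Return False if two records cannot plausibly describe the same person."""
    # A person cannot die twice.
    if (record_a.get("event") == "death" and record_b.get("event") == "death"
            and record_a.get("event_date") != record_b.get("event_date")):
        return False

    # A person cannot be in multiple places at the same time.
    if (record_a.get("event_date") is not None
            and record_a.get("event_date") == record_b.get("event_date")
            and record_a.get("place") != record_b.get("place")):
        return False

    # A person is unlikely to live beyond 120.
    birth = record_a.get("birth_year") or record_b.get("birth_year")
    years = [r["event_date"].year for r in (record_a, record_b) if r.get("event_date")]
    if birth and years and max(years) - birth > MAX_LIFESPAN:
        return False

    return True


# Two 'John Smith' records that cannot describe the same man:
# they place him in two different locations on the same day.
a = {"name": "John Smith", "event": "trial", "event_date": date(1841, 3, 2), "place": "Old Bailey"}
b = {"name": "John Smith", "event": "muster", "event_date": date(1841, 3, 2), "place": "Van Diemen's Land"}
print(consistent(a, b))  # False
```

In a sketch like this the rules only ever rule a link out; the hard part of the real task is deciding which of the remaining candidate links to accept.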

We also have to train the computer to be cognisant of the problems that can occur in record keeping. For example, a convict might spell his or her name in different ways; documents might use different terminology to describe the same crime; and information might not be accurate due to lazy or unknowledgeable civil servants. Sometimes people simply disappear from the records, leaving us to ponder whether the disappearance is genuine (eg due to death) or whether our algorithms are wrong. This type of record linkage can be difficult for humans to unravel, especially when dealing with common names. It is just as difficult for computers.
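In practice, much of that training tends to come down to fuzzy matching and normalisation. The sketch below illustrates the general idea using Python’s standard library; the similarity threshold, the sample names and the tiny table of crime synonyms are assumptions made purely for illustration, not details of the project’s own approach.

```python
from difflib import SequenceMatcher

# Illustrative only: a real synonym list would be far larger and the
# threshold would be tuned against the historical sources themselves.
CRIME_SYNONYMS = {
    "larceny": "theft",
    "simple larceny": "theft",
    "stealing": "theft",
}


def normalise_crime(label: str) -> str:
    """Map different terms for the same offence onto one canonical label."""
    key = label.lower().strip()
    return CRIME_SYNONYMS.get(key, key)


def names_match(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Treat two names as a possible match if they are 'close enough',
    allowing for the spelling variation common in the records."""
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold


print(names_match("Catherine Connor", "Katherine Conner"))               # True
print(normalise_crime("Simple Larceny") == normalise_crime("stealing"))  # True
```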

Digital Panopticon reveals the most common reason for using computer-aided techniques: the scale of the data is too great and the complexity of the critical thinking task is too labour intensive for humans alone. Many years earlier we became acutely aware of the limits of human endeavour during the creation of our online edition of John Foxe’s Acts and Monuments (often referred to as his Book of Martyrs).

The John Foxe project aimed to create a variorum text of the four editions published during his lifetime. Each edition is in excess of 2,000 pages. With an average of 1,300 words per page, the complete dataset contains over 10 million words. The text was transcribed by multiple researchers, who also tagged name data, identified editorial movements and changes across the editions, and supplied a wealth of historical background information. The project started in 1992. We finished it in 2011, almost 20 years later!

The result is a hand-crafted edition that suffers from inconsistencies due to the scale and complexity of our ambitions. For example, the project only had time to tag names in two chapters of just one edition, and its attempt to trace variants manually across all four editions is difficult to verify. More problematically, in some places the project was unable to implement a consistent page numbering system that addressed the inconsistencies in the original publisher’s own pagination. This makes it quite difficult to correlate the digital version with the original printed books.

Many of these problems could have been effectively, consistently and more speedily addressed using computational techniques, but such approaches were beyond our imagination in 1992. The real achievement of Acts and Monuments Online is the quality of the text itself, which was painstakingly transcribed from the original editions.
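To give a sense of what the computational alternative can look like today: dedicated collation tools such as CollateX now align witnesses automatically, and even a few lines of standard Python can surface simple differences between two passages. The sketch below uses invented sample text rather than Foxe’s, and is not how Acts and Monuments Online was built.

```python
import difflib

# Two invented 'witnesses' standing in for passages from different editions.
edition_a = "In this yeare the Bishop was examined before the counsell".split()
edition_b = "In that yeare the Bishop was examined and condemned before the counsell".split()

# Align the tokenised witnesses and report where they diverge.
matcher = difflib.SequenceMatcher(None, edition_a, edition_b)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(f"{op}: {' '.join(edition_a[i1:i2])!r} -> {' '.join(edition_b[j1:j2])!r}")

# Output:
# replace: 'this' -> 'that'
# insert: '' -> 'and condemned'
```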

The Digital Panopticon and John Foxe projects suggest that digital editions are at their best when they have been created using techniques that are beyond the reach of labouring editors. This means using computational techniques to create new texts and new knowledge that are beyond the ability of human scholars or editors alone, due to the scale and complexity of the critical thinking tasks.

Yet we should not lose sight of a key problem with computers: they can only make decisions based on the data that is available to them, using the rules that they have been given. Computers are unable to generate scholarly annotations based on any wider knowledge or experience – that is where the scholar is invaluable. We should remember that computer-aided editions have benefited from the critical thinking of computers precisely because scholars have laboured to encode their wider knowledge and deductive processes as rule-driven algorithms. The Digital Panopticon’s record linkage works because Jamie McLaughlin and his colleagues have invested much time and energy training the computer to understand the characteristics and inconsistencies of human biography, historical documents and record keeping.

Computers and their algorithms do not simply ‘take over’ critical thinking tasks. They must be engineered. This might suggest that a new relationship between the scholar and the machine is emerging within the digital space, especially for scholarly editing. It is a relationship in which the scholar is one step removed from the text, but also one in which there is a greater requirement for the scholar to understand the reasoning and rules behind their own deductive processes, so that these can be encoded within the computer.

Michael Pidd is the HRI Digital Director at the Humanities Research Institute, University of Sheffield. His role includes providing strategic direction for developing the Institute’s research base through partnerships with the research community in public and commercial organisations, initiating knowledge exchange opportunities on behalf of the faculty, and overseeing product/system design, build and delivery by the technical team.