Michael Pidd, director of the University of Sheffield’s Digital Humanities Institute, explains how, while trying to understand the evolution of early modern thought by modelling the semantic and conceptual changes which occurred in English printed discourse between c.1500 and c.1800, they created a powerfully transparent tool for exploring texts in meaningful ways.
Funded by the Arts and Humanities Research Council (AHRC), the Linguistic DNA project, a collaboration between the universities of Sheffield, Glasgow and Sussex, and the Digital Humanities Institute, aimed to understand the evolution of early modern thought by modelling the semantic and conceptual changes that occurred in English printed discourse between c.1500 and c.1800. While developing algorithms that could do this modelling computationally, we also engineered a new way to retrieve information based on meaning and context rather than simple keywords. We call this process ‘concept-modelling’.
What is a concept?
A concept such as democracy, for example, can be inferred in natural language data from a range of co-occurring lexical terms, such as freedom, election, government, suffrage. We model concepts as co-occurring trios (sets of three terms that actually co-occur across a single span of text) which may be adjacent to each other, or separated by up to 50 tokens. That is, they may co-occur outside the level of the phrase or clause, across relatively large spans of discourse.
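The windowed co-occurrence described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code: the function name `extract_trios`, the `max_gap` parameter and the reading of "separated by up to 50 tokens" as "first and last term at most 50 token positions apart" are all assumptions, and the sketch leaves out steps such as lemmatisation and frequency filtering.

```python
from collections import Counter
from itertools import combinations


def extract_trios(tokens, max_gap=50):
    """Count unordered trios of distinct terms whose first and last
    occurrence lie at most `max_gap` token positions apart.

    Hypothetical helper: the name and the windowing rule are assumptions.
    """
    counts = Counter()
    for i, first in enumerate(tokens):
        # Tokens that follow `first` and still fall inside the window.
        window = tokens[i + 1 : i + 1 + max_gap]
        for second, third in combinations(window, 2):
            trio = frozenset((first, second, third))
            if len(trio) == 3:  # skip trios that repeat a term
                counts[tuple(sorted(trio))] += 1
    return counts


tokens = "democracy requires free election and universal suffrage".split()
trios = extract_trios(tokens, max_gap=50)
print(trios[("democracy", "election", "suffrage")])
```

Anchoring each trio at its first token means every combination of positions is counted once. The cost grows with the square of the window size for every token, which gives some sense of why a corpus-scale run is computationally heavy.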
Trios that co-occur in spans of text and pass specific statistical thresholds can be conceptually related. For example, repeated co-occurrence of democracy, election, and suffrage would suggest particular characteristics of the concept of democracy. Additionally, the trios that include democracy will vary over time (diachronically) as the concept of democracy evolves, and within a given period (synchronically), because different authors will have different notions of what democracy is.
The trios democracy-freedom-election, democracy-war-fascism, and democracy-Athens-history represent important conceptual differences. This is far more nuanced than ‘topic modelling’ and similar methodologies, and more transparent and user-friendly than distributional semantic methods like Word2Vec. Unlike previous approaches, a user can view each trio in context, reading each example in its original sentence or paragraph, which makes it a powerfully transparent tool for exploring texts in meaningful ways.
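One way to picture the diachronic view is to group the trios that contain a given term by period and keep only those that recur. The sketch below assumes trios have already been extracted and dated. The function `trios_by_period`, the 50-year bins and the simple `min_count` cut-off are illustrative stand-ins, since the statistical thresholds actually used are not specified here, and the observations are invented toy data.

```python
from collections import Counter, defaultdict


def trios_by_period(observations, term, period=50, min_count=2):
    """Group trios containing `term` into fixed-width periods, keeping
    only those seen at least `min_count` times within a period.

    `min_count` is a placeholder for a real statistical threshold.
    """
    counts = defaultdict(Counter)
    for year, trio in observations:
        if term in trio:
            start = (year // period) * period  # e.g. 1651 -> 1650
            counts[start][tuple(sorted(trio))] += 1
    return {
        start: {trio: n for trio, n in counter.items() if n >= min_count}
        for start, counter in sorted(counts.items())
    }


# Invented (year, trio) observations, for illustration only.
observations = [
    (1651, ("democracy", "athens", "history")),
    (1660, ("democracy", "history", "athens")),
    (1792, ("democracy", "freedom", "election")),
    (1793, ("election", "democracy", "freedom")),
    (1795, ("democracy", "war", "tyranny")),
    (1795, ("king", "crown", "throne")),
]
profile = trios_by_period(observations, "democracy")
for start, kept in profile.items():
    print(start, kept)
```

Comparing the dictionaries for different periods shows how the company a term keeps shifts over time. Comparing authors within one period would follow the same pattern with an author key in place of the year.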
The concept-modelling process
Identifying the lexical characteristics of concepts, and how they change over time, required us to develop an unsupervised and entirely data-driven computational process capable of analysing the entire Early English Books Online Text Creation Partnership (EEBO-TCP) corpus of 58,000 texts (approximately 5 million pages). Taking each word and comparing it against every other word required seven days of continuous processing on a standard PC with eight virtual cores. We used distributed processing, specifically Amazon Web Services (AWS), to bring the computation time down to two hours. The output was 15 billion rows of trio data – 15 billion individual trios such as democracy-freedom-election and democracy-war-fascism – including all trios for all medium- or high-frequency words. We are now able to interrogate our trio data further using search and data visualisation techniques such as cluster analysis. (See public prototype).
In pure research terms, we have evolved the process to identify quads, sets of four terms that actually co-occur across a single span of text, which gives us a richer understanding of the lexical characteristics of concepts, and we intend to begin modelling the relationships between quads (ie how two or more quads relate to one another).
Journals and YouTube
We subsequently applied the concept-modelling process to journal articles in the social sciences in order to determine the scope and characteristics of research concerned with technology and digital society. Led by the University of Liverpool, the project was commissioned by the ESRC to help inform its strategy for grant giving. We then applied the concept-modelling process to approximately 6 million comments related to militarisation videos on YouTube (eg videos promoting weapons procurement and videos about military-based video games). Funded by the Swedish Research Council and led in the UK by the University of Leeds, the project aimed to understand discourses around militarisation in the absence of a ‘comments search’ on the YouTube website.
Whereas the Linguistic DNA project had always been about linguistics, these two subsequent projects used the concept-modelling process for discovery purposes. They demonstrated a wider value for our methods, which was reinforced by consultation with potential stakeholders such as the BBC, the Oxford English Dictionary, ProQuest, Adam Matthew Digital, The National Archives, Wellcome and The British Library. The remainder of this blog post briefly introduces two possible applications of concept-modelling: semantic search and automatic metadata generation.
Application #1: semantic search
Conventional search is keyword based, whereby the system looks for literal string matches against the search term. When searching very large text corpora this is an imprecise method. A search for the word air will return documents about the sky, oxygen, breath, melodies and outward appearance.
Concept-modelling facilitates a more semantically informed approach to searching. Instead of a system looking for the search term in the corpus of documents, it looks for the term in our index of trio or quad data, then presents the end-user with matching trios or quads. The user is then able to select the trio or quad which most closely matches her or his intended usage – such as air-water-earth, air-music-instrument, or air-face-demeanour – and the system returns the specific spans of documents in which this trio appears.
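The two-step lookup described above (term to candidate trios, then chosen trio to spans) can be sketched with a small in-memory index. The class `TrioIndex`, its method names, and the document identifiers and offsets are all hypothetical. A real index over billions of rows would live in a database or search engine rather than in Python dictionaries.

```python
from collections import defaultdict


class TrioIndex:
    """Toy index: term -> trios containing it, and trio -> text spans."""

    def __init__(self):
        self._by_term = defaultdict(set)
        self._spans = defaultdict(list)

    def add(self, trio, doc_id, start, end):
        """Record that `trio` occurs in `doc_id` between two token offsets."""
        key = tuple(sorted(trio))
        for term in key:
            self._by_term[term].add(key)
        self._spans[key].append((doc_id, start, end))

    def trios_for(self, term):
        """Step 1: the trios offered to the user for a search term."""
        return sorted(self._by_term.get(term, ()))

    def spans_for(self, trio):
        """Step 2: the spans in which the chosen trio appears."""
        return list(self._spans.get(tuple(sorted(trio)), ()))


index = TrioIndex()
# Invented document identifiers and offsets, for illustration only.
index.add(("air", "water", "earth"), "doc1", 120, 165)
index.add(("air", "music", "instrument"), "doc2", 10, 40)
index.add(("air", "face", "demeanour"), "doc3", 300, 330)

print(index.trios_for("air"))
print(index.spans_for(("air", "music", "instrument")))
```

The user never searches the documents directly: the choice of trio disambiguates the sense of air before any text is retrieved.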
In addition to user searches, our trio data can be used as a recommendation engine – it can identify the strongest trios in the text that a user is currently reading, and recommend other texts with similar strong trios. Such a recommendation engine can lead users to texts they might never have otherwise found, particularly in very large archives where many texts are almost never accessed.
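A recommendation step of this kind could, for example, compare the strongest trios of the current text with those of every other text. The sketch below is one possible approach, not the project's implementation: the per-trio scores, the top-`k` cut-off and the use of Jaccard overlap as the similarity measure are all assumptions, and the corpus is invented toy data.

```python
def strongest(trio_scores, k=3):
    """Return the k highest-scoring trios of one text as a set."""
    ranked = sorted(trio_scores.items(), key=lambda item: (-item[1], item[0]))
    return {trio for trio, _ in ranked[:k]}


def recommend(current_id, corpus, k=3, limit=5):
    """Rank other texts by the Jaccard overlap of their strongest trios."""
    current = strongest(corpus[current_id], k)
    results = []
    for doc_id, scores in corpus.items():
        if doc_id == current_id:
            continue
        other = strongest(scores, k)
        union = current | other
        overlap = len(current & other) / len(union) if union else 0.0
        if overlap > 0:
            results.append((doc_id, overlap))
    return sorted(results, key=lambda r: (-r[1], r[0]))[:limit]


# Invented trio scores per text, for illustration only.
corpus = {
    "A": {("air", "earth", "water"): 9, ("air", "earth", "fire"): 7,
          ("moon", "star", "sun"): 2},
    "B": {("air", "earth", "water"): 8, ("air", "earth", "fire"): 5},
    "C": {("air", "instrument", "music"): 6},
    "D": {("air", "earth", "water"): 4, ("lute", "song", "voice"): 3},
}
recommendations = recommend("A", corpus)
print(recommendations)
```

Text C also mentions air, but in a musical sense, so it shares no trio with A and is not recommended. A keyword-based recommender would not make that distinction.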
Two separate proof-of-concept projects with the Oxford English Dictionary (OED) and the BBC are exploring this approach. The OED project is developing an interface to enable its editors to identify, more quickly and accurately, typical example sentences for each detailed sub-sense of a word, drawn from EEBO-TCP, which can be used in new and revised word definitions.
The BBC project is developing an index of quads from its Radio News Scripts collection of around 180,000 scripts / 2.3 million pages dating from 1940–90. It will then experiment with using the index to underpin new forms of search and content visualisation (eg navigating an archive conceptually).
Application #2: automatic metadata generation
Concept-modelling has an application within visual, audio, and multimodal collections where there is a desire by content owners to make media objects searchable for end-users but an absence of quality metadata on which to search. For public archives in particular, creating descriptive metadata is too time-consuming and expensive. By concept-modelling a corpus of text discourse which is about the works in the collection (a corpus of art history texts if the collection is artwork; a corpus of film or music reviews if the collection is film or music etc), we are able to generate quads which can then be associated with relevant media objects.
Further, our quads will produce conceptual relationships between media objects which would not necessarily occur when descriptive metadata is created manually using predefined schema, ontologies and taxonomies, because the concepts are derived from discourses which are frequently concerned with aesthetics, interpretations, influences, and relationships between objects.
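How quads come to be associated with media objects is not spelled out here, so the following is only one conceivable linking rule: attach a quad to an object whenever the object's title is mentioned in the span of discourse from which the quad was drawn. The function `attach_quads`, the object identifiers and the example sentences are all hypothetical.

```python
from collections import defaultdict


def attach_quads(quad_spans, titles):
    """Link each quad to every media object whose title is mentioned in
    the span of text the quad was drawn from (an assumed linking rule)."""
    metadata = defaultdict(set)
    for quad, span_text in quad_spans:
        lowered = span_text.lower()
        for object_id, title in titles.items():
            if title.lower() in lowered:
                metadata[object_id].add(tuple(sorted(quad)))
    return dict(metadata)


# Hypothetical collection records and invented art-history sentences.
titles = {"obj-1": "The Hay Wain", "obj-2": "The Fighting Temeraire"}
quad_spans = [
    (("landscape", "light", "river", "rural"),
     "In The Hay Wain the river and rural landscape are bathed in light."),
    (("sail", "ship", "steam", "sunset"),
     "The Fighting Temeraire shows an old ship of sail towed by steam at sunset."),
    (("colour", "light", "nature", "painter"),
     "Both The Hay Wain and The Fighting Temeraire show a painter using "
     "light and colour to read nature."),
]
metadata = attach_quads(quad_spans, titles)
shared = metadata["obj-1"] & metadata["obj-2"]
print(shared)
```

The quad shared by both objects illustrates the kind of conceptual relationship between media objects that a predefined schema would not necessarily record.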
Applying concept-modelling in practice
In practice, we envisage both the above use cases as a service model, similar to the BBC project. A content owner provides the Digital Humanities Institute with a full-text corpus, we process the corpus to extract concept-models in the form of quads, and then return to the content owner the index of quads for them to use in any way they wish to enhance discovery of their collection.
Michael Pidd has more than 20 years of experience in developing, managing and delivering large collaborative research projects in the humanities and heritage subject domains. He is or has been principal investigator on Connecting Shakespeare (HEIF), Dewdrop (Jisc), Reinventing Local Public Libraries (HEIF), and Manuscripts Online (Jisc), and co-investigator on Intoxicants and Early Modernity (ESRC/AHRC) and Linguistic DNA (AHRC).