What kinds of questions will big data allow us to ask and answer? How can we ensure this material is collected and preserved in such a way that it meets the requirements of humanities researchers? These are two of the questions addressed by a 12-month ‘Born-digital big data’ international research project led by Jane Winters, professor of digital humanities at the School of Advanced Study.

In recent years we have all become familiar with the notion of information overload, the digital deluge, the information explosion, and numerous variations on this idea. At the heart of this phenomenon is the growth of born-digital big data, a term which encompasses everything from aggregated tweets and Facebook posts to government emails, from the live and archived web to data generated by wearable and household technology (the Internet of Things).

The challenges of collecting, describing, preserving, publishing and analysing this data are huge. It is enormously rich, but it is also complex, messy and vast. For example, a single snapshot of the UK web (all websites with a URL ending in .uk) taken by the British Library in 2014 consists of 2.5 billion web pages and other digital assets.

That snapshot contains text, audio, video, and even 4.7GB of viruses. In June 2017, Facebook had more than 2.01 billion monthly active users, posting multiple types of information. Concrete figures are difficult to find, but an estimated 6,000 tweets are sent every second on average. A report published by The National Archives of the UK (TNA) in 2016 noted the existence of an email server in just one government department that contained half a billion emails.

Even if a particular social media platform fades from popularity over time, these figures are only going to grow. We cannot afford to wait to start collecting this vital information: memory institutions have statutory responsibilities to preserve cultural heritage and the public record, whether digital or not; and researchers will have to get to grips with this new type of primary source if they are to understand the history of the late 20th and early 21st centuries.

In order to begin to answer some of the questions posed by working with big data, and born-digital archives of various kinds, the School of Advanced Study partnered with King’s College London, the universities of Cambridge and Sussex in the UK and Waterloo in Canada, The National Archives of the UK and the British Library in a new research network.

With funding from the Arts and Humanities Research Council, ‘Born-digital big data and approaches for arts and humanities research’ ran three workshops. These brought together researchers and practitioners from different sectors (universities, libraries and archives) and different disciplines (history, linguistics, media and communication studies, computer science, library and information studies, and digital humanities) to identify the key issues for organisations and individuals concerned with facilitating and undertaking research in the arts and humanities.

Topics of discussion ranged from web archives to the role of the archival catalogue in discovery, from personal digital archives to the document management systems used by business and government, from ethics to statistics.

There was, of course, only so much ground that could be covered in a year, and inevitably one of the main findings of the network was that we are still far from identifying all of the problems, let alone proposing the solutions.

But one question that did emerge very clearly from all three workshops is how we can ensure that the voices of individuals, and particularly of the under-represented and marginalised, are preserved and amplified in the huge born-digital datasets that are being collected by organisations like TNA, the British Library and many smaller libraries and archives around the country and across the world.

As big data research focuses on trends and patterns, it is important for humanities researchers to continue to capture, and tell stories about, ordinary individuals and ordinary lives. The web and social media in particular have provided unprecedented opportunities for people to act as creators and publishers of information, and to interact with a range of corporate and public bodies in more or less public forums, but these contributions risk being overwhelmed by volume, becoming simply data points in a massive digital archive.

This is the other side of anxiety about privacy and anonymity in a digital environment, about ensuring that individuals cannot readily be identified from apparently anonymised datasets. Striking a balance between these two concerns will be of central importance for archives and researchers now and in the future, and the collaborations built by the ‘Born-digital big data’ network will help negotiate a practical and ethical path.

Professor Jane Winters is professor of digital humanities at the School of Advanced Study, University of London. She has led or co-directed a range of digital projects, including Big UK Domain Data for the Arts and Humanities; Digging into Linked Parliamentary Metadata; Traces through Time: Prosopography in Practice across Big Data; and the Thesaurus of British and Irish History as SKOS.