
‘Given the fragility of web data, and the increasing difficulty of accessing it freely, how can we establish online evidence as facts, historical evidence or truth?’ asks Dr Anat Ben-David, a senior lecturer at the Open University of Israel. She is taking part in this month’s Web Archiving Week, organised by the School of Advanced Study and the British Library.

Facts matter. At a time when the very notion of truth is under attack, when scientific knowledge is under scrutiny and critical journalism is dismissed as ‘fake news’, facts matter even more. But this age, recently dubbed the age of ‘post-truth’, is also a digital one.

And while digital facts matter, born-digital facts are immaterial, fragile and easy to manipulate. The content of websites can be easily modified, tweets are frequently deleted, the number of Facebook comments and likes can be artificially boosted through click farms, and dubious sources spreading misinformation can be disguised as reliable news organisations.

The deterioration in the trustworthiness of the web as a source of knowledge is also tied to its gradual commercialisation. More than ever before, web data is primarily proprietary, and therefore subject to platforms’ policies and constraints. Accessing raw data, with its full metadata, is becoming nearly impossible.


Luckily, the Internet Archive (the digital equivalent of the ancient Library of Alexandria) is one of the last non-commercial knowledge devices that can be used to establish historical facts from web data. It captures snapshots of websites at a specific point in time and preserves them for eternity. Through the Internet Archive’s Wayback Machine, one can browse the web’s past and view pages that have since been changed, removed or deleted. Archived snapshots of websites serve as historical, born-digital facts. For example, the Internet Archive recently created a crowd-sourced collection that preserved federal government websites and data during the change of administration in the United States.
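To give a sense of how such snapshots are retrieved in practice, here is a minimal sketch (using only Python’s standard library) that queries the Internet Archive’s public availability API for the capture closest to a given date. The example URL and timestamp are arbitrary illustrations, not part of the study described here.

```python
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str, timestamp: str) -> dict:
    """Ask the Wayback Machine's availability API for the archived capture
    of `url` closest to `timestamp` (YYYYMMDD)."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
        return json.load(resp).get("archived_snapshots", {})

# Example: the capture of the British Library homepage closest to 1 January 2010
print(closest_snapshot("bl.uk", "20100101"))
```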

But apart from the technical knowledge of how the Internet Archive captures websites, what do we know about the knowledge production process that constructs archived snapshots as facts? How does the Internet Archive ‘know’ what to capture, and at what frequency?

The research project that my colleague Adam and I will present at the upcoming RESAW conference during Web Archiving Week attempts to answer these questions. We unravel the complex socio-technical source contribution process behind what we eventually perceive as archived snapshots on the Internet Archive’s Wayback Machine. As a case study, we focus on the Wayback Machine’s rare snapshots of North Korean websites.

Although the .kp domain was delegated to North Korea in 2007, until recently little was known about its websites due to the country’s restrictive internet policies. On 20 September 2016, an error in the configuration of North Korea’s name servers allowed the world a rare glimpse of 28 websites hosted in the .kp domain. However, we discovered that the Wayback Machine displays captures of these websites dating as far back as 2010.

How did the Internet Archive come to ‘know’ about the existence of the North Korean websites years before the DNS leak? To answer this question, we undertake a ‘forensic approach’ to web archive research. Focusing on the Wayback Machine’s new provenance information feature, we analyse the history of the sources that ‘informed’ the Internet Archive about the existence of the North Korean websites over time, and cross-reference this information with an internet censorship analysis that compares access to North Korean websites from different countries.
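As a rough illustration of the kind of capture history such a forensic reading starts from, the sketch below lists capture timestamps and HTTP status codes for a site via the Wayback Machine’s public CDX search API. The domain chosen, the fields requested and the limit are illustrative assumptions, not the study’s actual data or method.

```python
import json
import urllib.parse
import urllib.request

def capture_history(url: str, limit: int = 20) -> list:
    """Return (timestamp, original URL, HTTP status) rows for archived
    captures of `url`, via the Wayback Machine's CDX search API."""
    query = urllib.parse.urlencode({
        "url": url,
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "limit": str(limit),
    })
    with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + query) as resp:
        body = resp.read().decode("utf-8").strip()
    rows = json.loads(body) if body else []
    return rows[1:] if rows else []  # first row is the field header

# Example: early captures of a North Korean portal (illustrative domain choice)
for ts, original, status in capture_history("naenara.com.kp"):
    print(ts, status, original)
```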

Our analysis shows that although most of the websites have been contributed to the Internet Archive’s crawler by experts, activists and ‘Wikipedians’, the Internet Archive’s combination of a distributed and automated crawling system with distributed source contributions results in a crowd-sourced knowledge culture that circumvents internet censorship.

As a knowledge device, the Wayback Machine is a ‘leaky archive’, in the sense that it accumulates scattered contributions over time, eventually generating knowledge that is otherwise accessible only in times of DNS ‘leaks’. However, although the archive is globally informed, the fact that the Internet Archive’s crawlers capture the North Korean websites from servers physically located in the United States greatly affects the extent to which certain URLs can be archived, and also determines the scope of their archival coverage. Put differently, the Wayback Machine not only hosts successful captures of North Korean websites from as early as 2010, it also bears evidence of failed attempts to access these websites from the US.

This complicated history, combining global source contribution with the geopolitics that shape the web’s information flows, helps us to better understand the politics, the knowledge cultures and the constraints that shape the Wayback Machine’s contents. A deeper understanding of the Wayback Machine as a knowledge device only strengthens its importance as one of the web’s last reliable repositories of born-digital historical facts.

Dr Anat Ben-David is a senior lecturer in the department of Sociology, Political Science and Communication at the Open University of Israel. She is co-founder of the Open University’s Open Media and Information Lab (OMILab). Her research focuses on new media, and her primary research interests are the history and geopolitics of the web, digital STS, social and political studies of social media, and digital and computational methods for web research.

Web Archiving Week will be held in London at Senate House, the headquarters of the University of London, and at the British Library, 12–16 June. It comprises three events: an Archives Unleashed datathon (12–13 June) at the British Library Knowledge Centre; a three-day conference, ‘Researchers, practitioners and their use of the archived web’ (14–16 June), at Senate House and the British Library Knowledge Centre; and a public event, ‘Web Archives: truth, lies and politics in the 21st century’ (evening of 14 June), at the British Library.