Chapter 1: The Environmental Scan as a method for digital source criticism at scale
Kaspar Beelen, Ruth Ahnert, Jon Lawrence, Katie McDonough, Daniel van Strien and Daniel Wilson
1.1 Introduction
The digital age has brought about a radical process of dematerialisation and, with it, decontextualisation. When our cultural artefacts are transformed into digital data, they often lose the context of their original creation. Until recently, many music streaming services stripped away contextualising information that had previously accompanied the physical record or CD. Classical music was particularly poorly served, even on platforms that offered lossless, hi-fi quality. Tracks might be labelled (and therefore discoverable) only by the performing artist or orchestra, rather than by their original composer (e.g. J.S. Bach) or by their title (e.g. Goldberg Variations).[1] And with tracks in any genre, information about who played what instruments, or about vocalists, session musicians, recording and release dates, or studio locations, is still rarely preserved, even though it was once easily to hand when the medium was a physical CD or record. Perhaps one day, the often meandering original sleeve notes will be digitised in their entirety and made available as we listen. Providers may even go further and use AI to help structure the original information so that we can again see at a glance who sang that harmony or played that saxophone solo. Such loss may be little more than an inconvenience to the casual listener, but to a student of the history of music it represents a barrier that makes serious research impossible. Sadly, an equivalent loss of context bedevils almost all digitised collections.
Such contextualising information is known as ‘metadata’ and is critical for working with digitised collections of sources and documents. The loss of original metadata, which can situate documents in their proper context – in relation to their provenance as well as their social meaning – poses grave risks to users of digitised historical collections. As Lara Putnam has observed, while rapid digitisation has transformed access to historical sources via web access and keyword search, it has at the same time greatly diminished the importance of site-specific expertise and domain knowledge as a necessary precursor to making discoveries.[2] Just as the online listener risks being unable to know the who, when and where of their music stream, so also the scholar – whether using simple keyword searches or complex computational analysis – risks drawing the wrong, or at least radically decontextualised, conclusions from their queries.
Performing a keyword search is very different from manually browsing library shelves or an archival collection. Of course, both can be shaped by confirmation bias – we tend to find what we are looking for – but keyword search does not facilitate the chance discoveries that so often come when we flick through a physical item or browse the shelves around a volume. As a result, the sense we might have of the broader context of our findings is likely to be radically attenuated.
For example, Beelen recalls observing this phenomenon in action when, more than a decade ago, tutors switched to using the recently digitised Belgian Parliamentary proceedings for a political history seminar.[3] They noticed that although searching was easier for the students, understanding the material was not. Using analogue search, the processes required to find relevant information functioned to equip students with crucial background knowledge, including source context, to help them critically interpret their findings. The students’ analogue searches involved perusing reference materials, cross-referencing these with unwieldy printed indices. Though challenging for many, these practices provided a framework and context to make sense of the information eventually uncovered. By contrast, when students used a search box to access their digitised sources, they might quickly stumble upon multiple fragments relevant to their topic, but each document existed in a historical and archival vacuum.[4] Although information was easier to access, making sense of the results was more difficult. The difficulty was compounded by the scale of information (they often felt overwhelmed by search results), as well as its quality (poor OCR meant some key sources had become invisible to their searches). This raised additional questions about what students did and did not find when navigating digitised collections.
The widespread roll-out of digitised sources for humanities teaching and research has been widely, and rightly, welcomed, but critical voices have begun to raise concerns about the distorting consequences of how scholars use digitised sources. Critics worry that the ease of accessing these materials may inadvertently skew research priorities and influence academic arguments in unexpected ways.[5] There can be no simple, universal answer to this problem given the diversity of digital sources and the myriad research questions that can be addressed when using them. Different solutions will be required for different contexts, but there are some core principles that can help tackle the unintended distortions that can creep into digital methods if they are used uncritically. In this chapter, as well as adding important context to digitised collections, we set out such principles, placing computational methods on a more robust footing by scaling up the critical methods central to the ‘close reading’ of sources so that they are also available to humanities scholars engaged in ‘distant reading’.[6]
From the outset, the Living with Machines (LwM) team was determined to tackle these issues by addressing a fundamental question: how can humanities scholars’ core techniques of source criticism be adapted for use at the much larger scale made possible by the digital revolution?[7] Faced with a report about, for instance, an election disturbance taken from a Victorian newspaper, the first thing a historian might ask is: what were the politics of this paper, where did it circulate, and what readership was it aimed at? Often, the answers to such questions would be part of the historian’s implicit knowledge, built up cumulatively from working intensively with a relatively small cluster of sources. But if the source was unfamiliar, they could consult a contemporary reference work to find out more. In the case of provincial newspapers, this would usually mean turning to one of the various annual ‘press directories’ published from the late 1840s as a guide to advertisers and the publishing industry.[8] On LwM, the question we asked was: could something analogous be done ‘at scale’, i.e. for an entire collection of digitised newspapers, so that one could both integrate this sort of information into conventional digital searches (as ‘enhanced metadata’) and apply it to the collection as a whole to understand its biases and representativeness? This was how the concept of the ‘digital Environmental Scan’ was born.[9] As the following narrative will show, while the principles at the heart of this methodology were developed first through our research on nineteenth-century newspapers, they are pertinent to any digital collection for which good sources of supplementary reference metadata can be identified, digitised, and processed.
To understand the need for such a digital Environmental Scan we must recognise that data, regardless of its scale or granularity, is always incomplete. Rob Kitchin describes data as ‘oligoptic’ rather than ‘panoptic’, challenging the notion – often prevalent in the sciences, but also sometimes in the humanities – that big data necessarily offers comprehensive and transparent views.[10] The massive increase in data collection during the 2000s led to bold claims of paradigm shifts and a new empiricism, the idea that instead of developing theories and conceptual frameworks, researchers need only find efficient ways to query big data for answers.[11] While such views have been widely criticised among humanities scholars and theorists, they have by no means disappeared. Researchers in computer science and AI sometimes need reminding that bigger data isn’t necessarily better data, as issues of bias continue to surface in their models and applications. This is what Kitchin means when he argues that ‘all data provide oligoptic views of the world: views from certain vantage points, using particular tools, rather than an all-seeing, infallible God’s eye view.’[12]
Most historians would typically agree with this assessment: working at the individual document-level, they construct and refine their arguments by contextualising sources and determining what perspective these offer on the past. However, when faced with vast collections comprising millions of pages, or thousands of map sheets, historians lack the instruments to scale up their source criticism even though, when pressed, few would expect these resources to be neutral or impartial. Hence the key question is: how can we articulate this critical perspective when using extensive historical collections at scale, including interrogating what might be missing from our data?
The Environmental Scan is our response to this challenge: it began as a method for undertaking the kind of source criticism at scale described above. It was developed to work with the digitised newspaper collections created through partnerships with the British Library over the last couple of decades (the JISC digitised newspapers and the British Newspaper Archive). This work developed into both a series of precise technical workflows for this particular use case and a set of broader principles that could be applied to other sources and collections. In the following pages we seek to illustrate those principles with the more specific technical steps and use cases from our work on the Living with Machines (LwM) project. As well as discussing the process of undertaking an Environmental Scan of the newspaper corpora, we show how we extrapolated the principles developed on this data to other digitised collections; namely, the nineteenth-century Ordnance Survey maps digitised by the National Library of Scotland, and a printed book corpus digitised through a partnership between the British Library and Microsoft. These other examples show how the principles can be generalised to very different collection types, if contextualising sources can be found. In sharing an account of this research we have two aims. Firstly, we believe that the Environmental Scans that we undertook for these three collections represent important foundational work. These sources are vital for the study of the nation’s past, and therefore a rigorous understanding of the true contours of these sources in their digital forms is of paramount importance. We hope that by drawing attention to the methods and datasets we have generated for these particular use cases, people will deploy them, and build upon them, in their own research.
Secondly, we wish to share a kind of manifesto outlining the broader values and principles that should be at the centre of any Environmental Scan. By drawing these out we hope that people can adapt them for their own research. We see a variety of different potential audiences and beneficiaries. We believe that the kind of work we are advocating here complements the existing work being done by GLAM organisations, and may offer curators the ammunition they need for digitising reference works, releasing metadata, or undertaking crowdsourcing projects that would facilitate the kind of analysis we discuss in the following pages. For other researchers already working in the space of computational humanities, or entering into it, we hope that they will begin to see this kind of source criticism as the precondition of rigorous analysis of the contents of digital collections. This is foundational work that must be done before more focused research questions are tackled, for without a sense of the shape of our data, how can we draw sure conclusions? In advocating for the necessity of such work, we also hope to add our voice to the more pressing debate that is playing out around the development of AI technologies. While our examples all derive from digitised heritage collections, the arguments are equally pertinent to the training data fed into AI pipelines – and the stakes, of course, are much higher.[13] Indeed, we would contend that digital historical archives present a valuable testbed for illustrating the biases that are baked into any dataset, and the ways in which we must explicitly address those biases when designing any research with, or uses of, such material. Here is an opportunity for the tech industry to learn from digital historians and curators.
1.2 Framing Principles: the Deep and Wide Scans
The term ‘machines’ in the ‘Living with Machines’ project’s title carries an intentionally ambiguous meaning. It refers not only to the machines that emerged as part of the Industrial Revolution, but also to how today’s historians might engage with novel research instruments being developed by data scientists, including machine learning and other forms of artificial intelligence.
We view the digital Environmental Scan as a practical example of this approach. It offers historians innovative tools for critically engaging with large-scale heritage data. Regardless of size, data are always rooted in specific contexts and shaped by particular factors. However, when faced with vast amounts of digital data, these histories and contexts often become obscured, leaving researchers with a somewhat indecipherable agglomeration of information. When we query these datasets, such as through keyword searches, they will invariably yield some results – but often without providing context about the source or, perhaps more importantly, about what might be missing. We cannot see what hasn’t been digitised, or know what portion of the collection is unsearchable because of mistranscription. This might sound self-evident but it still needs to be kept in mind when working with digitised historical collections. The problems the digital Environmental Scan addresses are often acknowledged but rarely investigated, generally because historians (digital or otherwise) lack the tools and methods to do so. The primary goal of the Environmental Scan is therefore to provide a framework and set of methods to contextualise large-scale digitised historical data. It involves two key approaches, which we might think of as the framing principles of the Environmental Scan: the ‘deep scan’ and the ‘wide scan’.
The principle of the ‘deep scan’ is to uncover the hidden or latent dimensions that structure a dataset. Practically, this means enhancing or recovering information about the provenance of component sources bundled in a digital collection, including, where possible, their ‘perspective’. This approach acknowledges that while digital collections often include basic metadata, such as publication and/or digitisation date, they typically lack essential contextual information. For example, in digital newspaper collections, we may have access to the content but not to details about who originally created these materials, or important clues as to who may have consumed them (e.g. price, declared politics, etc.). The deep scan seeks to fill this gap by automatically extracting metadata from historical reference resources and linking it to the digitised collection, thus offering richer descriptions and more detailed profiles for each data point. One of the provocations that Lauren Klein and her co-authors made in their 2025 paper, ‘Provocations from the Humanities for Generative AI Research’, was that:
there is always a perspective that is shaped by training data, annotations, parameters, and prompts. Rather than chasing a “representative” dataset—which, of course, can never be achieved—we would be better served by acknowledging that there are always perspectives encoded in our models, and by taking the time to document and name them.[14]
Our deep scan is a principle and set of methodologies for precisely documenting and naming these hidden perspectives. By enriching collections with additional layers of metadata, the deep scan articulates the social, political, and historical dimensions of digital objects that otherwise risk remaining opaque and shapeless. Essentially, it aims to encode context informed by the user’s research question.
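To make this concrete, consider what the enrichment step might look like in code. The sketch below, written in Python with pandas, joins a hypothetical table of digitised newspaper titles to a hypothetical table of attributes parsed from a press directory, after lightly normalising titles and place names so that they can be matched. The column names and values are illustrative only, not the project’s actual schema, and real record linkage requires far more careful fuzzy matching than is shown here.

```python
import pandas as pd

# Hypothetical tables: column names and values are illustrative only.
# 'collection' lists digitised titles; 'directory' holds attributes parsed from a press directory.
collection = pd.DataFrame({
    "title_id": ["bl_0001"],            # invented identifier for a digitised title
    "title": ["The Chatham News"],
    "place": ["Chatham"],
})
directory = pd.DataFrame({
    "title": ["Chatham News"],
    "place": ["Chatham"],
    "politics": ["Independent"],
    "price": ["1d"],                    # placeholder value, not transcribed from a directory
    "frequency": ["Weekly"],            # placeholder value
})

def normalise(s: pd.Series) -> pd.Series:
    """Lower-case, strip leading articles and punctuation so titles and places can be compared."""
    return (s.str.lower()
             .str.replace(r"^(the|a)\s+", "", regex=True)
             .str.replace(r"[^\w\s]", "", regex=True)
             .str.strip())

for df in (collection, directory):
    df["title_key"] = normalise(df["title"])
    df["place_key"] = normalise(df["place"])

# The 'deep scan' enrichment step: attach the directory's description to each digitised title.
enriched = collection.merge(
    directory, on=["title_key", "place_key"], how="left", suffixes=("", "_dir")
)
print(enriched[["title_id", "title", "politics", "price", "frequency"]])
```

The result is the kind of ‘enhanced metadata’ described above: each digitised title carries with it the politics, price and frequency recorded by a contemporary observer.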
The ‘wide scan’ complements the deep scan by examining how digitised materials fit within a wider ‘landscape’ of information. We use this metaphor of a ‘landscape’ to refer to the entirety of a particular source type, for instance all newspaper issues known to have existed within a specific spatial and temporal context. The wide scan focuses on assessing the characteristics of a digital archive or collection with respect to those of the wider population from which it is sampled, and identifying any significant biases. This approach is crucial because digitised objects often comprise only a small portion of the broader landscape of un-digitised documents (including those never archived or now lost). The wide scan encourages researchers to consider the composition and biases of digital collections, however large those collections may be. It suggests that collections can be greatly enhanced not only by providing richer metadata (the deep scan), but also by providing detailed information on content selection and biases in composition (the wide scan).
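The wide scan can likewise be expressed very simply once both the digitised sample and the wider landscape have been tabulated. The following sketch, again in Python with invented data, measures what proportion of the titles listed in a reference work appears in a digitised collection, broken down by county and by declared politics; any real analysis would of course work with full directory-derived tables rather than these toy values.

```python
import pandas as pd

# Invented data: every title a reference work lists for two counties (the 'landscape'),
# and the subset of those titles present in a digitised collection.
landscape = pd.DataFrame({
    "title":    ["A", "B", "C", "D", "E", "F"],
    "county":   ["Kent", "Kent", "Kent", "Lancashire", "Lancashire", "Lancashire"],
    "politics": ["Liberal", "Conservative", "Independent",
                 "Liberal", "Liberal", "Conservative"],
})
digitised_titles = {"A", "C", "D"}

landscape["digitised"] = landscape["title"].isin(digitised_titles)

# Coverage per county: what share of the known press landscape the digital sample represents.
print(landscape.groupby("county")["digitised"].mean())

# The same comparison along a 'perspective' dimension such as declared politics shows
# whether the sample over- or under-represents particular viewpoints.
print(landscape.groupby("politics")["digitised"].mean())
```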
By applying the principles of deep and wide scans to digitised collections, researchers can achieve the goals of large-scale source criticism. This approach transforms the way we critically examine digital archives and opens up new research possibilities in the humanities. It facilitates a more nuanced understanding of the limitations and potential of digital collections, allowing researchers to draw more accurate and contextually informed conclusions from their analyses. This involves not only creating new metadata fields but also describing ‘absent’ data. A significant part of our work relies on what we call ‘historical reference metadata’, which consists of data descriptions sourced from historical reference works.
In the following sections we lay out the more precise steps and methods needed to implement these framing principles, and how these vary with different data types and specific collections. These methodologies enable a more robust and reliable examination of digital collections and are a crucial step toward making large-scale source criticism a reality.
1.2.1. Identify Contextual Sources
The digital Environmental Scan was developed initially through our work with the digitised newspaper collections. This work allowed us to identify a more general set of problems, as well as principles and a workflow to address them. The first step was the identification of historical reference works which could be used to evaluate the composition and content of newspaper collections. Such works offer contextual information that is absent from current library systems.
In the case of the newspaper sources, our chosen reference work was the series of Newspaper Press Directories. The nineteenth century saw the emergence of a wide variety of directories, statistical compendia and other reference works seeking to offer systematic, scientific representations of society and economy to readers in business and the professions for whom knowledge was power.[15] As Laurel Brake notes, the newspaper press directories were one of a range of such publications which emerged specifically to meet commercial needs, particularly the needs of publishers, advertisers and their agents.[16] The longest-running and most famous of these publications was Mitchell’s Newspaper Press Directories, which began in 1846 and appeared annually from 1856. In 1856 the Post Office recognised Mitchell’s Directory as an authoritative list of British newspapers. Although it faced competition from rival publications after 1870, Mitchell’s retained its authoritative status well beyond the First World War.[17]
Newspaper directories played a crucial role in the history of media and advertising in the nineteenth century, collating newspapers’ self-representations, codifying newspaper types, and facilitating the emergence of a knowable national press long before a few London-based titles came to dominate the newspaper landscape. As Tom O’Malley notes, from the outset directories sought to bring dignity and respectability to the newspaper industry: in its first edition Mitchell’s declared its intention to offer ‘a more dignified and permanent record’ of the press than any previously available.[18] Mitchell’s served the interests of advertisers, newspaper proprietors and journalists, providing detailed information on frequency of publication, area of circulation, pricing, political stance (if any), ownership and more. This is contemporary information not easily recovered from other sources, including standard library catalogues. It is this information that we have sought to extract and structure as enriched metadata.
Directories operated as a geographical index of the press, mainly distinguishing between the London and ‘provincial’ newspapers, which were in turn subdivided first by nation (the English, Scottish, Welsh and Irish press usually had separate sections), then by county, and finally, at the lowest level, by town or district. As O’Malley notes, directories both described and shaped the press with their typologies and their choice of features to foreground. Figure 1.1 shows an excerpt from Mitchell’s 1908 directory, describing an English title, the Chatham News from Kent. In addition to information about price, frequency, declared politics (here ‘Independent’), longevity and ownership, each newspaper profile included information about where the paper could be purchased and the type of coverage it offered. Entries were often cross-referenced, as here, to fuller paid advertisements to be found within the Directory, and were preceded by brief descriptions of the locality. Our task on LwM was to find a way systematically to structure this rich information so that it could become ‘enhanced metadata’ for researchers seeking to work with digitised newspapers.
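To indicate the kind of structured record such an entry might yield, the sketch below encodes the Chatham News example as a simple Python dictionary. Only the title, place, county, declared politics and directory year are taken from the description above; the remaining fields are placeholders for the sort of attributes an entry typically contains, not transcriptions from Mitchell’s.

```python
# One parsed directory entry as a structured record (a sketch, not a transcription).
# Title, place, county, politics and directory year follow the example discussed above;
# the remaining values are placeholders for attributes an entry typically contains.
entry = {
    "title": "Chatham News",
    "place": "Chatham",
    "county": "Kent",
    "country": "England",
    "politics": "Independent",
    "price": None,            # e.g. '1d', once parsed from the entry text
    "frequency": None,        # e.g. 'Saturday' or 'Weekly'
    "established": None,      # year the paper claims to have been founded
    "proprietor": None,
    "directory_year": 1908,   # the edition of the directory the entry comes from
}
```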
1.2.2. Create Metadata from Reference Works
While it might be relatively simple to find directories or reference works to contextualise a dataset, turning these sources into metadata is non-trivial. Such works were designed to be read by researchers, not machines. The press directories illustrate the complexity of the task: creating reference metadata from their bound volumes involved multiple steps. In 2019, the Living with Machines project asked the British Library Imaging Studio to digitise all editions of Mitchell’s Press Directories, spanning from 1846 to 1920. Led by Beelen, the project then further processed these scans to extract the text and make it machine-readable. The principal goal was to convert scans of the pages into a structured format (i.e. tabular data). There were three separate steps to this:
- transform page images into text through optical character recognition (OCR)
- detect individual entries
- automatically parse the content of each entry for the attributes of the newspaper that would be turned into enhanced metadata (i.e. information about newspaper titles, prices, political leanings, etc.).
As is often the case with heritage collections, this task proved to be more complex than initially expected, but we were keen to develop an automated method, rather than rely on manual transcription, in order to demonstrate the potential for applying this method to multiple collection types, regardless of size.
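For readers who want a sense of what these three steps can look like in practice, the sketch below strings together OCR, entry detection and attribute parsing for a single scanned page, using Python with the pytesseract wrapper around the Tesseract OCR engine. It is a deliberately simplified illustration rather than the project’s actual pipeline: the regular expressions, field names and file path are hypothetical, and real directory pages demand much more robust layout analysis and error handling.

```python
import re

import pandas as pd
import pytesseract                     # assumes the Tesseract OCR engine is installed
from PIL import Image


def ocr_page(image_path: str) -> str:
    """Step 1: transform a page image into raw text through OCR."""
    return pytesseract.image_to_string(Image.open(image_path))


def detect_entries(page_text: str) -> list[str]:
    """Step 2: split the page text into individual directory entries.
    Here we naively assume each entry starts on a new line with a title in capitals;
    the real layout is far less regular."""
    pattern = re.compile(r"\n(?=[A-Z][A-Z&' .-]+[,.])")
    return [e.strip() for e in pattern.split(page_text) if e.strip()]


def parse_entry(entry: str) -> dict:
    """Step 3: pull out attributes (price, politics, etc.) that become enhanced metadata."""
    def find(pattern: str) -> str | None:
        m = re.search(pattern, entry, flags=re.IGNORECASE)
        return m.group(1).strip() if m else None

    return {
        "title": entry.splitlines()[0].strip(" .,"),
        "price": find(r"price\s+([^;.]+)"),
        "politics": find(r"\b(liberal|conservative|independent|neutral)\b"),
        "established": find(r"established\s+(\d{4})"),
    }


def scan_page(image_path: str) -> pd.DataFrame:
    """Run the three steps and return tabular reference metadata for one page."""
    entries = detect_entries(ocr_page(image_path))
    return pd.DataFrame(parse_entry(e) for e in entries)


# Example call on a hypothetical page scan:
# print(scan_page("directory_page_123.png"))
```

Even this toy version makes clear why the task is harder than it first appears: OCR errors propagate into entry detection, and entry detection errors propagate into parsing, which is why the automated workflow had to be developed and evaluated with care.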