10. Using technology creatively: digital history
SIGNPOST: Relevant to all researchers
The bold and beautiful tiger, bursting through the barriers, is a symbol for the computing power of digital history updating traditional research processes.
10.1 Non-stop learning
Whether the research project focuses upon the history of poetry or economics – whether it is about gender, discourse, diplomacy or cash flows – the way historians analyse their sources has changed substantially in the last thirty years.1 As a large (though selective) proportion of inherited print and manuscripts have been digitized, all historians have become ‘digital’. They access many of their sources as digital files and write history while sitting in front of a screen.
Furthermore, ‘born digital’ sources are increasingly important, and images that were traditionally inaccessible in museum collections are now much more available – and available for analysis – in digital form. For most historians, ‘the object of study’ (whatever its nature) is often now a collection of digital files. As a result, and however well or awkwardly they manage, all who study the past are ‘digital historians’ now.
A whole new series of tools and approaches to analysing this new data has emerged. Most are available online and all involve some form of computational analysis. Many methodologies have been borrowed from other disciplines. Corpus Linguistics (the statistical analysis of large bodies of text) allows historians to review inherited texts in new ways, while geography brings its knowledge of space and mapping, via sophisticated Geographical Information Systems (GIS). Computer science also adds techniques like ‘topic modelling’, network analysis and the statistical analysis of graphics files. And the broader world of ‘big data’ has lent tools for the organization and cross-analysis of all forms of historical evidence.
For historians, the big challenge is to know what tools to use – and, equally, how much effort to expend in learning how to use them. Most researchers into the past don’t want to specialize as computer programmers and/or as geographers.
Hence it’s crucial to know where to find good advice and, simultaneously, to know when to get back to the archive or the original source deposit. In other words, historians need to have enough information to make the right choices. Digital history must be done well, and neither misused nor confused.
10.2 The Programming Historian
The single best starting-point for anyone seeking to use digital methodologies when working with historical materials is The Programming Historian (PH).2 This admirable resource is now published in three languages (English, French and Spanish). It presents dozens of tutorials, written by historians, and specifically designed to support fellow researchers. Users can thus identify the most helpful tools, and develop the basic skills to apply them credibly.
Started in 2008 by William Turkel and Alan MacEachern, the PH has grown into a major international initiative. It is staffed by volunteers; and, year by year, it augments and revises its set of tutorials. Currently, there are 150 published tutorials, ranging from an ‘Intro to Google Maps and Google Earth’ and an ‘Introduction to Jupyter Notebooks’ to more advanced topics like ‘Manipulating Strings in Python’. Each tutorial is built around historical data and should take no more than an hour or so to work through.
Most historians will not need to work through more than a handful of tutorials to undertake all the data-processing jobs required by a substantial research project. But it is worth getting to know what is available; and it’s advisable to return to the PH when confronted by a new task or data challenge.
Some will want to become expert in one specific technique or approach. The most important thing, however, is a willingness to investigate. Researchers should be open to learning what they need, as and when they need it. Sampling a few tutorials is a fine way to explore the options and to check whether any specific approach will be useful.
Needless to say, the perennial advice given to researchers also applies in this case. Always weigh the time/benefit trade-off. It’s vital not to overlook crucial research tools – just as it’s disastrous to embark on some great but labour-intensive approach, which will take longer than the time available for the project. Here, advice from fellow researchers and supervisors can be invaluable.
10.3 Text as data
Much historical research involves reading ‘text’, whether this takes the form of published books and newspapers, manuscript archives or a million Tweets. And in all cases where the texts have been digitized (or emerged ‘born digital’), they have become a form of ‘data’. That is, they constitute strings of code, which can be analysed at scale.
This new characteristic of inherited (historic) texts has provided one of the great ‘affordances’ of modern scholarship, creating a large number of new opportunities. Being able to count how historical language changes over time is remarkably revealing. To take a single example, an analysis showing how different vocabularies are used to describe men’s and women’s work across different sources offers a powerful reflection upon the workings of patriarchy.3 Or a study tracing the spread of new technologies (through the occurrence of the new words used to describe them) would serve as a way of illustrating broader trends in changing socio-cultural attitudes.
At its simplest, moreover, counting the frequency of words in a single article will help to elucidate its core message. A quick and straightforward way of testing changing language over time is provided by the Google Ngram Viewer.4 This tool allows researchers to chart the frequency of any word they choose against a large corpus of published text. Entering a simple search string, such as ‘iron, steel’, and charting its usage against English publications since 1700 would provide a rapid measure of evolving industrialization. Meanwhile, for those wishing to make a quick check for leads in the archive of Western texts, the Ngram Viewer also works well. Yet most historians work on a more specific collection of materials. Hence researchers need to create a collection or ‘corpus’ of texts for their own purposes.
Before starting, it’s essential to ensure that the chosen ‘text’ is useable and consistent. Many digital resources online provide only an image of a page rather than transcribed texts, making it hard to generate a reliable corpus for advanced study. If the pages are printed, it is normally possible to use an Optical Character Recognition (OCR) tool, of the sort that is available in common PDF readers such as Adobe Acrobat.5
However, manuscripts – if not already transcribed by someone else – will require a different approach. Even web resources that provide a digital text are frequently of poor quality. A source such as the Burney Collection of Seventeenth- and Eighteenth-Century Newspapers, for instance, which provides both images of the original page and OCR’d text for searching and analysis, is marred by an almost unacceptable error rate. In the underlying text used for searches, less than half of all semantically significant words are accurately transcribed.6 Errors generally arise from lack of visual clarity in the original eighteenth-century printed pages, which can often be faded and tattered. These kinds of transcribed sources – including Google Books – are still useable, since the scale of the collections makes topics findable, even given the errors. Nonetheless, researchers need to understand their intrinsic level of errors.
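One rough and ready way of gauging that error level is to check what proportion of the words on an OCR’d page appear in a reference word list. The following sketch (in Python, the language used by many Programming Historian tutorials) assumes two hypothetical files – a transcribed page and a word list – and any serious assessment would need a period-appropriate lexicon and proper sampling; it is offered only as a first, indicative test.

    # A rough, illustrative check of OCR quality: what share of the words in an
    # OCR'd page appear in a reference word list? Both file names are hypothetical.
    import re

    with open('wordlist.txt', encoding='utf-8') as f:
        lexicon = {line.strip().lower() for line in f}

    with open('ocr_page.txt', encoding='utf-8') as f:
        words = re.findall(r'[a-z]+', f.read().lower())

    found = sum(1 for w in words if w in lexicon)
    print(f'{found / len(words):.1%} of words matched the reference list')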
Having established the level of precision required, researchers then need either to gather the material in which they are interested manually or, alternatively, to ‘scrape’ a site automatically to collect the relevant elements. The Programming Historian has several tutorials that can help. And, when working with published books, organizations such as the Hathi Trust Digital Library7 can also provide accurately transcribed texts.
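For those who choose the automated route, a scrape need not be elaborate. The sketch below fetches a single web page and extracts its paragraph text using the widely used requests and BeautifulSoup libraries; the URL and the CSS selector are placeholders, to be adapted to the site in question, and researchers should always check a site’s terms of use before scraping it.

    # A minimal scraping sketch: download one page and print its paragraphs.
    # The URL and the 'div.article p' selector are illustrative placeholders.
    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.org/newspaper/issue-1'      # hypothetical page
    response = requests.get(url, timeout=30)
    response.raise_for_status()                        # stop on a failed download

    soup = BeautifulSoup(response.text, 'html.parser')
    for paragraph in soup.select('div.article p'):     # adapt to the site's markup
        print(paragraph.get_text(strip=True))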
Then researchers need an appropriate set of tools that will allow them to interrogate the corpus. An easy way to begin is using an online environment such as Voyant Tools.8 Created by Stéfan Sinclair and Geoffrey Rockwell, Voyant allows researchers simply to copy and paste any text into an online box, or to upload a file or set of files from their own computers, or to enter a URL for a public website. Voyant will then provide a set of word frequency tables and visualizations that will help to reveal underlying patterns.
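For researchers who eventually outgrow the browser window, the same basic counting can be reproduced in a few lines of code. The sketch below tallies the most frequent words in a single text file; the filename and the short stop-word list are purely illustrative, and a real analysis would use a fuller stop-word list suited to the period and language of the sources.

    # A minimal word-frequency count for one text file ('article.txt' is illustrative).
    from collections import Counter
    import re

    with open('article.txt', encoding='utf-8') as f:
        text = f.read().lower()

    words = re.findall(r"[a-z']+", text)                    # crude tokenization
    stopwords = {'the', 'and', 'of', 'to', 'a', 'in', 'that', 'is'}
    counts = Counter(w for w in words if w not in stopwords)

    for word, n in counts.most_common(20):                  # twenty most frequent words
        print(word, n)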
10.4 Quantitative data as data
More regular data, whether entries in an account book, Parliamentary returns or any form of consistent record for that matter, will also benefit from analysis using either a database or spreadsheet. As with text, the most important first step is to know the data. Some projects are based around a deep engagement with a single source; and, in such cases, it is very helpful to survey the full range of material available from beginning to end, before deciding precisely what technology to use. Even apparently consistent series of records tend to change subtly decade by decade. Other projects build upon a variety of sources about a single topic or individual, and often involve working with a range of different discrete datasets. Again, researchers should engage closely with the material and then consider the basic options. The first decision is how to get the data from the historical records into a useable form; and the second is what ‘package’, or set of tools, to use in their analysis and visualization.
Two basic approaches to data organization take the form of databases and spreadsheets. The first has traditionally been used for recording textual and mixed textual/numeric data, while the latter evolved from accounting and is associated with recording detailed numerical data. In fact, however, both databases and spreadsheets do almost precisely the same job, and do it in very much the same way. Underlying both is a series of ‘cells’ or ‘fields’ that collectively and in their raw form appear as a simple table, with each row representing a separate item, and each column a category of information. Data from either can then be exported into a straightforward CSV format that can in turn be used in any data-processing package.
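A few lines of code illustrate this shared row-and-column model and the export step; the field names and sample entries below are invented purely for illustration.

    # Each row is an item, each column a category; the whole table travels as CSV.
    # 'burials.csv' and its columns are hypothetical examples.
    import csv

    rows = [
        {'year': 1785, 'parish': 'St Botolph', 'burials': 142},
        {'year': 1786, 'parish': 'St Botolph', 'burials': 131},
    ]

    with open('burials.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['year', 'parish', 'burials'])
        writer.writeheader()
        writer.writerows(rows)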
Databases have the great advantage that they facilitate data entry – researchers can easily design a bespoke data entry ‘form’ for this purpose. Yet databases do not lend themselves readily to statistical analysis. In contrast, spreadsheets such as Excel are slow to populate with accurate data, but tend to include more tools for manipulating and analysing the data. Whichever is chosen, it remains essential to be as true to the original sources as possible.
It is worthwhile starting with a simple exercise, such as creating a table in Word and seeing how easy (or otherwise) it is to populate with information. The process of entering just a few pages of historical materials will reveal key problems with labels and categories, and allow further consideration before completion. It is crucial to ensure that basic classifications can be trusted and shared with other researchers. Idiosyncratic inventions may seem fun at the time, but are valueless if rejected as meaningless by other researchers.
There will always be ‘edge cases’ – instances that can fall into more than one category, or which are simply ambiguous – but a good understanding of the material will help keep those to a minimum. It is also vital to respect the sources, and to include as much data as possible. When starting out, it is easy to assume that the initial questions should limit what is transcribed. But including as much data as possible – even when it initially seems tangential – will make the data more robust and also more capable of being re-used. Research questions change in the course of study, and data that seemed irrelevant at the start could well be crucial in the end.
Having decided how much data to record and in what format, the next step is to determine how best to enter it. This process can be tedious beyond measure. Here databases have an initial advantage, in that it is relatively easy to create a bespoke ‘data entry form’. And entering data into a well-designed template – tabbing from field to field – is infinitely less irritating than trying to navigate a spreadsheet. At the same time, however, spreadsheets will also allow researchers to create a data entry ‘template’, and it is worth doing so.
For some purposes, regular data is already available. The UK Data Archive9 and services such as Zenodo10 hold thousands of datasets that can be downloaded and re-used for free. And while other people’s data is never quite right, and will need to be adapted and cleaned in services such as OpenRefine,11 it is always worth checking what is available before sitting down to build a new dataset. Incidentally, gaining access to data held by commercial companies such as FindMyPast.co.uk and Ancestry.co.uk is much more difficult, but always worth a try.
Whether, in the end, the decision is to re-use already established datasets, or to create a simple spreadsheet, or to populate a complex relational database, the next step is to see what the data says; and for such analysis, different tools are needed. Traditionally, very large social science datasets were analysed using SPSS (originally devised as a ‘Statistical Package for the Social Sciences’, and now renamed as ‘Statistical Product and Service Solutions’).12 But for most historians, who do not work with huge social science datasets, SPSS is overkill. Instead, the in-built tools within spreadsheets, such as Excel, usually prove more than adequate.
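Researchers who prefer to stay in code can achieve the same modest analyses in a Jupyter Notebook with the pandas library. The sketch below reads the hypothetical CSV file from the earlier illustration and produces simple summary figures; the column names are, again, purely illustrative.

    # Basic descriptive statistics from a CSV file, without SPSS.
    import pandas as pd

    df = pd.read_csv('burials.csv')               # the illustrative file from above
    print(df.describe())                          # count, mean, spread, quartiles
    print(df.groupby('parish')['burials'].sum())  # totals per category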
Moreover, because both spreadsheets and databases export to common formats such as CSV, data first created in them can be re-used in more flexible visualization environments. Examples include Gephi, an open-source tool for visualizing graphs and networks,13 and the appealingly named ‘Many Eyes’.14 Historians working on networks, in particular, have found in Gephi an all-important set of tools. But requirements are always changing, so it’s helpful to survey the options regularly.
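To make the network route concrete, the sketch below builds a directed network from a hypothetical CSV edge list of letter-writers and recipients, using the networkx library, and exports it in GEXF format, which Gephi can open; the file and column names are illustrative.

    # From a CSV edge list to a Gephi-readable network file.
    # 'letters.csv' with 'sender' and 'recipient' columns is a hypothetical example.
    import csv
    import networkx as nx

    G = nx.DiGraph()
    with open('letters.csv', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            G.add_edge(row['sender'], row['recipient'])

    print(G.number_of_nodes(), 'correspondents;', G.number_of_edges(), 'links')
    nx.write_gexf(G, 'letters.gexf')              # open this file in Gephi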
10.5 Space and place
Geographical information has in particular attracted its own sophisticated set of tools and research environments. From the early 1960s, geographers began to develop a series of conventions and tools for working with computers that eventually evolved into a set of distinct approaches and a suite of commercial software. At its most sophisticated, this software has become intimidatingly complex and expensive to access. Nevertheless, it does allow researchers to manipulate and analyse specific locations and polygons defining areas on the globe, and to layer sets of data one over another, in order to reveal patterns.
Across all the natural sciences, GIS (Geographic Information Systems) in particular has become an indispensable tool. And if individual researchers do not have access to the software via an educational institution, then there is also a free version, QGIS.15 The welcome ubiquity of this tool is a sign of the enhanced importance given to space and place in all forms of analysis.16
Historians who are taking their first steps down this road can meanwhile find simpler solutions. Google Earth17 makes possible the mapping of most types of simple data, while sites such as BatchGeo18 allow lists of addresses to be turned into specific geo-references, which can then be mapped. Once again, too, The Programming Historian has several tutorials that can help.
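Researchers comfortable with a little code can also geocode a list of place names directly. The sketch below uses the geopy library’s Nominatim geocoder (built on OpenStreetMap data) as one possible alternative to BatchGeo; the place names are illustrative, and the free service expects modest, politely spaced request rates.

    # Turning place names into coordinates with geopy's Nominatim geocoder.
    # The place names are illustrative examples.
    from geopy.geocoders import Nominatim
    from time import sleep

    geolocator = Nominatim(user_agent='digital-history-example')
    places = ['Spitalfields, London', 'Lyon, France']

    for place in places:
        location = geolocator.geocode(place)
        if location:
            print(place, location.latitude, location.longitude)
        sleep(1)                                  # respect the service's rate limits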
Undoubtedly, one of the great advantages of working with geographical data is that there is simply a huge amount of it out there; and much of it is free to use. Geographical data can also be exported to a CSV format and analysed in combination with other varieties of information.
Historical maps are particularly vital sources of geographical data. These have long been used and exploited by historians, but are now increasingly available online, and in a ‘warped’ format – that is, stretched to align with modern geography. A site such as Old Maps Online19 gives access to over 400,000 historical maps, held in over forty-five libraries and archives. Most can also be viewed as an overlay on a modern map, and many can be imported into Google Earth for further manipulation. As a result, sensitivity to place can now be readily incorporated into historical analysis, not as an optional extra but as an integral part of the research.
10.6 The delights and dangers of data
Digital history provides massively useful tools for historians. But there’s no need to become a computer scientist in order to wield these tools. And there’s no need to fear them either. Adding a map, citing a word frequency or including a graph simply makes a good historical study all the more compelling.
Indeed, mastering at least some digital tools is a liberating experience. New applications become apparent. New connections between old sources and new techniques are made possible. A list suddenly suggests itself as a map; a set of texts becomes a corpus; and a collection of letters, a virtual network.
Remember, however, that history-writing is a craft. Digital techniques help towards making useful steps on the journey. They are not ends in themselves. Historians strive to inform and influence their readers, not to stun or dazzle them with methodology.
Easily processed data, resulting in ‘shock and awe’ graphics, is seductive. It is easy to be impressed by the sheer scale of some datasets, and to assume that any patterns revealed must be significant. But researchers must always remain wary. There are biases within all sources which have been digitized. And the historian’s first task remains to identify and to make interpretative allowances for such in-built biases. Otherwise, the smart new techno-data is simply magnifying (and possibly obscuring) old limitations within the data.
To take a single example, an exciting project like Mapping the Republic of Letters20 is based on 20,000 letters, written by major figures of the European Enlightenment. However, the choice of letters for digitization was determined by what was already available in print (largely early to mid twentieth-century publications). Hence, editorial decisions, made in the 1950s and earlier, have been given new currency; and the racial and gender biases (whether implicit or sometimes explicit) of a long-superseded scholarship have been reinforced. What appears at first sight to be new research at the cutting edge of technology frequently turns out to reproduce older, Euro-centric histories. Historians should therefore cultivate the skill of interrogating silences in the archives – and looking hard for alternative forms of evidence.21
In addition, search algorithms often incorporate unstated biases, such as outdated assumptions about ‘race’. One popular tool for identifying patterns in large corpora is called ‘Mallet’. It identifies ‘topics’ using a complex algorithm that measures word frequency and co-location. The problem for all researchers is that Mallet – and ‘topic modelling’ in general22 – works as a black-box system, which obscures its working criteria. The results can appear compelling, producing lists of seemingly related words, which may then be identified as ‘topics’ by researchers. But few understand the algorithms involved, so most accept the results without really understanding how they were generated.
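A small, contrived example makes the point. Mallet itself is a Java tool; the sketch below instead uses scikit-learn’s implementation of the same family of algorithms (Latent Dirichlet Allocation) on three invented ‘documents’. The output looks like neatly labelled ‘topics’, yet its meaning depends entirely on how critically the researcher reads it.

    # Topic modelling in miniature, using scikit-learn's LDA rather than Mallet.
    # The 'documents' are invented; real corpora would be far larger.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    documents = [
        'wheat price market harvest corn',
        'ship cargo port merchant voyage',
        'harvest corn price famine bread',
    ]

    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(documents)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(matrix)

    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-5:][::-1]]   # five heaviest words
        print('Topic', i, ':', ', '.join(top))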
Ultimately, the best and only response for researchers is to know their sources; and to know what questions they are posing – and why. Those crucial requirements are more vital than ever when the data is being fed en masse into machines, whose workings are frequently opaque. In the current media-dominated world, the bombardment of data and images can become vertiginous. Indeed, cultural critics like Jean Baudrillard worry that not just historians but all citizens will become confused into a state of ‘hyper-reality’, indistinguishable from the real thing.23 Hence everyone should cultivate a good countervailing sense of scepticism. And all should avoid the trap of uncritical techno-worship. That danger was noted by Lewis Mumford in 196224 when he warned that:
minds unduly fascinated by computers carefully confine themselves to asking only the kind of question that computers can answer and are completely negligent of the human contents or the human results.
In other words, historical researchers, and not their research tools, are the ones who produce research – and the ones who bear responsibility for the outcomes.
1 For overviews, see Levenberg, Neilson and Rheams (ed.), Research Methods for the Digital Humanities; Milligan, History in the Age of Abundance; Salmi, What Is Digital History?; and Crymble, Technology and the Historian.
2 <https://programminghistorian.org> [accessed 30 April 2021].
3 See J. Holmes, Gendered Talk at Work: Constructing Gender Identity through Workplace Discourse (Oxford, 2006); and context in J. M. Bennett, History Matters: Patriarchy and the Challenge of Feminism (Philadelphia, Pa., 2006).
4 <https://books.google.com/ngrams> [accessed 30 April 2021].
5 For Adobe Acrobat, see: <https://www.adobe.com/uk/> [accessed 25 March 2022].
6 S. Tanner, T. Muñoz and P. H. Ros, ‘Measuring mass text digitization quality and usefulness: lessons learned from assessing the OCR accuracy of the British Library’s nineteenth-century online newspaper archive’, D-Lib Magazine (2009): <https://doi.org/10.1045/july2009-munoz> [accessed 30 April 2021].
7 <https://www.hathitrust.org> [accessed 30 April 2021].
8 <https://voyant-tools.org> [accessed 29 April 2021].
9 <https://www.data-archive.ac.uk> [accessed 30 April 2021].
10 <https://zenodo.org> [accessed 30 April 2021].
11 <https://openrefine.org> [accessed 30 April 2021].
12 G. Argyrous, Statistics for Research: With a Guide to SPSS (London, 2005).
13 <https://gephi.org> [accessed 30 April 2021].
14 <https://boostlabs.com/blog/ibms-many-eyes-online-data-visualization-tool> [accessed 30 April 2021].
15 <https://qgis.org/en/site> [accessed 30 April 2021].
16 Some historians have always been concerned with space and place; but for the recent enhanced interest, see T. Zeller, ‘The spatial turn in history’, Bulletin of the German Historical Institute, xxxv (2004), 123–4; B. Warf and S. Arias (ed.), The Spatial Turn: Interdisciplinary Perspectives (London, 2009); R. T. Tally, Spatiality (London, 2013).
17 <https://earth.google.com/web> [accessed 30 April 2021].
18 <https://batchgeo.com> [accessed 30 April 2021].
19 <https://www.oldmapsonline.org> [accessed 30 April 2021].
20 <http://republicofletters.stanford.edu> [accessed 30 April 2021].
21 On missing or concealed evidence, see Johnson, Fowler and Thomas, The Silence of the Archive.
22 See M. R. Brett, ‘Topic modelling: a basic introduction’, Journal of Digital Humanities, ii (2012): <http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett> [accessed 30 April 2021]; and Leetaru, Data Mining Methods.
23 R. G. Smith and D. B. Clarke (ed.), Jean Baudrillard: From Hyper-Reality to Disappearance – Uncollected Interviews (Edinburgh, 2015).
24 L. Mumford, ‘The sky line “Mother Jacobs home remedies”’, The New Yorker, 1 Dec. 1962, 148.