6. Managing masses of data

SIGNPOST: Relevant to all researchers

The ‘Behemoth’ is a terrifyingly powerful mythological beast, signifying primal chaos, from the Biblical Book of Job (40:15–24).

6.1 The flood of information

Gustave Flaubert observed in 1852 that ‘Writing history is like drinking an ocean and pissing a cupful’.¹ And he was right. Every big history project is an exercise in controlling a flood of notes and documents. Flaubert, whose evocative novels were steeped in history, knew the struggle to control the research Behemoth.² In a world of foolscap and index cards, organizing the information collected in archives and from secondary reading was an individual choice – sometimes an arcane practice passed down from supervisor to student. In part, too, how notes were organized was driven by the kind of source material that was being consulted.

Famously, historian Keith Thomas created a system of note-taking which entailed writing notes on single sides of sheets of paper, entering the bibliographical details of the core source in his personal index book and then cutting the sheets of paper into fragments, which were then allocated to separate envelopes by theme. When an envelope was full to bursting, it suggested a possible topic for a book or essay.³ As Thomas’s work was chiefly based on the collections of the Bodleian Library, this system worked well for him – and allowed him to develop an unusual writing style that juxtaposed many different quotations from different sources.⁴ Other researchers advocate filing notes by subject; or organizing collections of index cards, with keywords.

None of those older systems, however, are fully adequate for modern-day research. With few exceptions, archives now encourage researchers to take photographs of original documents, while many primary sources are most easily consulted online; and the vast majority of secondary readings, particularly articles, come in the form of PDFs (files in portable document format). In the wake of the Covid-19 pandemic (2020), Britain’s National Archives at Kew even stopped researchers from bringing pencil and paper into their search rooms.

All of this makes it much easier to accumulate ever-larger bodies of evidence, but it also poses a real challenge of how to order the flood of notes – and how to find a quotation – among a billion words of source material. Researchers have to update continuously in the new ‘digital age’.⁵ There are a dozen technical solutions to this problem, and commercial firms eager to sell a data management package. But before committing to any one system, it’s best to think hard about both the nature of the sources that are being used and the kind of history that is being written. Some projects will essentially be based on a single source. It is entirely possible to create a doctorate (or book) on the basis of the published British Parliamentary Papers, or from a single archive. Alternatively, one project can draw upon a dozen different archives, and a hundred different types of source material.

As a result, the chosen research data management system will need to be keyed to the relevant sources. If working from published material, for example, the precise edition and page references will be important, while if working in an uncatalogued collection (there are a surprising number still to be found), it is necessary to develop a system from scratch to describe how to find a particular item.

The first point to remember is that researchers will eventually have to be able to write footnotes or endnotes which allow others to trace the precise research journey. The choice between footnotes at the foot of the page or endnotes at the end of each chapter (or at the end of the book) is often specified by publishers, journal editors and/or examination regulations. Where authors have a free choice, footnotes are generally recommended for scholarly presentations. But ultimately these things are a matter of choice.

Whatever system is chosen, it’s vital to ensure that it will record all the information needed for a citation. Some scholarly websites give helpful pointers as to how each individual document is to be cited.⁶ But not all do so. Researchers must therefore check in every case. If the archival reference is not associated with the correct document title and folio references, for instance, and that reference is then demanded for the final doctoral thesis, researchers will find themselves undertaking days of unnecessary (and frankly boring) labour to recover the missing information.
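One way to guard against that kind of lost information is to give every note the same set of citation fields and to check them before leaving the archive. The following is a minimal sketch in Python, not a prescribed method: the class name, the field names and the example archive and reference are all hypothetical, and the list of required fields should be adapted to the sources in hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArchivalNote:
    """One research note, with everything a later footnote will need."""
    archive: Optional[str] = None         # name of the archive or record office
    reference: Optional[str] = None       # archival reference or call-mark
    title: Optional[str] = None           # document title
    folio: Optional[str] = None           # folio or page reference
    date_consulted: Optional[str] = None  # optional, but useful for online items
    text: str = ""                        # the transcription or summary itself


# Assumed minimum for a citation; adjust to the project's own needs.
REQUIRED = ("archive", "reference", "title", "folio")


def missing_fields(note: ArchivalNote) -> List[str]:
    """Return the names of required citation fields that are still empty."""
    return [name for name in REQUIRED if not getattr(note, name)]


# Hypothetical example: the folio number was never written down.
note = ArchivalNote(
    archive="Example Record Office",
    reference="EX/1/2",
    title="Minute book",
    text="Resolved that ...",
)
print(missing_fields(note))  # ['folio']
```

Running a check of this kind at the end of each day in the archive takes seconds, whereas recovering a missing folio number later may mean a return visit.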

A second point for researchers then follows: can you find it? It is remarkable how quickly even key details can be forgotten! And coming back to a file of a thousand pages in search of a single quotation is daunting, even assuming that it’s possible to conduct a keyword search. Hence it’s vital for all researchers to find some way of ‘indexing’, or keywording, or tagging the research notes and images.
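For those who keep notes digitally, keywording can be as simple as a lookup table from each keyword to the notes that carry it. The sketch below assumes that every note has a short identifier of the researcher's own devising; the identifiers and keywords shown are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_tag_index(notes: Iterable[Tuple[str, Iterable[str]]]) -> Dict[str, Set[str]]:
    """Map each keyword (lower-cased) to the set of note identifiers tagged with it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for note_id, tags in notes:
        for tag in tags:
            index[tag.strip().lower()].add(note_id)
    return dict(index)


def find_notes(index: Dict[str, Set[str]], *tags: str) -> List[str]:
    """Return the identifiers of notes that carry ALL of the given keywords."""
    if not tags:
        return []
    matches = [index.get(tag.strip().lower(), set()) for tag in tags]
    return sorted(set.intersection(*matches))


# Invented identifiers and keywords, for illustration only.
notes = [
    ("note-001", ["Policing", "London"]),
    ("note-002", ["policing", "courts"]),
    ("note-003", ["London", "poverty"]),
]
index = build_tag_index(notes)
print(find_notes(index, "policing"))            # ['note-001', 'note-002']
print(find_notes(index, "policing", "london"))  # ['note-001']
```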

And a third point: are the notes easy to use when writing? People write differently, but history is always a confection of comment and evidence; and organizing the evidence and ensuring that every quotation is accurate, and every statistic is correct, is key. In other words, researchers should work out how they write and use evidence – and organize references accordingly. And it’s worth adding here that there are two common styles of presenting notes (as well as many variations). The Harvard system locates all bibliographical details of sources at the end of a given piece of work, and cites within the text (in brackets) simply the author surname, date of publication and page references, if needed. It works flexibly and well when citing single published works, but is cumbersome for referencing many sources, and especially for citing non-published sources (documents; art objects; artefacts), which often have lengthy call-marks.

By contrast, the Oxford system (used in this Guide) inserts superscripts in the text at key junctures, and then places all bibliographical information in a running sequence of footnotes or endnotes. These also allow scope for additional commentary from authors, although these days pithiness is in vogue. Either way, readers should be able to track all key sources, whether primary or secondary. Of course, they don’t all do so. But examiners will do – and the option should be open to all comers.⁷
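The difference between the two styles is easiest to see when both are generated from the same stored record. The sketch below is illustrative only: it handles a single-author book, uses the work cited in this chapter's note 8, and the page number is a placeholder rather than a real reference. Actual style guides contain many more rules than are modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Book:
    initials: str
    surname: str
    title: str
    place: str
    year: int


def harvard_in_text(book: Book, page: Optional[int] = None) -> str:
    """Bracketed in-text citation: surname, date and (if needed) page."""
    if page is None:
        return f"({book.surname} {book.year})"
    return f"({book.surname} {book.year}, p. {page})"


def oxford_footnote(book: Book, page: Optional[int] = None) -> str:
    """Full reference for a footnote or endnote, in the style used in this Guide."""
    note = f"{book.initials} {book.surname}, {book.title} ({book.place}, {book.year})"
    if page is None:
        return note + "."
    return f"{note}, p. {page}."


beattie = Book(
    initials="J. M.",
    surname="Beattie",
    title="The First English Detectives: the Bow Street Runners "
          "and the Policing of London, 1750–1840",
    place="London",
    year=2012,
)
# Page 5 is a placeholder, not a real page reference.
print(harvard_in_text(beattie, 5))
print(oxford_footnote(beattie, 5))
```

The practical point is that neither style needs to be chosen at the note-taking stage, so long as every field is recorded.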

6.2 Identifying files: what’s in a name?

Any long-form piece of history-writing will be based on dozens or hundreds of separate files. Many will be PDFs, but others will be images of original documents; or data might be recorded in spreadsheets and CSV (comma-separated values) files. Or 3D models, maps and KMZ files might be used; or XML, audio and video files. To keep track of all these files, a consistent and ‘human-readable’ system of file names is essential. Most people create a new hierarchical system of folders (backed up to the cloud) that reflects the topics in a bigger project. But an alternative system is to organize by archive or source type. The important thing is that the chosen file names are human-readable, clear and distinctive. A folder called ‘chapter 2’, which holds files with automatically generated names – a string of numbers and letters – will be impossible to manage when it comes to organizing research materials. In contrast, a folder, perhaps called ‘historiography, chapter 2’, that includes short, readable versions of the titles of individual items – such as Beattie, English Detectives, 2012.pdf ⁸ – is going to be much more helpful. Keywords can also be added to file names – though adding more than two or three is likely to become unwieldy.
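A naming convention of this kind can be applied by hand, but it is also easy to automate. The following sketch builds a file name on the pattern of the Beattie example above; the function name, the three-keyword limit and the list of characters removed are assumptions made for illustration, and can be changed to suit.

```python
import re

# Characters that cause trouble in file names on common operating systems.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def readable_filename(surname, short_title, year, keywords=(), ext="pdf"):
    """Build a name such as 'Beattie, English Detectives, 2012.pdf'."""
    if len(keywords) > 3:
        raise ValueError("more than three keywords makes a file name unwieldy")
    parts = [surname, short_title, str(year), *keywords]
    # Replace forbidden characters with spaces, then collapse repeated spaces.
    cleaned = [" ".join(_FORBIDDEN.sub(" ", part).split()) for part in parts]
    return ", ".join(cleaned) + "." + ext.lstrip(".")


print(readable_filename("Beattie", "English Detectives", 2012))
# Beattie, English Detectives, 2012.pdf
print(readable_filename("Beattie", "English Detectives", 2012, keywords=["policing"]))
# Beattie, English Detectives, 2012, policing.pdf
```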

Archival photographs are particularly difficult to organize effectively. There are new(ish) systems that allow researchers to rename and organize photographs directly from phone or camera. The leading system for historians at the moment is Tropy.⁹ It’s a system which allows an archival reference and keywords to be used, so that images are automatically renamed. That process makes managing and sharing such images much easier.

However, even without using an image management system, it is important to ensure that absolutely all the information needed to cite a page of photographed manuscript is included. It is worth, for example, creating a separate folder for all the images of a single document, and using the archival reference and title as the folder name. It is also important that the first image taken in a series of images is of the spine, cover or first page of a document, and includes the archival reference.
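The same discipline can be sketched in code for those who rename their photographs in bulk. This is one possible convention, not a description of how Tropy or any other package works: the archival reference shown is invented, and the choice to reserve the first image name for the cover or spine shot is an assumption that follows the advice above.

```python
import re
from typing import List


def _safe(reference: str) -> str:
    """Make an archival reference usable in a file or folder name."""
    return reference.replace("/", "-")


def folder_name(reference: str, title: str) -> str:
    """Combine archival reference and document title into one folder name."""
    raw = f"{_safe(reference)} {title}"
    return " ".join(re.sub(r'[\\:*?"<>|]', " ", raw).split())


def image_names(reference: str, count: int, ext: str = "jpg") -> List[str]:
    """Names for `count` photographs; the first is the cover or spine shot."""
    if count < 1:
        raise ValueError("at least one image (the cover shot) is expected")
    names = [f"{_safe(reference)}_000_cover.{ext}"]
    names += [f"{_safe(reference)}_{i:03d}.{ext}" for i in range(1, count)]
    return names


# Invented reference and title, for illustration only.
print(folder_name("EX/1/2", "Minute book"))  # EX-1-2 Minute book
print(image_names("EX/1/2", 3))
# ['EX-1-2_000_cover.jpg', 'EX-1-2_001.jpg', 'EX-1-2_002.jpg']
```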

Equally challenging are online sources. Simply recording a URL (Uniform Resource Locator, or web address) is not good enough. Many sites change and evolve, and the information found one month could well be gone the next. A lot of sites also insist on registering or signing in (or using library sign-in systems like Shibboleth or OpenAthens). This method creates a session URL, but, unfortunately, it can’t be shared with other researchers and won’t work at a later date.¹⁰

Most websites that allow keyword searching also create a ‘search URL’, which includes all the information in a search form. But these URLs quickly become both unreadable and fragile. Many sites will allow researchers to copy or download specific items and objects, and this technique offers one solution. Renaming the files will then help to organize them.

Another solution is to use the Wayback Machine, which is run by the Internet Archive. This method allows users to ‘archive’ a specific web page, as it appears on the day it was consulted. The new URL can then be saved, allowing researchers to be certain of finding (and being able to cite) the page as it was when originally consulted. The Wayback Machine is the closest thing available to an ‘archive’ of the internet.¹¹
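Archived addresses follow a regular pattern, which makes them easy to record alongside the original URL. The sketch below only builds the addresses and makes no network requests. The URL patterns shown (a 'save' address, a dated snapshot address and an availability query) reflect the Wayback Machine's public conventions as understood at the time of writing and should be checked against the Internet Archive's current documentation; the function names are hypothetical.

```python
from datetime import date
from urllib.parse import urlencode

WAYBACK = "https://web.archive.org"


def save_request_url(page_url: str) -> str:
    """Address which, when visited, asks the Wayback Machine to capture a page."""
    return f"{WAYBACK}/save/{page_url}"


def snapshot_url(page_url: str, consulted: date) -> str:
    """Address of the capture nearest to the date on which the page was consulted."""
    return f"{WAYBACK}/web/{consulted:%Y%m%d}/{page_url}"


def availability_query(page_url: str, consulted: date) -> str:
    """Query address for checking whether a capture near that date exists."""
    query = urlencode({"url": page_url, "timestamp": f"{consulted:%Y%m%d}"})
    return f"https://archive.org/wayback/available?{query}"


consulted = date(2021, 4, 30)
print(snapshot_url("https://tropy.org", consulted))
# https://web.archive.org/web/20210430/https://tropy.org
print(availability_query("https://tropy.org", consulted))
```

Recording both the original URL and the dated snapshot address gives later readers two routes to the same material.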

6.3 Using data management tools: do I, or don’t I?

There are innumerable data management tools out there that are worth considering. Endnote¹² and Envivo¹³ are two of the most popular commercial packages. However, they cost money – and once researchers start using a package, they may find that they are caught in a system that runs into trouble. Software companies go bust; packages stop being supported; and apparently secure data is suddenly impossible to access. Most packages will allow the export of data to other systems, but before committing to any single package, it’s crucial to ensure that escape can be made, if need be, with all data intact.

The advantages of data management tools can be huge, and many historians swear by them. They allow clear labelling of all items, matched with the capacity to organize them and insert relevant keywords. They also facilitate complex searches and come with a range of in-built tools – for visualizing and analysing the mass of notes. Most usefully of all, many data management systems can be linked to Word documents, enabling footnotes and citations to be generated directly from stored research materials.

Several alternatives are free to users. The most popular is Zotero.¹⁴ This service was created by the Center for History and New Media at George Mason University, Virginia, USA; and it is one of a series of tools, including Omeka and Tropy, which are designed specifically for historians and to aid historical research. Zotero allows users to create a free account, and either to develop their own bibliography or to share themed bibliographies created by others. It also allows users to upload PDFs to their accounts, and then to search them; and to relate these items to the full text in Google Books.

What’s more, through a series of plug-ins, researchers can then automatically generate an accurate record of every item read elsewhere. When reading an article online, a simple click of the mouse will generate a full citation and copy the item to each individual’s Zotero account. Later, selected items, or indeed the whole bibliography, can be exported to whatever format is needed. And when writing, a plug-in for Word allows researchers to create a footnote – or an in-line citation – by entering the first few letters of an author’s name or item title.

In sum, Zotero is not a comprehensive data management system, but it reproduces many of the functions of expensive commercial packages, and in combination with a well-structured hierarchy of computer files, it makes for a good compromise.

6.4 Finding shorter shortcuts

Once research materials have been assembled, there are also numerous tools available to help in the process of interrogating and understanding them.¹⁵ The tools mentioned below are some of the more popular ones at the time of writing, but new ones emerge regularly, and old ones fall out of fashion.

When working with large amounts of plain text, there is a series of approaches drawn from Corpus Linguistics,¹⁶ which historians are increasingly using. These allow the study of word frequency and context, and they rapidly identify and assess relevant phrases from a large ‘corpus’ or set of texts. The most popular currently is Voyant Tools,¹⁷ which enables users to upload text and analyse it in real time.
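The two basic operations, counting words and showing a phrase in its context, can be sketched in a few lines of Python. This is a deliberately simple illustration and not how Voyant Tools or any other package is implemented: the tokenizer ignores accents, hyphens and historical spelling variation, and the sample sentence is invented rather than quoted from any source.

```python
import re
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Split text into lower-case words (a crude rule; real corpora need more care)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts: Iterable[str]) -> Counter:
    """Count how often each word occurs across a set of texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def kwic(text: str, phrase: str, width: int = 3) -> List[str]:
    """Keyword-in-context: each occurrence of `phrase` with `width` words either side."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return []
    size = len(target)
    middle = " ".join(target)
    hits = []
    for i in range(len(tokens) - size + 1):
        if tokens[i:i + size] == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + size:i + size + width])
            hits.append(f"{left} [{middle}] {right}".strip())
    return hits


# An invented sentence, not a quotation from any eighteenth-century text.
sample = "The moral economy of the poor was set against political economy."
print(word_frequencies([sample])["economy"])  # 2
print(kwic(sample, "moral economy", width=2))  # ['the [moral economy] of the']
```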

In one pioneering example of systematic data searching, the Japanese social historian Kazuhiko Kondo combed through thousands of eighteenth-century publications. He used a combination of digitized resources including EEBO (Early English Books Online),¹⁸ ECCO (Eighteenth-Century Collections Online)¹⁹ and MOMW (the digital economics library entitled Making of the Modern World).²⁰ He was seeking to test a specific query, in two parts. Was the phrase ‘moral economy’ used in publications during this period? (Answer: Yes.) And did it have the polemical meaning attributed to it by the British Marxist historian E. P. Thompson? (Answer: Yes, but only partially.)²¹ Kondo’s findings thus helpfully illuminated a rumbling historical debate. Some eighteenth-century commentators (but definitely not all) did indeed appeal to a ‘moral’ alternative to what they denounced as a worldly and amoral ‘political economy’.

Different researchers choose different combinations of search methodologies, as suitable for each specific task. Overall, meanwhile, it’s a reasonable prediction that Corpus Linguistics will continue to gain in relevance and popularity – and that the technical speed and sophistication of such approaches will continue to grow.

For all those working with structured data – in the form of spreadsheets or as CSV files, or in a database – there are tools like OpenRefine.²² These tools allow researchers to re-organize and analyse their data with great rapidity. And the power of spreadsheet packages like Excel, for the analysis and translation of data, should not be underestimated. Between macros and plug-ins, Word and other word-processing software can also act as powerful text management systems.
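A common first task with structured data is spotting variant spellings of the same name or place. The sketch below is loosely modelled on the 'fingerprint' style of key-based clustering that OpenRefine offers, but it is a simplified stand-in rather than that tool's own code: it ignores accented characters, for instance. The sample rows are invented.

```python
import csv
import io
import string
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def fingerprint(value: str) -> str:
    """Reduce a value to a key: lower-case, no punctuation, unique words sorted."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster_column(rows: Iterable[Dict[str, str]], column: str) -> List[List[str]]:
    """Group the distinct values in `column` that share the same fingerprint."""
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        clusters[fingerprint(row[column])].add(row[column])
    # Only groups with more than one spelling need the researcher's attention.
    return [sorted(values) for values in clusters.values() if len(values) > 1]


# Invented data, for illustration only.
SAMPLE = """\
name,parish
"Smith, John",St Giles
John Smith,St. Giles
Mary Jones,Holborn
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(cluster_column(rows, "name"))    # [['John Smith', 'Smith, John']]
print(cluster_column(rows, "parish"))  # [['St Giles', 'St. Giles']]
```

Whether two variants really refer to the same person or place remains a historical judgement; the code merely suggests candidates for checking.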

Jupyter Notebooks²³ are also increasingly being advocated as a way of managing the analysis of research materials. These allow researchers to combine data with elements of computer code, both to develop and to clean research material; and to run more complex processes, often by adapting code shared by others rather than writing it all themselves. When working with large bodies of structured data, such systems are worth the effort of learning.

But even for those who aren’t comfortable with complex online and computer-based systems, there are still a lot of small ‘cheats’ that will make writing history easier. Simply using Ctrl-F (or Cmd-F on a Mac) in a browser will save readers from monotonous skim-reading when searching through a large text; and many bibliographies support automated citation services. The Bibliography of British and Irish History (BBIH), for instance – a key research tool for all working in the areas of British and Irish history²⁴ – allows users to right-click on any item and generate a citation in any of the most common formats, which in turn can be copied and pasted into a footnote. One of the nice things about working on larger projects is that it is worth the time spent learning new packages, and new tricks. And that’s precisely because researchers are then investing two or three or more years in getting it right.

6.5 Summary: losing data – and finding it again

At some point, historians find themselves staring at a computer screen, and the perfect quotation to support a complex argument will rise to mind. Memory insists that it’s been read somewhere significant. It was on the upper-left-hand side of the screen – or perhaps it was in a large green book, about two-thirds of the way through – but certainly somewhere. Many a long afternoon can be spent trying to locate the quotation. And, when it’s found, it often turns out not to be as perfect as was imperfectly remembered.

Good data management allows researchers to organize their thoughts as they proceed, as well as to organize their writing; and, above all, to think more critically about what the project is actually trying to achieve. With the right tools, researching and writing become infinitely more efficient, and infinitely more fun.

Overall, the system chosen by each researcher does not need to be complex, or technically challenging. It just needs to be suitable for the task in hand – and it will keep those wasted afternoons, looking for that perfect quotation, down to a minimum.

1 Gustave Flaubert (1821–80): ‘Il faut boire des océans et les repisser’, letter to Louise Colet, 8 May 1852, in G. Flaubert, Correspondance, ed. J. Bruneau (Paris, 1972–98), vol. 2, p. 86.
2 See the illustration that opens this chapter.
3 K. Thomas, ‘Diary: working methods’, London Review of Books, xxxii (10 June 2010).
4 See K. Thomas, Religion and the Decline of Magic: Studies in Popular Beliefs in Sixteenth- and Seventeenth-Century England (London, 1971); Man and the Natural World: Changing Attitudes in England, 1500–1800 (London, 1983); The Ends of Life: Roads to Fulfilment in Early Modern England (Oxford, 2009); In Pursuit of Civility: Manners and Civilization in Early Modern England (London, 2018).
5 D. J. Cohen and R. Rosenzweig, Digital History: a Guide to Gathering, Presenting and Preserving the Past on the Web (Philadelphia, Pa., 2006); L. Levenberg, T. Neilson and D. Rheams (eds.), Research Methods for the Digital Humanities (London, 2018); I. Milligan, History in the Age of Abundance: How the Web Is Transforming Historical Research (London, 2019); R. Risam, New Digital Worlds: Postcolonial Digital Humanities in Theory, Practice and Pedagogy (Evanston, Ill., 2019); H. Salmi, What Is Digital History? (Cambridge, 2020); and A. Crymble, Technology and the Historian: Transformations in the Digital Age (Urbana, Ill., 2021).
6 See, e.g., the Old Bailey Proceedings Online: <https://www.oldbaileyonline.org/index.jsp> [accessed 24 January 2022].
7 Grafton, The Footnote; F. A. Burkle-Young and S. R. Maley, The Art of the Footnote: the Intelligent Student’s Guide to the Art and Science of Annotating Texts (London, 1996).
8 Indicating J. M. Beattie, The First English Detectives: the Bow Street Runners and the Policing of London, 1750–1840 (London, 2012).
9 <https://tropy.org> [accessed 30 April 2021].
10 For a detailed discussion of these issues, see T. Hitchcock, ‘Judging a book by its URLs’, <http://historyonics.blogspot.com/search/label/URL> [accessed 22 Dec. 2020].
11 Created 24 Oct. 2001 by a non-profit organization in San Francisco: see <https://archive.org/web> [accessed 29 April 2021].
12 <https://access.clarivate.com/login?app=endnote> [accessed 29 April 2021].
13 <https://enviosystems.com> [accessed 29 April 2021].
14 For Zotero, ‘Your Personal Research Assistant’, see <https://www.zotero.org> [accessed 29 April 2021].
15 For the importance of clarifying methodologies and sampling systems, see K. H. Leetaru, Data Mining Methods for the Content Analyst: an Introduction to the Computational Analysis of Content (New York, 2012); G. Schiuma and D. Carlucci, Big Data in the Arts and Humanities: Theory and Practice (Boca Raton, Fla., 2018); K. Leetaru, ‘Big data revolutions will be sampled: how “big data” has come to mean “small sampled data”’, Forbes Magazine, 17 Feb. 2019: <https://www.forbes.com/sites/kalevleetaru/2019/02/17/the-big-data-revolution-will-be-sampled-how-big-data-has-come-to-mean-small-sampled-data/#27f3e310199e> [accessed 29 April 2021].
16 Corpus Linguistics is defined as the study of linguistics based upon analysis of ‘real-world’ texts in their authentic state.
17 Voyant, ‘See through your text’: <https://voyant-tools.org> [accessed 29 April 2021].
18 <https://eebo.chadwyck.com/home>, moved to new ProQuest website in late August 2019: <https://www.proquest.com> [accessed 29 April 2021].
19 <https://quod.lib.umich.edu/e/ecco> [accessed 29 April 2021].
20 Making of the Modern World is an online collection, based upon the Goldsmiths’-Kress Library of Economic Literature, 1450–1850, expanded to include many other economic documents to 1945: see <https://www.gale.com/intl/primary-sources/the-making-of-the-modern-world> [accessed 29 April 2021].
21 See both K. Kondo, ‘ “Moral Economy” retried in the digital archives’, Bulletin of the Graduate School of Humanities & Sociology, Rissho University [Japan], xxxv (March 2019), 21–36; and original essay by E. P. Thompson, ‘The moral economy of the English crowd in the eighteenth century’ (London, 1971), repr. in E. P. Thompson, Customs in Common (London, 1991), pp. 185–258.
22 OpenRefine offers itself as ‘a powerful tool for working with messy data’: <https://openrefine.org> [accessed 29 April 2021].
23 <https://jupyter.org> [accessed 29 April 2021].
24 Hosted by University of London’s Institute of Historical Research: <https://www.history.ac.uk/publications/bibliography-british-and-irish-history> [accessed 29 April 2021].

PART II Writing, analysing, interpreting
