Reframing Failure in Digital Scholarship

Chapter 11 Challenging the pipeline structure: a reflection on the organisational flow of interdisciplinary projects

Caio Mello

The use of new terminology in research projects, especially interdisciplinary ones, is expected and sometimes desirable. It helps scholars develop better terms for new phenomena, processes and methods, and improves communication across disciplines. This is the case with an expression that has become increasingly common in the field of digital humanities: ‘pipelines’.

A pipeline is ‘a very long large tube, often underground, through which liquid or gas can flow for long distances’ (‘Pipeline’, n.d.). When used in the context of digital humanities (DH), however, the term is a metaphor borrowed from data science. In computing, a ‘pipeline’ describes the process of transforming data from acquisition to output – the expected outcome. In this sense, data is thought of as a fluid that flows through ‘pipes’, the different stages of the process, such as data acquisition, filtering, cleaning and visualisation.
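To make the metaphor concrete, a minimal Python sketch of such a linear flow might look like the following. The stages mirror those named above; the stub functions and the sample records are purely illustrative, not a real DH workflow.

```python
# A minimal sketch of the linear 'pipeline' metaphor: each stage is a
# function, and the data 'flows' through them in a fixed order.
# All stages are illustrative stubs.

def acquire():
    # stand-in for data acquisition (e.g. an API or archive export)
    return ["  The Queen visited LONDON. ", "", "MPs debated the bill."]

def clean(records):
    # remove whitespace noise and drop empty records
    return [r.strip() for r in records if r.strip()]

def filter_records(records):
    # keep only records long enough to analyse
    return [r for r in records if len(r.split()) > 1]

def visualise(records):
    # stand-in for the output stage: here, just print a summary
    for r in records:
        print(f"{len(r.split()):>2} words | {r}")

# The defining feature: one direction, one stage after another.
visualise(filter_records(clean(acquire())))
```

The composition reads in one direction only; nothing in this structure invites a return to an earlier stage, which is precisely the property this chapter goes on to question.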

When designing research projects in DH, which very often involve a ‘data pipeline’, it becomes useful for scholars to think of the project’s structure as a pipeline itself. This seems a natural and easy way of dividing the work of multiple collaborators: one member focuses on data acquisition and data management, for example, while others focus on data analysis. When we think of a truly interdisciplinary project, however, things become even more complex. This chapter questions whether ‘pipelines’ are the most suitable way of organising an interdisciplinary project in DH, and why a pipeline model of project organisation can fail.

Figure 11.1: Example of a data pipeline, represented as five linear steps, one after the other: data input; stage one, cleaning data; stage two, filtering data; stage three, reformatting data; and data output. © Caio Mello.

Being truly collaborative, and truly interdisciplinary

Projects that bring together computer scientists and humanists have become increasingly frequent and are encouraged by funding agencies. In this scenario, a common challenge is to define how each project member can contribute to the project’s aims by taking part in a specific stage of the pipeline. An easy, and problematic, solution to this challenge is simply to rely on computer scientists for the data collection and processing steps, while humanists contribute to the data analysis by interpreting the outputs. This form of organisation, however, raises some issues.

Van Atteveldt, Trilling and Calderón (2022) present the concept of a pipeline using an example in which a researcher wants to test a hypothesis about ‘personalisation in the news’. For this task, the authors design a pipeline that involves collecting, or scraping, news articles published online, processing the data using Named Entity Recognition (NER), and verifying to what extent the people mentioned in the texts influence the news stories. The authors then divide the pipeline into two parts: one that they believe is of interest, the final stage of verifying the hypothesis, and another that they classify as ‘necessary but not inherently interesting’, the use of NER. In this case, although it is important to consider the potential biases involved in the use of Natural Language Processing techniques, they take the view that, since their work is not concerned with studying NER per se, it should be treated simply as a tool for answering their questions.
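As an illustration of that ‘necessary but not inherently interesting’ step, a minimal sketch of the NER stage might look like the following, here using spaCy, one common NLP library. The sample headline is invented, and the model choice is an assumption rather than the cited authors’ own setup.

```python
# A minimal sketch of the NER step described above, using spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
# The headline is an invented example, not data from the cited study.
import spacy

nlp = spacy.load("en_core_web_sm")

headline = "Angela Merkel met Emmanuel Macron in Brussels on Tuesday."
doc = nlp(headline)

# Extract the people mentioned: the signal a study of
# 'personalisation in the news' would aggregate across many articles.
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(people)  # e.g. ['Angela Merkel', 'Emmanuel Macron']
```

Treated as a black box, even a snippet like this embeds consequential choices – which model, which language, which entity labels – that an interpretative team inherits without necessarily seeing them.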

This approach to the use of digital methods can be easily transferred to the design of a research project, resulting in a series of problems. Interpreting the output of a tool that researchers do not deeply understand becomes, in a project pipeline, equivalent to receiving data processed by another person or group and expecting to retrieve meaningful insights from the interpretative work alone. As pointed out by Viola (2023, 59), from data collection to data processing, the curatorial work is paramount, as each decision ‘triggers different chain reactions and will therefore output different versions of the material’. Dividing researchers between a more ‘technical’ and a more ‘interpretative’ workgroup not only causes inefficiency in the project but can also introduce errors.

Fry (2007, 6) refers to the division of data work into separate segments as a ‘telephone game’, in which something is lost at every stage. Fry’s work focuses in particular on the work of designers who represent data: if they are not involved in the early stages, they cannot produce a visualisation that accurately responds to the question the project initially set out to answer.

Being truly collaborative and truly interdisciplinary is challenging. It requires dedicated time for scholars to reflect on how they can work together, translate specialist terminology for colleagues from different backgrounds and agree on common terms for the group’s work. Unfortunately, the time, resources and knowledge needed to conduct this reflection are often scarce.

Ahnert and others (2023, 17) discuss how researchers created space for these reflections on the project ‘Living With Machines’. They considered it important to give researchers the opportunity to provide input at every stage of the project, in pursuit of a common historical research question, while also leaving them space to produce outputs that would benefit their own careers in different fields of research. The authors highlight the importance of ‘knowing what your data looks like, and understanding how it will be manipulated – what will be lost or added, where ambiguity may harden into certainty – before it is analysed in a specific research context’ (Ahnert et al. 2023, 44).

Some of the issues related to the separation of work across a pipeline might not be directly caused by the pipeline structure itself, but I argue that the linearity of that structure reinforces the difficulties encountered when seeking to build a more collaborative environment.

A critique of ‘pipelines’ as a representation of research planning

One way of looking at pipelines as a figurative representation of processes is their use to illustrate the steps of data processing, as discussed above. Here, for example, developers present a pipeline to explain how a certain tool works, allowing the audience to identify each of the steps that lead to the results. When this logic is applied in advance, to design a research project, the pipeline is no longer a retrospective representation of results but a representation of the project’s structural planning.

The challenge, then, is to avoid interpreting the project’s pipeline through the logic of Taylorism, the principles of labour organisation developed by Frederick Taylor in 1911 and based on the segmentation of the workforce (‘Taylorism’ n.d.). Among the consequences of having researchers work along a linear pipeline is the risk that those doing the interpretative work project their own beliefs and expectations onto data outputs generated by another person or workgroup, especially when the data processing is a black box.

Moreover, it can create obstacles to dealing with failure efficiently. Dealing with failure very often requires knowledge of the data, the context in which it was produced, the lenses through which it has been observed and the research questions that motivate it. Beyond identifying failure, it is necessary to have the means to address it. This is where a rigid linear pipeline may cause difficulties, as it does not necessarily allow the time, resources or possibility of returning to previous steps, multiple times, to refine and correct them.

When thinking of a more collaborative project design, we can think beyond linearity and emphasise the necessity of integration. Taking as an example a data pipeline such as the one proposed by Van Atteveldt, Trilling and Calderón (2022), social scientists and humanists should be expected to participate in the design of the data collection and data processing steps, including the work of interrogating the tool, in this case NER. I also see the participation of data scientists in the interpretative work as desirable, since it can help validate whether or not the output corroborates the formulated conclusions.

Beyond that, thinking of non-linearity also means expecting to re-collect and re-process data more than once, or even several times. Understanding that processes do not need to occur one after the other – and that they need to be integrated – implies a different way of understanding the role that each scholar plays in the project. The participation of the entire team in the early stages of decision-making is crucial, as it raises awareness of how results are affected by each stage of the process.
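One way to picture this non-linearity in code is a loop in which a failed validation sends the work back to an earlier stage, rather than forward to interpretation. The sketch below is hypothetical: the functions are toy stand-ins, and the validation rule simply represents a check agreed by the whole team.

```python
# A sketch of a non-linear flow: collection and processing are
# revisited until a joint validation step, agreed by the whole team,
# is satisfied. All functions are illustrative stubs.

def collect(min_words):
    # hypothetical corpus; min_words stands in for a collection decision
    corpus = ["short", "a longer news article", "another full news story"]
    return [t for t in corpus if len(t.split()) >= min_words]

def process(texts):
    # stand-in for a processing step such as NER
    return [t.upper() for t in texts]

def validate(outputs):
    # joint check by 'technical' and 'interpretative' collaborators:
    # do the outputs still speak to the research question?
    return len(outputs) >= 2

min_words = 5
outputs = process(collect(min_words))
while not validate(outputs):
    # failure sends the work back to an earlier stage instead of
    # forcing an interpretation of inadequate output
    min_words -= 1
    outputs = process(collect(min_words))
print(outputs)
```

The point of the loop is organisational rather than computational: the structure budgets for returning to earlier stages, instead of treating each stage’s output as final.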

Conclusion: distorting the pipes

If fluids in a linear pipeline flow in just one direction, what this chapter proposes amounts to distorting the pipes. The idea is to create space, within the project planning, for a collective construction that seeks less linearity and more of an integrated system. In the pipe metaphor, this space is a call for more time to be dedicated to collective planning, which can hopefully mean the construction of better tools for dealing with failure.

The pipeline model of project organisation increases opportunities for failure because individuals do not have the necessary knowledge of the other processes, the data and the choices made by collaborators. Moreover, this model has the potential to make addressing failure more difficult because it discourages iterative working between sections of the pipeline.

Truly collaborative and interdisciplinary projects are challenging: they require the scholars involved to devote time and resources to reflecting on how to make them work. A critical look at how pipelines have been used, and at their potential to confine researchers within rigid processes, is therefore a useful exercise for producing more efficient project planning that will lead to better results.

References

  • Ahnert, Ruth, Emma Griffin, Mia Ridge and Giorgia Tolfo. Collaborative Historical Research in the Age of Big Data: Lessons from an Interdisciplinary Project. Cambridge University Press, 2023. https://doi.org/10.1017/9781009175548.
  • Fry, Ben. Visualizing Data: Exploring and Explaining Data with the Processing Environment. O’Reilly Media, 2007.
  • ‘Pipeline’. n.d. In Cambridge Dictionary. Accessed 20 June 2024. https://dictionary.cambridge.org/dictionary/english-portuguese/pipeline.
  • ‘Taylorism’. n.d. In Munich Business School Dictionary. Accessed 20 June 2024. https://www.munich-business-school.de/en/l/business-studies-dictionary/taylorism.
  • Van Atteveldt, Wouter, Damian Trilling and Carlos Arcila Calderón. Computational Analysis of Communication. John Wiley & Sons, 2022.
  • Viola, Lorella. The Humanities in the Digital: Beyond Critical Digital Humanities. Springer Nature, 2023.

© the Authors 2025