Academic Research Data Analysis

All the Metadata’s Stage: Text Mining and Network Analysis of 17th–19th Century British Comedy Titles

November 30, 2025 · 3 min read

About the Project

Literary fiction, as Irish novelist Sydney Owenson observed in A National Tale (1814), often reflects the “mirror of the times” in which it is produced, capturing the morals, customs, peculiarity of character, and prevalence of opinion. The field of literary history has long sought to make this mirror visible by historicizing the relationship between literary works and the broader socioeconomic forces shaping their production.

Moving beyond close readings of “canonical” texts, literary historians such as Franco Moretti have contributed significantly to digital humanities methodologies by demonstrating that this relationship can also be examined at scale through distant reading. Distant reading allows texts to be analysed across genres, systems, and large textual corpora, shifting attention from individual works to patterns that emerge across them.

While Moretti emphasizes the usefulness of digital databases for large-scale analysis, he also cautions that working with large amounts of data differs fundamentally from traditional literary interpretation: “texts are designed to ‘speak’ to us… but archives are not messages that were meant to address us, and so they say absolutely nothing until one asks the right question” (Moretti 2013, 125). Bibliographic metadata constitutes precisely this kind of archive—a structured but non-expressive body of information that does not “speak” on its own, but becomes analytically productive when approached with clearly formulated research questions. Rather than offering narrative meaning, metadata records the material conditions of literary production, including authorship, genre, publication dates, places of printing, and paratextual elements such as titles and editions. Examining these features at scale enables a distant reading of literary forms as historical artifacts and systems, rather than as isolated works.

In this light, the present project adopts a distant reading approach through exploratory data analysis of the British Library’s bibliographic dataset of digitised British drama from the 17th to 19th centuries. It employs Python (using the pandas library for data analysis and plotly for data visualisation), Gephi for network analysis, and the Wikidata API for data enrichment. The study begins by mapping broad patterns in the dataset before focusing on the comedy genre in particular.

This study was developed as part of the Introduction to Digital Humanities course taught by Margherita Fantoli within the MSc Advanced Masters of Digital Humanities (2025-26) at KU Leuven. The first part of the project involved the creation of a collaborative data wrangling and cleaning pipeline using OpenRefine [available here].

In the second part, the study proceeds in four stages. First, descriptive statistical methods and visualizations are applied to establish a broad overview of the dataset for both the researcher and the reader. Second, the exploratory data analysis focuses on the comedy genre to examine temporal trends in publication, spatial networks of publishing locations, and the distribution of authorship. Third, natural language processing is used to examine lexical patterns in comedy titles. Finally, the network analysis constructs a syntactic dependency network of frequent lemmas in comedy titles to address a central question prompted by these exploratory observations: How do 17th–19th century British comedy titles encode gendered roles and archetypes, and what do they reveal about the social expectations of the period?
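The third and fourth stages can be illustrated with a minimal sketch. It is not the project's pipeline: it uses only the Python standard library, replaces lemmatisation and syntactic dependency parsing with plain lowercased tokens and within-title co-occurrence, and runs on four well-known comedy titles chosen as stand-in sample data rather than on the British Library dataset. The stopword list and all function names are assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations

# Stand-in sample titles (illustrative only, not taken from the project's dataset).
TITLES = [
    "The Careless Husband: A Comedy",
    "The Provoked Wife: A Comedy",
    "The Jealous Wife: A Comedy",
    "The Country Wife",
]

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "of", "or", "and", "in", "to", "comedy"}


def tokenize(title):
    """Lowercase a title and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", title.lower()) if t not in STOPWORDS]


def term_frequencies(titles):
    """Count the number of titles in which each token appears."""
    counts = Counter()
    for title in titles:
        counts.update(set(tokenize(title)))
    return counts


def cooccurrence_edges(titles):
    """Count unordered token pairs that appear together in the same title."""
    edges = Counter()
    for title in titles:
        tokens = sorted(set(tokenize(title)))
        edges.update(combinations(tokens, 2))
    return edges


def edges_to_csv(edges):
    """Serialise edges as CSV text with Source, Target and Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


if __name__ == "__main__":
    print(term_frequencies(TITLES).most_common(3))
    print(edges_to_csv(cooccurrence_edges(TITLES)))
```

The resulting CSV text uses the Source, Target and Weight column headers that Gephi's spreadsheet importer recognises for edge tables. Even in this toy sample, "wife" becomes the node shared by several modifiers, which is the kind of pattern the gender-focused research question sets out to examine at scale.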

GitHub Repository

GitHub - All the Metadata’s A Stage: Exploratory Bibliographic and Text Network Analysis of British Comedy Dramas Across 17th-19th Centuries Using OpenRefine, Python & Gephi