Boy Meets Girl Meets Natural Language Processing: Binary Text Classification of RomCom Movies Using Movie Synopses
About the Project
One of the foundational scholarly projects of the digital humanities has been to leverage computing technologies to develop critical tools, infrastructures, and methodological pipelines that enable new forms of inquiry within the humanities. The increasing convergence between natural language processing (NLP) and the digital humanities reflects a shift on the NLP side: beyond newspaper and newswire texts, the field has moved toward biomedical corpora, forum posts, and social media data, and computational linguists have increasingly turned their attention to historical texts and other textual forms of interest to the humanities and social sciences.
Developed for the Computational Linguistics course taught by Tim van de Cruys as part of the Advanced Master’s in Digital Humanities at KU Leuven (2025-26), this project engages with one such body of textual data: the metadata produced by the social film platform Letterboxd. Its large-scale collection of movie metadata, including synopses, tags, and user-generated reviews, constitutes a rich corpus for computational text analysis. From a digital humanities perspective, Letterboxd offers a valuable site for examining film genres, audience reception, fan discourses, and curatorial practices such as review-writing in contemporary digital culture.
The scope of this project is to perform binary text classification of movie genres based on plot synopses of films released in the twenty-first century (2000–2025). Text classification can be defined as a class of supervised machine-learning methods that assign predefined categories (two in the case of binary classification) to textual data using computational classifiers. Within both computational linguistics and digital humanities, text classification is a well-established methodological approach.
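To make the definition concrete, the following is a minimal sketch of a supervised binary text classifier: a bag-of-words multinomial Naive Bayes model with add-one smoothing, written in plain Python. It is only an illustration and not necessarily the model used in this project. The toy synopses, the label names `romcom` and `other`, and the function names `tokenize`, `train`, and `predict` are all invented for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability under the model."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokenize(text):
            if token in vocab:  # tokens never seen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data, not drawn from Letterboxd.
TOY_EXAMPLES = [
    ("two strangers fall in love after a disastrous wedding", "romcom"),
    ("a clumsy baker falls in love with her rival", "romcom"),
    ("a detective hunts a killer through the city", "other"),
    ("soldiers defend a remote outpost from invasion", "other"),
]

model = train(TOY_EXAMPLES)
print(predict(model, "she falls in love at a wedding"))
print(predict(model, "a detective hunts a killer"))
```

The two predefined categories are fixed in advance by the training labels, and the classifier only ever chooses between them, which is what distinguishes classification from unsupervised approaches such as clustering.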
While genre classification has been extensively explored in NLP, most existing studies focus on multi-label genre prediction or broad genre taxonomies, often using film scripts or large plot datasets. A smaller body of work has examined movie genre classification using plot summaries specifically (e.g., Blackstock and Spitz 2008; Abimbola 2020; Kumar et al. 2022), yet these studies tend to treat genre primarily as a label to be predicted, rarely engaging with genre as a culturally constructed category that also includes subgenres such as romantic comedy, horror-comedy, and documentary-drama.
This project focuses on romantic comedy for binary text classification using statistical and neural paradigms in machine learning. Despite its highly commercialized, conventional narrative structures and its canonical significance in popular cinema, this subgenre has, to the best of my knowledge, not received focused attention as a standalone classification task. This project isolates the romantic-comedy (rom-com) genre and frames genre detection as a binary classification task, distinguishing rom-coms from non-rom-com films using synopses of movies released from 2000 to 2025 in the Letterboxd dataset.
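One way such a binary framing might be set up is sketched below. The labeling rule here is an assumption made for illustration: a film counts as a rom-com when its genre list contains both "Romance" and "Comedy". The record fields (`title`, `year`, `genres`, `synopsis`), the function names, and the sample films are likewise hypothetical and do not reflect the actual schema of the Letterboxd dataset.

```python
def is_romcom(genres):
    """Assumed rule: a film is a rom-com if tagged with both Romance and Comedy."""
    normalized = {genre.strip().lower() for genre in genres}
    return {"romance", "comedy"} <= normalized


def build_dataset(films, start_year=2000, end_year=2025):
    """Return (synopsis, label) pairs, with label 1 for rom-com and 0 otherwise.

    Films outside the year range or without a synopsis are skipped.
    """
    rows = []
    for film in films:
        synopsis = (film.get("synopsis") or "").strip()
        if not synopsis:
            continue
        if not (start_year <= film["year"] <= end_year):
            continue
        rows.append((synopsis, 1 if is_romcom(film["genres"]) else 0))
    return rows


# Invented sample records with a hypothetical schema.
SAMPLE_FILMS = [
    {"title": "Film A", "year": 2011, "genres": ["Romance", "Comedy"],
     "synopsis": "Two rivals fall for each other."},
    {"title": "Film B", "year": 1998, "genres": ["Comedy", "Romance"],
     "synopsis": "A bookshop owner meets a stranger online."},
    {"title": "Film C", "year": 2015, "genres": ["Thriller"],
     "synopsis": "A courier uncovers a conspiracy."},
    {"title": "Film D", "year": 2020, "genres": ["Romance", "Comedy"],
     "synopsis": ""},
]

dataset = build_dataset(SAMPLE_FILMS)
print(dataset)
```

Under this assumed rule, every film that is not tagged with both genres falls into the negative class, so the non-rom-com class is far more heterogeneous than the positive one, which is worth keeping in mind when interpreting classifier performance.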