2025

The past decade witnessed a computational turn in the social sciences. When Lazer et al. (2009) discussed the potential of computational methods in social science in their well-known article in Science, they observed little evidence of computational social science in leading social science journals. Today, computational methods have become a standard tool within the repertoire of social scientists. Numerous computational works are published in social science outlets every year, a significant number of research centres have been founded, and computational methods are routinely incorporated into university curricula across the world.

It is important to clarify that, by computational social science, I do not mean simply using computers for social science work. The use of computers is not new in the social sciences. As early as the 1970s, Rosenstone and Wolfinger (1978), for example, were using mainframe computers to estimate statistical models (cf. Alvarez 2016). Since the 1980s, statistical packages such as SPSS and Stata have allowed social scientists to run statistical tests on personal computers. They have now become standard applications at universities around the world. What is new is the data we use and the approaches we employ. In Bit by Bit: Social Research in the Digital Age, Salganik (2018) succinctly summarizes the social context behind the rise of computational social science. On one hand, we witnessed the so-called rise of “big data” as a rapidly increasing amount of digital information has become available, most notably online data such as social media posts, websites, and search engine logs. On the other hand, we have also experienced a rapid growth in access to computing power, allowing researchers to run large-scale estimations that would have been impossible in the past. However, “big data” is not simply about being “big”. In the past, most of the data we used were generated for the purpose of doing research, such as surveys and interviews. For computational social science, the data we use, more often than not, are created for purposes other than research, such as generating profit and providing services, but are repurposed for research.

In addition, as Gary King (2016) has noted, “big data” is not about the “data” either. It is about the methodological innovations provoked by the volume and form of data. Similar to the repurposing of data, many tools used in computational social science are also not designed for research but repurposed for research, mostly from the field of computer science. Take API (Application Programming Interface) as an example. The main purpose of API is to provide an interface for computer programmes to interoperate, yet it has been used by researchers to harvest data from social media platforms (Jünger 2021). In this sense, computational social science is best defined as the approach that uses tools and techniques from computer science to conduct social science research.

From Trump’s election campaign on X (Twitter) to the mobilizations of independence movements on Facebook, many nationalist phenomena have developed an important digital aspect. While the digitalisation of nationalism has provided nationalism researchers with an unprecedented amount of digital information, the massive volume of these data often makes many traditional research methods infeasible. Against this background, computational methods offer viable options to analyse a large volume of data in an efficient way. In this article, I will offer a brief summary of works that employed computational social science methods for the study of nationalism and give an account of what computational social science methods can contribute to nationalism studies. I will end this article with a discussion of the best practices for computational nationalism research.

Why computational methods?

It is perhaps an exaggeration to claim that “[n]ationalism studies does not seem to be a very innovative field of research” (Storm 2018, p. 113), but the field indeed does not seem to be the cradle of new methodologies. This is not to suggest that nationalism researchers are less innovative; it has more to do with the concerns of the field. For a large body of (especially early) literature, the goal is to address the question of “when is the nation”, and historical materials are obvious data sources. The same applies to the study of concepts such as memory and ethnie, which require researchers to look deep into history. In many of these cases, qualitative and analytical document analyses are best suited to the task. However, this does not mean computational methods have no place in historical studies. With the ever-growing volume of digitized historical documents, many computational studies are able to bring a historical lens to their research (cf. Marjanen et al. 2019).

Recently, the field of nationalism studies has undergone rapid development; as Bochsler et al. (2021) point out, recent years have witnessed a notable increase in the use of quantitative techniques. Since its introduction, the so-called “Moreno question” (Moreno 2006) has become a standard tool for the quantitative measurement of national identity. Nationally representative surveys, such as the General Social Survey (GSS) and Eurobarometer, regularly include question items to measure national identity and national pride. Bonikowski and DiMaggio (2016), for instance, use latent class analysis on a broad set of measures from the GSS to investigate the varieties of nationalist attitudes in the US. It is also important to point out that the use of quantitative (and computational) techniques does not necessarily replace qualitative work. For example, Nelson (2000) incorporates ethnography and surveys to study consumer nationalism in Korea. Ho (2022) introduces a novel framework that combines topic modelling, a computational technique (discussed further below), and discourse analysis to study Hong Kong nationalism.

The classic literature of nationalism studies has long recognised legacy media such as newspapers and television as crucial factors in inculcating and sustaining nationalism (Anderson 1991; Billig 1995; cf. Skey 2022), but the recent success of right-wing nationalists across the world, most notably the victory of Donald Trump in the US presidential election and the Brexit campaign in the United Kingdom, and their strong online presence have prompted researchers to focus on the role of digital media in nationalism. For example, Mihelj and Jiménez-Martínez (2021) explore three key mechanisms of the internet that sustain nationalism: “the architecture of the internet domain name system, the bias of algorithms embedded in digital platforms and the formation of national digital ecosystems” (p. 335). Schneider (2018) explores how ICTs shape Chinese digital nationalism through search engine filtering, algorithmic bias on digital platforms, and the affordance for constructing historiography.

While it would be an exaggeration to proclaim a “digital turn” in nationalism studies, nationalism scholars are starting to incorporate digital media and digital data into their toolkit. With the advent of the internet, many new sources of naturally occurring data have become available for researchers. These include not only digital and digitised texts such as news articles, party manifestos, and political speeches, but also digital traces on social media. The richness of these data allows researchers to unpack the complex, nuanced, and multi-layered nature of nationalism (Bochsler et al. 2021), but they also pose new challenges, as their massive volume has made many traditional research methods inapplicable and unmanageable (Benoit 2020). In response, some pioneering work has adopted computational techniques to gain insight from these data sources. In the following section, I will provide an overview of the state of this strand of work.

Current State of Computational Nationalism Studies

Keyword and Lexicon-based Approaches

In response to the proliferation of digital text, a significant strand of research has emerged using computational text analysis (also referred to as text as data, text mining, and automated content analysis). At its essence, computational text analysis involves counting words and word co-occurrences. In its simplest form, researchers can use computer programs to count the occurrences of specific keywords that are assumed to reflect certain elements of nationalism. This procedure is similar to content analysis, hence the name “automated content analysis”. Computational text analysis has also been employed to address questions related to nationalism. For instance, Fong (2017) uses a set of keywords to measure the appearance of “defending Hong Kong” discourse in digitized news articles. Bonacchi (2022) employs a set of keywords to capture expressions pertaining to the Iron Age, Roman, and medieval periods of Britain and Europe. Ho (2023) uses a lexicon of country names extracted from the Common Locale Data Repository Project to capture mentions of foreign countries on Facebook.
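To make the keyword-counting procedure concrete, the following is a minimal Python sketch. The documents and lexicon are hypothetical, and real studies would use a more careful tokenizer and far larger corpora:

```python
import re
from collections import Counter

def keyword_counts(documents, keywords):
    """Count occurrences of each lexicon keyword across a list of documents."""
    counts = Counter()
    for doc in documents:
        # Naive tokenizer: lowercase and split into alphabetic tokens
        tokens = re.findall(r"[a-z']+", doc.lower())
        for token in tokens:
            if token in keywords:
                counts[token] += 1
    return counts

# Hypothetical mini-corpus and lexicon (not from any cited study)
docs = [
    "Defending Hong Kong is our duty.",
    "The nation must defend its heritage.",
]
lexicon = {"defending", "nation", "heritage"}
print(keyword_counts(docs, lexicon))  # Counter({'defending': 1, 'nation': 1, 'heritage': 1})
```

The raw counts can then be aggregated per document, per outlet, or per time period, depending on the research design.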

One common use of lexicons in computational text analysis is sentiment analysis. Sentiment analysis involves a range of techniques to measure people’s opinions, sentiments, and emotions toward entities (Liu 2020). Originally developed for categorizing online product reviews, it was quickly applied to other research questions, such as detecting signs of mental illness and gauging public opinion toward policies and politicians. A popular way to conduct sentiment analysis is to employ “off-the-shelf” dictionaries that capture words associated with different sentiments and emotions (Chan et al. 2021). For instance, in their study on vaccine nationalism, Chester and Shih (2023) applied dictionary-based sentiment analysis to identify the negative portrayal of Western vaccines by Chinese state-owned media. Drawing on a corpus of periodicals published between 1711 and 1822, Koncar et al. (2020) use sentiment lexicons to identify critical discussions of foreign societies, reflecting the emergence of nationalism during the 18th century.
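The basic logic of dictionary-based sentiment analysis can be sketched as follows. The toy lexicons here are invented for illustration; published studies rely on validated dictionaries such as those reviewed in Chan et al. (2021):

```python
def dictionary_sentiment(text, positive, negative):
    """Net sentiment score: (positive hits - negative hits) / total tokens."""
    tokens = text.lower().split()
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return (pos - neg) / len(tokens) if tokens else 0.0

# Toy lexicons, NOT a real sentiment dictionary
POSITIVE = {"safe", "effective", "trusted"}
NEGATIVE = {"dangerous", "failed", "risky"}

print(dictionary_sentiment("the vaccine is safe and effective", POSITIVE, NEGATIVE))
```

Scores above zero indicate a net positive tone; negation, sarcasm, and domain-specific vocabulary are well-known failure modes of this simple approach.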

Topic Modelling

Topic modelling is a technique to investigate the latent thematic structure of a given text corpus. For Latent Dirichlet Allocation (LDA) topic models, the most widely used algorithm, a model typically returns a distribution of words over topics and a distribution of topics over documents (Blei 2012). LDA is considered a mixed membership model (in contrast to single membership models), where multiple topics are present in each document, and multiple documents share the same topic.

Mathematically, a topic is a list of words with varying probabilities of appearing in a document of that topic. Words can be shared across topics, allowing the model to capture polysemic meanings (e.g., “bank” in “river bank” versus “investment bank”). In practice, the meaning of a topic must be assessed empirically and interpreted against substantive theoretical concepts.

A key advantage of topic modelling is that topics are estimated from a text corpus without prior input, making it a quick and efficient tool for exploring the thematic structure of large collections of text (Maier et al. 2018). Topic modelling has become a popular tool in the social sciences and has been widely used to understand nationalism and its relationships with other theoretical concepts. For instance, Lieberman and Miller (2021) use a topic model to analyze the prevalence of ethnic-based comments on online news sites in Nigeria and South Africa. Ho (2022) uses Latent Dirichlet Allocation to discover recurring themes about Hong Kong nationalism in a corpus of Facebook posts. Wang and Luo (2023) use topic modelling to identify posts related to Chinese nationalism and idol fandom in a public Weibo dataset consisting of Covid-19-related posts. Luschnat-Ziegler (2022) uses structural topic models (Roberts et al. 2019), a variant of topic modelling that incorporates document metadata, to investigate changing trends in the narratives of the Ukrainian Institute of National Memory. Kelling and Monroe (2023) train a topic model to identify words associated with nationalism in community reactions to refugees on Facebook. Corsi (2021) analyzes climate change discussions on 4chan, a platform known for widespread hate speech, finding that discussions on race and nationalism are gradually replacing scientific discussions.
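The interpretive step described above, reading off each topic's highest-probability words and labelling the topic against theory, can be sketched as follows. The topic-word probabilities here are invented; in practice they would come from a fitted model such as gensim's LdaModel:

```python
# Hypothetical topic-word distributions, standing in for real LDA output
topic_word_probs = {
    0: {"election": 0.12, "vote": 0.10, "party": 0.08, "nation": 0.02},
    1: {"border": 0.11, "identity": 0.09, "nation": 0.07, "vote": 0.01},
}

def top_words(topic_word_probs, topic_id, n=3):
    """Return the n highest-probability words for a topic, the usual
    starting point for substantive topic labelling."""
    dist = topic_word_probs[topic_id]
    return sorted(dist, key=dist.get, reverse=True)[:n]

print(top_words(topic_word_probs, 0))  # ['election', 'vote', 'party']
```

Note that "nation" appears in both topics with different probabilities, illustrating how the mixed-membership structure lets words be shared across topics.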

Word Embeddings

Word embedding models represent words as numerical vectors in a shared vector space, where the distances between vectors convey the semantic similarity of the words they represent. Models based on word embeddings have achieved great success in a wide range of natural language processing tasks, including part-of-speech tagging, named-entity recognition, and sentiment analysis (Rodriguez & Spirling 2022). Due to their superior performance and versatility, these models have been applied in various social science studies, including nationalism research. Timmermans et al. (2022) use word embeddings to explore the semantic shifts of “nation” (“natie”), “people” (“volk”), and “fatherland” (“vaderland”) in Dutch from 1700 to 1880. Sorato et al. (2024) use word embedding-based methods to quantify social stereotyping of immigrants and refugees in Denmark, the Netherlands, Spain, and the United Kingdom between 1997 and 2018. Marjanen et al. (2019) explore the emergence of ideologies, including nationalism, in Swedish and Finnish newspapers in the 19th century.
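The core intuition, that semantic similarity corresponds to geometric closeness, can be illustrated with cosine similarity over toy vectors. Real embeddings (e.g. word2vec) have hundreds of dimensions learned from large corpora; the three-dimensional vectors below are made up for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 3-dimensional "embeddings"
vec = {
    "nation":     [0.9, 0.8, 0.1],
    "fatherland": [0.8, 0.9, 0.2],
    "banana":     [0.1, 0.0, 0.9],
}
print(cosine_similarity(vec["nation"], vec["fatherland"]))  # high similarity
print(cosine_similarity(vec["nation"], vec["banana"]))      # low similarity
```

Diachronic studies such as Timmermans et al. (2022) compare such similarities across models trained on different time periods to trace semantic shifts.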

Machine Learning and Artificial Intelligence

Recent advancements in machine learning and artificial intelligence have enabled the automatic identification of pre-defined content categories. These approaches often require annotated data as exemplars in model training. Annotations are typically done by researchers coding a subset of data, though in some cases, researchers can leverage annotated text compiled by international data curation teams (Ho & Chan 2023). For example, Chen et al. (2019) use supervised machine learning algorithms to identify issue categories related to Chinese nationalism on Weibo. Bonikowski et al. (2022) employ a novel approach combining RoBERTa, a large language model, with hybrid active learning to iteratively refine model training by annotating challenging cases. This process successfully identified six frames related to nationalism, populism, and authoritarianism with over 95% accuracy. Similarly, Bastos (2024) trained a machine learning classifier to predict nationalist sentiment in tweets during the post-Brexit referendum period in the UK.
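The supervised workflow, training on annotated exemplars and then predicting labels for unseen text, can be sketched with a minimal Naive Bayes classifier. This is a deliberately simple stand-in for the far more capable models cited above (e.g. RoBERTa), and the annotated examples are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict_nb(model, tokens):
    """Pick the label with the highest (log) posterior probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            # Laplace smoothing avoids zero probabilities for unseen words
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical hand-annotated training data (real studies use far larger sets)
train = [
    (["our", "nation", "first"], "nationalist"),
    (["proud", "of", "our", "nation"], "nationalist"),
    (["weather", "is", "nice"], "other"),
    (["traffic", "report", "today"], "other"),
]
model = train_nb(train)
print(predict_nb(model, ["nation", "first"]))  # nationalist
```

The same train-then-predict logic underlies modern classifiers; what changes is the representation of text and the model family.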

Social Network Analysis

Social network analysis is a widely used method in computational social science, focusing on relationships between entities and the structures of these relationships within a system (Scott 2012). Entities in a network can be defined in various ways. For instance, Kashpur et al. (2020) use Russian nationalists’ personal profiles on VKontakte as nodes and friendships as links to study the network structure of Russian nationalism. They identify structural changes in response to the ban of an extreme-right nationalist online group and track the diffusion of extreme-right nationalist ideology into other areas of public discourse. Phua et al. (2020) construct identity networks of Singaporean nationalism, using identity markers as nodes. They demonstrate differences in network coherence between individuals born before and after Singapore’s independence and estimate the effects of changes in influential markers on overall network structure.
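As a minimal illustration of the nodes-and-links representation, the sketch below computes degree centrality (the number of ties per node) over a hypothetical friendship network; libraries such as networkx provide this and many richer measures:

```python
from collections import defaultdict

def degree_centrality(edges):
    """Count the number of ties incident to each node in an undirected network."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

# Hypothetical friendship edges between profiles
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
print(degree_centrality(edges))  # 'ann' is the most connected node
```

Nodes with high centrality are candidates for influential actors (or, in identity networks, influential markers) in the system.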

Semantic network analysis, a strand of text-based network analysis, uses concepts as nodes and paired associations as edges. For example, Yang and Veil (2017) use keyword co-occurrences to construct a semantic network and study how corporations employ nationalism to divert media attention during crises. Similarly, Chen et al. (2019) construct issue networks from co-occurrences of issues identified through supervised machine learning. Adopting a related approach, Ho (2022) uses topic correlations from a topic model to construct a topic network. Centrality measures from social network analysis were then used to identify core topics in Hong Kong nationalism discourse, guiding the selection of materials for subsequent qualitative analysis.
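The construction of such a co-occurrence network can be sketched as follows. The concept lists are hypothetical, standing in for keywords or issues extracted from real documents:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_edges(documents):
    """Build weighted edges between concepts appearing in the same document;
    the resulting edge list defines a semantic network."""
    edges = Counter()
    for concepts in documents:
        # Sort so each unordered pair is counted under one canonical key
        for pair in combinations(sorted(set(concepts)), 2):
            edges[pair] += 1
    return edges

# Hypothetical concept lists extracted from three documents
docs = [
    ["nation", "pride", "flag"],
    ["nation", "flag"],
    ["economy", "trade"],
]
print(cooccurrence_edges(docs))
```

The weighted edge list can then be loaded into a network analysis package to compute centrality or detect communities of related concepts.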

Discussion

In the final section of this paper, I will discuss the future of computational methods and their uses in nationalism studies, followed by a list of risks associated with computational methods that I believe researchers should bear in mind. Lastly, I will end the paper with a proposed framework for computational nationalism studies.

The Outlook

Wimmer is certainly correct in pointing out that naturally occurring data offer unique opportunities for researchers to gain insight into nations and nationalism (Bochsler et al. 2021). It is not a coincidence that most of the work reviewed above relies on data from the internet, especially social networking sites; the rich, fine-grained, massive volume of data generated continuously by these services has transformed not only the methodology but also the research agenda across the fields of social sciences, as they vastly broaden the scope of questions we can address and the behaviors we can observe.

It is worth noting that computer science is an extremely quickly developing field (by the time I received the reviewers’ feedback, several paragraphs of this article already needed updating). It is only a matter of time before social scientists adopt and adapt new technologies developed by computer scientists into their toolkit. Within the past decade, we have already witnessed the introduction of techniques that revolutionized the field (multiple times). For a long period (in computer science’s sense), text analysis techniques relied on the bag-of-words assumption, which represents documents as unordered collections of words. Bag-of-words approaches were quickly replaced by techniques that rely on word embeddings, a family of methods to represent the meaning of words as numerical vectors in a high-dimensional space, with the notable example of word2vec introduced in 2013. The field was then once again transformed by the introduction of the transformer in 2017, a model architecture that relies on deep neural networks. Transformer-based text models (also referred to as large language models, as these models are trained using massive datasets), such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT), have revolutionized natural language processing by demonstrating unprecedented performance in understanding and generating text in a coherent way.

While pioneering work has already started to incorporate large language models into their methods (e.g. Bonikowski et al. 2022; Stier et al. 2023; Terechshenko et al. 2020), researchers have yet (at the time of writing) to exploit the full potential of what these new technologies can offer. One prime example is generative artificial intelligence. Generative artificial intelligence refers to technologies capable of producing realistic text, images, and content in other forms. An exemplar is ChatGPT, a chatbot developed by OpenAI that can generate realistic texts and complete certain tasks without any additional training.

There are several attempts to evaluate the ability of ChatGPT in performing research-related tasks, such as identifying hate speech (Huang et al. 2023) and detecting stance in social media content (Zhang et al. 2023). Gilardi et al. (2023) tested ChatGPT’s performance in annotation tasks and claimed that it outperforms crowd workers recruited through MTurk. Törnberg (2024) found that GPT achieved higher accuracy than expert coders and supervised machine learning classifiers in identifying the author of social media posts. Bail (2023) discusses the potential of generative artificial intelligence in improving online experiments, agent-based models, and automated content analyses. While generative artificial intelligence models still suffer from concerns over reliability and reproducibility, their use in nationalism research remains an interesting prospect.

Another area that has undergone rapid development is visual analysis. While we have witnessed rapid development in the computational analysis of texts, the computational analysis of visual content had seemed impossible until recently (Kroon et al. 2023). The analysis of visual data is difficult due to the challenges of extracting information encoded in an image. The fundamental unit of visual data, a pixel, contains much less meaningful information than a word (Joo & Steinert-Threlkeld 2022). Although recent developments in computer vision show promising results, the performance of computer vision systems is highly dependent on the training data, as most models are trained to predict a fixed set of predetermined object categories using crowd-annotated datasets, most notably ImageNet. These models often show poor performance on unseen objects and often fail to detect abstract concepts, which significantly limits their usability for social research. Therefore, even though the internet, and especially social media, is saturated with images, relatively few studies have successfully used visual computational techniques to address theoretical questions.

However, recent advancements in multi-modal embeddings demonstrate great potential. CLIP (Contrastive Language-Image Pre-training), for example, is trained on image-text pairs collected from the internet, which allows the model to go beyond the predetermined object categories and learn image representation from texts. GPT-4, the most recent large language model by OpenAI, also extended support for using images as input. Another potential prospect for computational nationalism studies is to leverage these tools and move beyond the current heavy reliance on textual data.

The Risks

Artificial intelligence and other computational methods are often criticized for their various biases. The development of computational tools relies heavily on data generated by human beings (Ntoutsi et al. 2020). For instance, BERT is trained on books and Wikipedia, while GPT is trained on texts from the internet. As a result, existing biases in society will likely be “learned”, reproduced, and, in some cases, amplified by machine learning models. For example, Bolukbasi et al. (2016) show that word embeddings trained on Google News articles exhibit a disturbing degree of gender stereotypes. Leidinger and Rogers (2024) demonstrate that large language models fail to mitigate stereotyping prompts, especially in terms of ethnicity and sexual orientation. When put into use, bias embedded within these tools can have significant implications for society.

In other instances, bias can arise when data for certain aspects of human behavior are easily obtainable while data for others are scarce. Ho and Chan (2023), for example, show that multilingual large language models exhibit far superior performance in English texts than in languages with lower resources. Bias can also arise among the coding categories, as Rogers and Zhang (2024) show that large language models tend to produce more generic and apolitical labels when compared to qualitative analysis. This can also have significant implications for computational research, especially when the outputs of one task (for example, topic classification) are used in conjunction with a downstream task (for example, a regression model), as the bias of these models can influence each other.

While it is beyond the scope of this piece to offer a comprehensive review of bias in artificial intelligence (see Ntoutsi et al. (2020) for an extensive survey), it is vital to point out the potential dangers of the blind use of new technology. When applying computational methods, it is important to follow established guidelines on how to validate the results and detect bias. To cite a few examples, Grimmer and Stewart (2013) propose four principles for conducting robust computational text analysis. Ho and Chan (2023) introduce an approach to evaluate the performance of multilingual models using visualization techniques and translated data-pairs. Maier et al. (2018) detail the best practices when applying topic modeling. Sen et al. (2021) propose a systematic way to understand potential errors when dealing with digital trace data.

A Framework for Computational Nationalism Studies

I will end this article with a discussion of the best practices for researching nationalism using computational methods. This section presents three principles of computational nationalism studies. They are not exclusive to the study of nationalism but are inspired by the commonly employed guidelines in computational social science, and they offer a useful guide for selecting the best methods and conducting computational analysis.

Theory, Theory, Theory. Theory is needed for almost all social science research.1 As Bonikowski and Nelson (2022) correctly point out, theory provides guiding principles for data collection and for the interpretation of findings. This is especially important for computational social science research, since computational social scientists typically use data that are “found” (cf. McFarland & McFarland 2015). Researchers often have little insight into the data generation process, and the data often suffer from limitations in representativeness and completeness. Computational analysis also necessitates a wide range of analytical decisions, from data preparation (for example, what kind of data, how much data, which search criteria, and from where?) to data analysis (for example, how to operationalize certain concepts, and which techniques to use?). Theory is therefore needed to guide decisions throughout the research process (Bonikowski & Nelson 2022). To effectively adapt computational methods, many of which are borrowed from computer science, to social science research, the logic of the chosen methods also needs to be theorized. To cite an example illustrated by Bonikowski and Nelson (2022), it is widely known that word embeddings can capture the relationships of meaning between words, which forms the foundation of a wide range of technological innovations that revolutionized computer science. However, it is sociological theory that allows Kozlowski et al. (2019) to leverage this characteristic to measure cultural associations in texts and, therefore, public culture in general.

Methodological Pluralism. Computational social science is a rapidly developing field. Grimmer and Stewart (2013), a key text that pioneered the adoption of text analysis in social sciences, listed a wide range of methods in 2013, but many of them have already been replaced by recent inventions that are claimed to be better, faster, and (in many cases) bigger. The ability to assess the strengths and weaknesses of each method and choose the most appropriate one has become the most important skill for computational social scientists.

Bonikowski and Nelson (2022) offer some guiding principles for choosing between unsupervised and supervised methods, as they argue that the two method families align reasonably well with the logic of inductive and deductive research. In general, unsupervised methods, such as topic modeling, are most useful in inductive exploration of the patterns within the dataset. Supervised methods, like supervised machine learning, excel when researchers have predefined categories in mind. However, it is also important to note that the line between supervised and unsupervised methods has become increasingly blurred by the invention of methods like semi-supervised learning and few-shot learning. On another level, choosing the right method(s) for a particular research question is not limited to the choice between two or more computational techniques. Rather, in many cases, it is also beneficial to consider methods from other paradigms (for example, qualitative methods) and data from other sources (for example, interviews).

In short, I recommend following the idea of methodological eclecticism, which involves selecting and mixing the most appropriate methods from a reservoir of qualitative, quantitative, and computational techniques for a given research task (Teddlie & Tashakkori 2010). An exemplar is Ho (2022), where the author employed a mixed-method strategy, which began with a quantitative phase where LDA topic modeling was used to reveal the major topics within the dataset. Afterwards, social network analysis was employed to gain insight into the structure of the discourse network and identify the core topics. Lastly, a qualitative analysis was conducted to theorize the core topics as predominant frames.

Validation and Evaluation. Running a computational model is relatively easy, as the model will always produce some outputs (for example, a topic model will always produce some topics), but this does not mean that the results are always useful (van Atteveldt et al. 2022). Therefore, for any computational method, it is vital to conduct validation. While there are different ways to conduct validation for different computational methods, in essence, validation requires the researcher to demonstrate that the output of the model reliably replicates human coding (Grimmer & Stewart 2013). For supervised methods, this can be done by comparing machine predictions with a human-coded “gold standard”. For unsupervised methods, we can employ established methods such as word intrusion tests and topic intrusion tests. There are also various open-source tools researchers can use. For example, the R package “oolong” offers a convenient interface to conduct validation for various topic models (Chan & Sältzer 2020). In addition, since nationalism is a highly context-sensitive subject, and we know that the performance of computational models can be affected by various biases in language and societal context, I also recommend, where possible, using techniques of explainable artificial intelligence to increase model interpretability. For example, Ho and Chan (2023) demonstrate how to use Local Interpretable Model-Agnostic Explanations (LIME) to conduct error analysis and gain insight into the characteristics of model errors.
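The comparison against a human-coded gold standard amounts to computing agreement metrics such as accuracy. A minimal sketch, with hypothetical labels for illustration:

```python
def accuracy(gold, predicted):
    """Share of machine labels that agree with the human-coded gold standard."""
    assert len(gold) == len(predicted), "label lists must align"
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Hypothetical human-coded gold standard vs. model predictions
gold      = ["nat", "nat", "other", "other", "nat"]
predicted = ["nat", "other", "other", "other", "nat"]
print(accuracy(gold, predicted))  # 0.8
```

In practice, researchers would report class-specific metrics (precision, recall, F1) as well, since overall accuracy can mask poor performance on rare but substantively important categories.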


1. A notable exception is the quantitative descriptive social science pioneered by, among others, the Journal of Quantitative Description.