2.2 Computational text analysis

Last modified by Tomi Toivio on 2024/11/15 11:22

Computational methods for text analysis can be broadly divided into supervised and unsupervised methods. Supervised text analysis draws on pre-classified training data to infer classes for unseen documents. Unsupervised analysis groups texts based on their features (e.g., word frequencies) without any prior information about possible document classes. This distinction does not capture every approach to text analysis, but it is a useful way of recognizing differences between methods and their uses.
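The difference between the two families can be illustrated with a minimal sketch in plain Python. It is not taken from any of the tools listed on this page: the toy documents, the labels and the function names (`train_centroids`, `classify`, `cluster`) are all assumptions made for illustration, and real projects would normally use an established library instead.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Lowercase the text, split on whitespace and count word frequencies."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Supervised: the classes are given in advance through labelled examples.
def train_centroids(labelled_docs):
    """Sum the word counts of each class into one centroid per label."""
    centroids = {}
    for text, label in labelled_docs:
        centroids.setdefault(label, Counter()).update(bag_of_words(text))
    return centroids


def classify(text, centroids):
    """Assign the label whose centroid is most similar to the text."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Unsupervised: no labels; documents are grouped by similarity alone.
def cluster(docs, threshold=0.3):
    """Greedy one-pass clustering: a document joins the first cluster whose
    seed document is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, [document indices])
    for i, text in enumerate(docs):
        vec = bag_of_words(text)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


# Hypothetical toy data, invented for this example.
training = [
    ("taxes budget economy spending", "economy"),
    ("budget deficit taxes growth", "economy"),
    ("climate emissions warming energy", "environment"),
    ("emissions forests climate policy", "environment"),
]
unlabelled = [
    "taxes budget economy",
    "climate emissions energy",
    "budget taxes deficit",
    "emissions climate forests",
]

centroids = train_centroids(training)
print(classify("new taxes and budget cuts", centroids))
print(cluster(unlabelled))
```

In the supervised half the categories "economy" and "environment" exist before the analysis starts. In the unsupervised half the groups emerge from the word frequencies, and it is left to the researcher to interpret what each group means.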

LLMs

Tools
  • Ollama for running open-source LLMs locally.
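As a rough sketch of how a locally running Ollama server could be used to label a text, the example below sends a prompt to Ollama's HTTP API with the Python standard library. It assumes that Ollama is running on its default port 11434 and that a model has already been downloaded. The model name `llama3`, the prompt wording and the helper names `build_request` and `classify_with_llm` are assumptions made for illustration.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(text, labels, model="llama3"):
    """Build (but do not send) a request asking the model to pick one label."""
    prompt = (
        "Classify the following text into exactly one of these categories: "
        + ", ".join(labels)
        + ".\nAnswer with the category name only.\n\nText: "
        + text
    )
    # "stream": False asks for one complete JSON answer instead of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify_with_llm(text, labels, model="llama3"):
    """Send the request; this only works while an Ollama server is running."""
    with urllib.request.urlopen(build_request(text, labels, model)) as resp:
        return json.loads(resp.read())["response"].strip()
```

Calling `classify_with_llm("Taxes will rise next year", ["economy", "environment"])` returns whatever text the model generates, so the answer should be checked against the allowed labels before it is used as data.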

Supervised methods

Tools

Unsupervised methods

Tools
  • NLTK
    • Guidebook for processing natural languages with Python
  • TAPoR
    • TAPoR is a gateway to the tools used in sophisticated text analysis and retrieval
    • With TAPoR 3.0 you can
      • Discover text manipulation, analysis and visualization tools
      • Discover historic tools
      • Read tool reviews and recommendations
      • Learn about papers, articles and other sources about specific tools
      • Tag, comment, rate and review collaboratively
      • Browse lists of related tools in order to discover tools more easily
  • Wordfish
    • Wordfish is a computer program written in the R statistical language to extract political positions from text documents.
    • http://www.wordfish.org
  • spaCy
    • Python NLP library whose pipeline components include tok2vec, tagger, morphologizer, parser, lemmatizer, senter and ner.
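For the Wordfish entry in the list above, the underlying scaling model is usually written as a Poisson model of word counts. The notation below is the conventional one and is given only as orientation, not as a description of the program's internals:

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\lambda_{ij} = \exp\left(\alpha_i + \psi_j + \beta_j \, \omega_i\right)
```

Here y_{ij} is the count of word j in document i, \alpha_i is a document fixed effect (roughly, document length), \psi_j is a word fixed effect (overall word frequency), \beta_j measures how strongly word j discriminates between positions, and \omega_i is the estimated position of document i. No labelled documents are needed, which is why the method is listed under unsupervised methods.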

Analysing text data with data science: topic modelling

Tools
  • Gensim
    • Python library for topic modelling
    • Analyse text documents for semantic structure
  • R libraries for LDA topic modelling and its variants
    • Packages tm, stm
    • LDAvis
  • MALLET
    • Open-source software for the topic modeling of text
  • CorEx
  • topicmodels
  • BERTopic
    • Python topic modeling technique that uses transformers and c-TF-IDF to create dense clusters of texts.
    • Enables easily interpretable topics while keeping important words in each topic description.
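The c-TF-IDF step mentioned in the last entry can be sketched in plain Python. The sketch follows the formula given in the BERTopic documentation, W = tf × log(1 + A / f), where tf is the frequency of a term within one cluster, f is its frequency across all clusters and A is the average number of words per cluster. It is a simplified illustration rather than the library's own code, which differs in details such as normalisation. The function names and the toy clusters are assumptions made for illustration.

```python
from collections import Counter
import math


def c_tf_idf(docs_per_class):
    """Compute class-based TF-IDF weights.

    docs_per_class maps a cluster label to a list of document strings.
    Returns a dict: label -> {term: weight}.
    """
    # Treat all documents of a cluster as one long document.
    class_counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in docs_per_class.items()
    }
    # f: frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)
    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(weights, label, n=3):
    """Return the n highest-weighted terms of a cluster (ties sorted by term)."""
    ranked = sorted(weights[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical clusters, e.g. produced by an earlier clustering step.
clusters = {
    "a": ["the budget and the taxes", "the taxes rise"],
    "b": ["the climate and the emissions", "the emissions fall"],
}
weights = c_tf_idf(clusters)
print(top_terms(weights, "a", n=1))
print(top_terms(weights, "b", n=1))
```

Words that are frequent in one cluster but rare elsewhere receive the highest weights, which is what makes the resulting topic descriptions readable. On a corpus this small, function words such as "the" still score fairly high, so stop-word removal remains useful in practice.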
Sources

Baumer, Eric P. S.; Mimno, David; Guha, Shion; Quan, Emily & Gay, Geri K. (2017). Comparing grounded theory and topic modeling: Extreme divergence or unlikely convergence? Journal of the Association for Information Science and Technology 68:6, 1397–1410.

Burscher, Bjorn; Vliegenthart, Rens & de Vreese, Claes H. (2016). Frames Beyond Words. Social Science Computer Review 34:5, 530–545.

Denny, Matthew J. & Spirling, Arthur (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis 26:2, 168–189.

https://www.frontiersin.org/articles/10.3389/fpo

Mohr, John W. & Bogdanov, Petko (2013). Topic models: What they are and why they matter. Poetics 41:6, 545–569.

Nelimarkka, Matti (2019). Aihemallinnus sekä muut ohjaamattomat koneoppimismenetelmät yhteiskuntatieteellisessä tutkimuksessa: kriittisiä havaintoja. Politiikka 61:1, 6–33.

Purhonen, Semi & Toikka, Arho (2016). ”Big datan” haaste ja uudet laskennalliset tekstiaineistojen analyysimenetelmät: esimerkkitapauksena aihemallianalyysi tasavallan presidenttien uudenvuodenpuheista 1935–2015. Sosiologia 53:1, 6–27.

Pääkkönen, Juho & Ylikoski, Petri (2021). Humanistic interpretation and machine learning. Synthese 199, 1461–1497.

Toivanen, P., Huhtamäki, J., Valaskivi, K., & Tikka, M. (2020). Aihemallinnus hybridin mediatapahtuman ja merkitysten kierron tutkimuksessa. Media & Viestintä, 43(1). https://doi.org/10.23983/mv.91078

Törnberg, Anton & Törnberg, Petter (2016). Combining CDA and topic modeling: Analyzing discursive connections between Islamophobia and anti-feminism on an online forum. Discourse & Society 27:4, 401–422.

Ylä-Anttila, Tuukka; Eranti, Veikko & Kukkonen, Anna (2018). Aihemallinnuksesta kehysmallinnukseen. Politiikka 60:2, 148–156.

Ylä-Anttila, Tuukka; Eranti, Veikko & Kukkonen, Anna (2021). Topic modeling for frame analysis: A study of media debates on climate change in India and USA. Global Media and Communication 18:1, 91–112.

Brier, Alan & Hopp, Bruno (2011). Computer Assisted Text Analysis in the Social Sciences. Quality & Quantity 45, 103–128.

DiMaggio, Paul (2015). Adapting Computational Text Analysis To Social Science (and Vice Versa). Big Data & Society 2:2.

Grimmer, Justin & Stewart, Brandon M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis 21:3, 267–297.

Shah, Dhavan; Culver, Kathleen; Hanna, Alexander; Macafee, Timothy & Yang, JungHwan (2015). Computational approaches to online political expression: Rediscovering a 'science of the social'. In Coleman, Stephen & Freelon, Deen (eds.). Handbook of Digital Politics. Cheltenham: Elgar Publishing, 281–305. (Helka)

Wilkerson, John & Casas, Andreu (2017). Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges. Annual Review of Political Science 20:1, 529–544.

Other approaches

Tools
  1. Digitalresearchtools
    • Digital Research Tools Wiki's listing of text analysis tools
  2. AntConc
    • Toolkit for corpus analysis
  3. Voyant
    • Web-based reading and analysis environment for digital texts
Sources

Yu, Bei; Kaufmann, Stefan & Diermeier, Daniel (2008). Classifying Party Affiliation from Political Speech. Journal of Information Technology & Politics 5:1, 33–48.