1.1 Descriptive statistics and classic statistical methods for digital data

Last modified by atoikka@helsinki_fi on 2024/01/16 08:08

Descriptive data

Describing digital data sets is similar to describing any quantitative data (frequencies, means, standard deviations), but not exactly the same. First, what you are often describing is the metadata, or data about data. For example, in a text data set, the text is the data of interest, and where it originated (e.g. URL), when it was written, who wrote it and so forth can be considered metadata. Simple summaries of such metadata should be reported, whether or not the metadata is used in the analysis. Second, the actual data is often very complex, like text or images, and simple descriptive statistics are usually not applicable to it.

For metadata, you should aim to make summary information available to interested readers in as much detail as privacy concerns and practical limitations allow. This can be done in appendices or as online supplements. Depending on context, this can mean frequency tables of years in a data set spanning multiple years, or histograms of counts of messages by author in a discussion forum corpus. The aim should be to give the reader the best tools to evaluate your data.
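
As a rough illustration of the frequency tables mentioned above, the following Python sketch counts messages per year and per author using only the standard library. The records, the field names (author, year) and the helper frequency_table are hypothetical and stand in for whatever metadata your own corpus has.

```python
from collections import Counter

# Hypothetical metadata records; in practice these come from your own corpus.
messages = [
    {"author": "user_a", "year": 2019},
    {"author": "user_b", "year": 2019},
    {"author": "user_a", "year": 2020},
    {"author": "user_c", "year": 2021},
    {"author": "user_a", "year": 2021},
    {"author": "user_b", "year": 2021},
]


def frequency_table(records, field):
    """Count how many records share each value of the given metadata field."""
    return Counter(record[field] for record in records)


year_counts = frequency_table(messages, "year")
author_counts = frequency_table(messages, "author")

# Frequency table of years, in chronological order.
for year in sorted(year_counts):
    print(year, year_counts[year])

# Messages per author, most active first.
for author, count in author_counts.most_common():
    print(author, count)
```

If author names are sensitive, the same counts can be reported with pseudonymous identifiers or only as a distribution (how many authors wrote 1, 2, 3, ... messages).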

For complex data, simple summaries are not usually possible. Still, you should strive to give the reader the best possible understanding of your data set. Things like visualizations of document length distributions can be very useful.
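
One possible way to summarize document lengths is sketched below in Python. It assumes that splitting on whitespace is an acceptable approximation of word counts, and the example documents, the helper names and the bin width of 5 are made up for illustration.

```python
import statistics
from collections import Counter

# Hypothetical documents; replace with the texts in your own corpus.
documents = [
    "Short post.",
    "A slightly longer post with a few more words in it.",
    "Another short one.",
    "This message goes on for quite a while and contains many more words than the others do.",
]


def token_lengths(texts):
    """Approximate each document's length as its number of whitespace-separated tokens."""
    return [len(text.split()) for text in texts]


def length_histogram(lengths, bin_width=5):
    """Group lengths into bins of bin_width tokens, keyed by the lower bin edge."""
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(bins.items()))


lengths = token_lengths(documents)
print("median length:", statistics.median(lengths))

# A plain-text histogram: one '#' per document in each length bin.
for lower_edge, count in length_histogram(lengths).items():
    print(f"{lower_edge:>3}-{lower_edge + 4:<3} {'#' * count}")
```

The binned counts can be passed to any plotting tool to produce a figure for an appendix or online supplement.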

If you are using ready-made tools (e.g. software, online analysis platforms), your tool will likely come with some methods of describing your data sets. The various tools listed in this wiki will have their own functions for generating summary statistics of metadata and analysis data. Use those, but if your tool does not include such functions, use another tool for the description, such as Excel, R or Python.

If you are using R or Python, here are some examples of good tutorials and tools:

R:
- https://www.tidytextmining.com/tidytext
- https://juliasilge.com/blog/learn-tidytext-learnr/

Regression methods

Some digital data research questions can be answered with classic/standard regression methods, either directly (e.g. do different authors write different amounts of text?) or after enriching the data set with computational (e.g. following a sentiment analysis, predict sentiment based on time of day the message was written) or qualitative methods (e.g. use a classification based on a qualitative deep reading to predict reactions).
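
To make the enriched-data case concrete, here is a minimal Python sketch of a simple least-squares regression of a sentiment score on the hour a message was written. The data values are invented, the function name fit_simple_ols is hypothetical, and a real analysis would normally use a statistics package that also reports standard errors and handles the cyclical nature of time of day.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-products of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented example: hour of day (0-23) and a sentiment score per message,
# as it might look after running a sentiment analysis on the corpus.
hours = [1, 8, 9, 13, 17, 22]
sentiment = [-0.2, 0.1, 0.3, 0.2, 0.4, 0.0]

slope, intercept = fit_simple_ols(hours, sentiment)
print(f"sentiment = {intercept:.3f} + {slope:.3f} * hour")
```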
