Data cleaning and transformation

Last modified by Anniina Kuusijärvi on 2024/02/12 12:32

Page in Finnish: Datan siivous ja muokkaus

These instructions describe how old specimen data should be cleaned/edited before saving it to Kotka. These instructions can be applied to other similar datasets, too.

General

The goal is:

  • Transform the data to the format of Kotka
    • split or combine the data to correct columns/fields
    • transform the data to use standardized vocabularies in certain fields
  • Fix errors
  • Supplement the data if necessary
  • Harmonize the data if possible so that contents entered as free text are expressed in a uniform manner

Cleaning the data is each customer's (team, organisation) own responsibility, but it pays off to plan the data cleaning together with FinBIF/Luomus Biodiversity informatics unit. BDI unit helps especially with larger and more complicated data.

Electronic data can be imported to Kotka in two ways:

A) Customer saves the data to an Excel file and imports that to Kotka. This is the recommended way, if there are only a few thousand rows of data. Data has to be transformed to Kotka Excel format for the import to be successful. Excel import sheets and data field descriptions can be found in Kotka.

B) Customer agrees with FinBIF/Luomus BDI that they import the data. This is recommended, if there are thousands or tens of thousands rows of data. Data import is the easier the better the data fits to Kotka import Excel format.

Important principles

  1. There is one named person responsible for the data cleaning. The person decides who edits the data, when and how.
  2. The data that is being cleaned will not be sent to people inside Luomus by email, and is not saved on one's own computer, but is saved on a network drive. (Outside Luomus data of course needs to be sent by email, but keep this to the minimum and keep track of versions.)
  3. Make sure to keep track of the versions of the data and especially of the master data (primary copy of the data). All changes, edits and additions are made only to the master data. This is to avoid the situation where there are several copies of the same data and none of them include all the changes. It is recommended to include the word PRIMARY in the master data file. If copies of this file are taken, the word is left out.
  4. If changes made to the file need to be somehow highlighted, the person cleaning the data creates e.g. a new column Changed to the file, and enters the number 1 on those rows that have been changed. This way these rows are easily found and filtered. It is possible to highlight changes using for example colours, but colour should not be the only way of highlighting, because it is not easy to search by colour and the colours are not kept for example when an Excel file is imported to OpenRefine.

Documentation and archiving data

Throughout the data cleaning process it is worth taking archive copies of the data at regular intervals and write down what changes have been made to each copy. This way it is easier to go a few steps back to a previous version of the data, should for example part of the data get accidentally destroyed.

Also one should write down the steps of the cleaning process, what has been done to the data and how. The aim is to make it possible for future users of the data to track possible inconsistencies or errors in the data and assess its quality. Try to think what you as a data user would like to ask from people that have processed the data before you?

  • Has the data been supplemented with data from other sources, such as field journals or publications?
  • Have original abbreviations been spelled out (person names, variables)?
  • Have locality names been modernised?
  • Were interpretations made for the data?

If the data is being cleaned using OpenRefine, the edit history should be saved using the Extract Operation History -tool.

Tools

Small edits and fixes to the data can be done in for example Excel. For a larger cleaning project e.g. OpenRefine is more suitable (former Google Refine, openrefine.org). It is a handy tool to search for anomalies in the data, harmonise spelling, split and combine columns etc. Changes done in OpenRefine should be saved to a file, that is stored together with the cleaning documentation.

Documentation and notes can be written with a text editor (Notepad or such) or to a Word file.

Help with data cleaning and transformation

  • kotka(at)luomus.fi
  • OpenRefine tutorial and related data file (In Finnish):

OpenRefineTutoriaalinMateriaali.xlsx

Failed to execute the [office] macro. Cause: [The wiki needs to be connected to an office server in order to view office files. Ask your administrator to configure such a server.]. Click on this message for details.