DiSSCo Prepare 3.2.2 Best practice standardised Extract, Transform and Load (ETL) procedures
Version 0.0.1 / 2022-02-04
PLACEHOLDER: Authors
PLACEHOLDER: Author affiliation
This document is a working draft and is not going to be published in its current format. Final format will be a report and a living best practices documentation on an open platform that allows people to contribute to the document.
Abstract
DiSSCo (Distributed System of Scientific Collections) is a pan-European Research Infrastructure (RI) that among other things aims to create a digitisation infrastructure for natural science collections.
Table of Contents
3.4 Identified Workflow Procedures
Table 2: Template of Best Practices
Table 3: Explanations of the Template of Best Practices
4.1 Infrastructure recommendations
4.2 Organisational recommendations
4.3 Identifier recommendations
4.4 Image transformation recommendations
4.5 Specimen data recommendations
4.6 Quality control recommendations
4.7 Media metadata recommendations
4.8 OCR recommendations
4.9 Crowdsourcing recommendations
4.10 CT scans / 3D model recommendations
4.11 Analytical/chemical/molecular data recommendations
APPENDIX: REVISION HISTORY
APPENDIX: IMPLEMENTATION DEMONSTRATIONS
APPENDIX: Workflows / documentation provided for this WP
APPENDIX: Literature with digitisation workflows
APPENDIX: Before ETL workflows
APPENDIX: After ETL workflows
1. Introduction
DiSSCo (Distributed System of Scientific Collections, https://www.dissco.eu/) is a pan-European Research Infrastructure (RI) that, among other things, aims to create a digitisation infrastructure for natural science collections. An overview of the infrastructure is described in its conceptual design blueprint (Har2020). DiSSCo is a new world-class research infrastructure for natural science collections. It is assumed that up to 40 million specimens may need to be digitised each year so that a significant part of the important public natural history collections can be digitised in a foreseeable time. Each digitised specimen can generate up to hundreds of megabytes of data, produced at distributed digitisation facilities across Europe. The large amount of data generated at the digitisation stations goes through various Extract, Transform, Load (ETL) procedures before it is shown in a Collection Management System (CMS) and/or data sharing/publication portals. ETL procedures are critical to the digitisation process. Therefore, it is necessary to provide best practices on standardised ETL procedures to facilitate and optimise the digitisation process at DiSSCo institutions.
This project report was written as a formal Milestone (M3.6) of the DiSSCo Prepare Project (https://www.dissco.eu/dissco-prepare/). The following text is the formal description (Subtask 3.2.2) from the DiSSCo Prepare project’s Description of the Action (workplan):
Subtask 3.2.2 Standardised Extract Transform and Load (ETL) procedures. Handling metadata and images during digitisation involves many transformations, as information is modified and held in various temporary (staging) environments, before reaching the institutional collection management Systems (CMS) and being made accessible through public portals.
This subtask will document best practices for these processes, where necessary including the computational workflows required to support data transformations.
This Best Practice Document (BPD) will help enhance natural science collection specimen digitisation capacity across DiSSCo partners and the DiSSCo national nodes. Together with other deliverables of WP3.2, this document will form a Community Digitisation Manual. This work was carried out by the following task partners:
- Natural History Museum, London (NHM)
- Finnish Museum of Natural History (Luomus)
- Meise Botanic Garden (MeiseBG)
- Museum für Naturkunde, Berlin (MfN)
- Royal Botanic Garden, Edinburgh (RGBE)
- Universidade de Lisboa (ULISBOA)
In this work, we first give an overview of the goal and scope of this Best Practice Document (BPD). Second, digitisation workflows from partner institutions and other related work were reviewed to identify the potential ETL procedures in each part of the workflow. Third, we compiled a list of best practice recommendations. The work ends with a discussion chapter.
2. Overview of the Work
2.1 Scope
Extract, Transform, Load (ETL) is an umbrella concept that can mean moving any data from place A to place B. The term ETL is most commonly used in the context of moving data from multiple databases into a single data warehouse for analytics. In the context of a natural history collection digitisation workflow, the ETL processes can be considered to start from moving data forward from the digitisation stations and to end at the point where the data is in the Collection Management System (CMS) and/or data sharing/publication portals. After the ETL process, there may be many further steps (handled, for example, by the CMS) concerning data validation, cleaning, annotation, crowdsourcing, using AI-based methods for enriching the data and finally sharing it to third-party platforms such as the Global Biodiversity Information Facility (GBIF), to name only a few. The aforementioned steps are not covered by this Best Practice Document (BPD), except for AI-based methods, which may be applied during data transformation in the ETL procedures. Long-term archiving may happen as part of the initial ETL process or after it; we have included it in this BPD as it is an important data transformation. Figure 1 shows the generalised data flow and the scope of the BPD on the ETL procedures in the digitisation process.
Figure 1: Generalised view of how information from physical specimens can reach data users as a result of digitisation. (OCR=Optical Character Recognition; AI=Artificial Intelligence; ETL=Extract-Transform-Load procedures; CMS=Collection Management System; API=Application Programming Interface; GBIF=Global Biodiversity Information Facility; IIIF=International Image Interoperability Framework; Long-term archiving=medium where data is stored "forever")
The level of specimen digitisation varies between institutions. The variation can stem from:
- Collection types (insects, herbarium sheets, mosses, microscope slides, fossils, rocks, ...)
- Collection sizes (from a few specimens to millions of individual specimens)
- Digitisation media (textual data, images, CT scans/3D models, DNA barcodes, ...)
- Organisational maturity (from "disorganised" to higher levels of institutional organisation)
- Organisation size (many teams vs. individual people)
- Technical advancement (from manual to semi-automated to almost fully automated, and from human labour to AI- and robotics-based methods)
Different digitisation levels demand different approaches. For example, for massive insect collections a high level of automation is needed, but setting up such an infrastructure for a small rock collection would not be ideal. Thus, it is not possible to propose a single standardised procedure. Instead, this BPD lists a number of recommendations. Each party can use the recommendations to evaluate whether they apply to their particular digitisation projects.
Furthermore, since institutions operate inside very different infrastructures, this BPD does not recommend any particular software, service providers or other concrete methods for implementing the recommendations. The recommendations should be considered goals that institutions starting to set up or improve their digitisation infrastructure should try to meet.
2.2 Audience
The main target audience of the Community Digitisation Manual has been agreed to be institutions that are at the beginning of building up their digitisation process. The recommendations in this BPD are categorised according to their level of advancement, from very basic/must-have recommendations to more advanced recommendations. To be able to implement the recommendations, the organisation must have a certain level of technological capacity and resources (servers, infrastructure): for example, a single individual cannot successfully implement the recommendations of this BPD. More details of the agreed target audience and the ALA Digitisation Maturity Model are in DiSSCo Prepare Milestone 3.5 TODO LINK.
The initial target audience for the Community Digitisation Manual was agreed to be organisations at ALA Digitisation Maturity Levels 1 and 2. At this stage, Maturity Level 0 was considered out of scope, because organisations at that level would require detailed guidance and individualised support.
2.3 BPD Template
To assess the best practices in this work, we use a template defined by Alwazae et al. (Alw2015) to formalise the goals and maturity of this BPD, as shown in Table 1.
Table 1. Overview of this BPD based on the template defined by Alwazae et al. (Alw2015).
Summary | This BP helps implement Extract-Transform-Load (ETL) procedures for data from digitisation stations to their final publishing platforms and describes how the ETL procedures should fit into the overall DiSSCo infrastructure. |
Goal | Applying the BP ensures data is not lost, improves efficiency and makes the data more interoperable. |
Means | Successfully applying this BP requires IT specialists with skills in (1) programming, (2) server administration, (3) image processing AND availability of cloud based servers and sizeable storage capacity. |
Cost | Technology costs: Institutions may have access to "free" services provided by academic research infrastructures nationally or internationally, or their parent university or other organisation may be able to provide the needed technology. However, someone ultimately has to pay for the needed resources. In terms of computation power, a minimal viable digitisation ETL process does not incur great costs. Use of advanced methods like OCR or AI techniques requires more computation resources. Storing and long-term archiving of digital material (images, 3D models) and moving the material in and out of storage can be very costly, up to tens of thousands of EUR/year. TODO: Ask partners about their figures. Personnel costs: TODO |
Barriers | TODO: Obstacles or problems that may occur before, during, and after applying the BP. |
Barrier Management | TODO: Procedures to follow if certain obstacles or problems are encountered. |
Acceptability | This document has not yet been reviewed by domain experts and it has not yet been assessed whether it helps to resolve the problem addressed by the BP. / 2022-03 |
Usability | TODO: Ask for feedback to which degree this BP is easy to use. |
Comprehensiveness | This document does not describe a comprehensive BP. Instead, various individual recommendations are listed. We have selected the most important recommendations for institutions that are at the beginning of building up their digitisation process. |
Justification | This BP has not yet been used by any institution. / 2022-03 |
Prescriptiveness | This BP offers concrete proposals for solving the problems; however, the underlying infrastructures vary so much that only examples of actual implementations can be provided. |
Coherence | This BP does not form a coherent unit: certain parts only apply to certain types of digitisation efforts. |
Consistency | This BP is (TODO WILL BE) consistent with existing knowledge and the vocabulary used in the natural science collections digitisation sector and knowledge domain, as leading experts in the field have participated in and reviewed the living document / Not yet done at the moment 2022-02 |
Granularity | TODO: The degree to which the BPD is appropriately detailed |
Adaptability | TODO: The degree to which the BP can be easily modified and adapted to other situations |
Integration | TODO: The degree to which the BP is integrated with other BPs and KM components |
Demonstration of Success | See end of this document / None so far / 2022-02 |
3. Review of Digitisation Workflows
To identify the ETL procedures in the digitisation processes, we reviewed digitisation workflows from available publications and reports, and from project partners' documents. An extensive list of the related literature is in the Appendix. We extracted steps/procedures from those workflows and sorted them into three categories in relation to the ETL procedures (before, within, and after). These lists can be seen in the Appendix.
The resulting lists of digitisation steps are not recommended workflows, or even functional workflows; rather, they are a union of all potential workflows in the digitisation of natural history collections. They are a tool for mapping the landscape so that BP recommendations on ETL can be created from the different steps/processes/procedures.
3.1 Infrastructure
Infrastructure is one of the most fundamental parts of the digitisation process. Setting it up is the first step of building the process and requires careful planning, as it affects all subsequent steps of the data flow. Therefore, good practice should be followed when setting up the digitisation infrastructure. Within the scope of the ETL processes, the infrastructure involves the local digitisation stations, remote servers, the CMS, the data backup system, etc. Generally, there are the following components (a minimal configuration sketch follows the list):
- Local storage at the digitisation site, to which the digitisation line/other hardware connects - typically a local machine (not a server)
- Staging area, to which raw digitised material is transferred from the local machine for processing - NOT meant to store data for longer periods of time - server based
- Image archive, to which large original TIFF (etc.) files are stored - cloud based
- Publishing platform file storage (image server), to which ready material is transferred so that it is accessible from the web - cloud based
- CMS/data repository (relational or other database), to which specimen data, image metadata etc. are stored - cloud based
- Backup storage, to which resources from the image archive and the publishing platform are periodically backed up
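The components above can be captured in a simple configuration that ETL scripts or deployment tooling can read. The following is a minimal sketch in Python; all component names, paths and endpoints are hypothetical examples and every institution will differ.

```python
# Minimal sketch of the storage components as configuration.
# All paths and endpoints below are hypothetical examples.
from dataclasses import dataclass

@dataclass
class StorageComponent:
    name: str
    location: str      # path, mount point or service URL
    long_term: bool    # is data meant to stay here permanently?
    backed_up: bool    # is this component covered by backups?

INFRASTRUCTURE = [
    StorageComponent("local_storage",  "D:/digitisation/incoming",                 False, False),
    StorageComponent("staging_area",   "/srv/staging",                             False, False),
    StorageComponent("image_archive",  "https://object-store.example.org/archive", True,  True),
    StorageComponent("publishing",     "https://images.example.org",               True,  True),
    StorageComponent("cms_database",   "postgresql://cms.example.org/specimens",   True,  True),
    StorageComponent("backup_storage", "https://backup.example.org",               True,  False),
]
```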
3.2 Organisational Models
Different institutions have different organisational structures for digitisation. Some may have their own in-house IT development staff; some may use outsourced digitisation contractors. For example, at the Finnish Museum of Natural History (Luomus), the digitisation workflow involves the digitisation team and three IT staff: a digitisation software developer/manager, a server administrator, and a data manager (Luo1). At Herbarium LISU there are digitisers and one IT manager (LIS1). At the Natural History Museum, London, digitisers and a data management team are involved in the digitisation workflows. Each model has its advantages and disadvantages, which should be carefully considered, based on the institution's own situation, before starting digitisation activities. We give BP recommendations on this in the next chapter.
3.3 Data Models
Data models vary between the different workflows.
Luo1: Specimen data in the CMS does not contain information about which media exist for a specimen (this information is available in the search engine (Elasticsearch) and the FinBIF national data warehouse); media are queried in real time from the Image-API using the specimen id.
LIS1: Links to media are stored with the specimen data.
TODO: Include NHM London work (Sco2019) and expand the text
3.4 Identified Workflow Procedures
3.4.1 Before ETL Workflows
The before-ETL workflows take place mostly at the digitisation stations, as shown in the Appendix "Before ETL workflows". Most of these workflows relate to barcoding of the specimens, imaging, quality control, image processing, metadata generation and upload. They also involve data transmission from the digitisation station to the staging area, CMS, image publishing platform, and backup storage. The objects handled are identifiers, images, image metadata, and specimen data. The actions in these workflows are manual, semi-automated or automated; some of them still require manual work.
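Many of the before-ETL steps produce image metadata (digitiser name, station, timestamp) alongside the images, for example as text or sidecar files (Luo1, LIS1, MfN2). The sketch below shows one possible way to write such a sidecar file in Python; the field names are illustrative assumptions, not a standard.

```python
# Sketch of writing an image metadata sidecar file at the digitisation
# station. Field names are illustrative assumptions, not a standard.
import json
from datetime import datetime, timezone
from pathlib import Path

def write_sidecar(image_path: str, digitiser: str, station_id: str) -> Path:
    image = Path(image_path)
    metadata = {
        "image_file": image.name,
        "digitiser": digitiser,
        "station_id": station_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = image.with_suffix(".json")  # e.g. E00123456.tif -> E00123456.json
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar

# Example: write_sidecar("E00123456.tif", digitiser="J. Doe", station_id="herbarium-line-1")
```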
3.4.2 ETL Workflows
The Appendix "ETL workflows" lists examples of ETL workflows. ETL workflows take place mostly in the staging area. They perform image polling, data quality control, file renaming, data export and publishing, image conversion, and data backup. Depending on the workflow, the data goes to the CMS, the image publishing platform or the image archive. Most of the workflows here are semi-automated or fully automated. Automating these workflows as far as possible increases the efficiency of the digitisation process and decreases the risk of human mistakes.
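As an illustration, a single staging-area step of this kind could poll a folder for newly arrived TIFFs, derive a publishing JPEG and move the original towards the archive. The sketch below uses local directories to stand in for the staging area, publishing platform and image archive, and uses the Pillow library; it is a minimal sketch, not any partner's actual pipeline.

```python
# Minimal sketch of one automated staging-area ETL step. The directories
# are placeholders for the staging area, publishing platform and archive.
from pathlib import Path
from shutil import move

from PIL import Image  # pip install Pillow

STAGING = Path("/srv/staging/incoming")
PUBLISH = Path("/srv/publish/jpg")
ARCHIVE = Path("/srv/archive/tiff")

def process_new_tiffs() -> None:
    """Convert newly arrived TIFFs to JPEG for publishing, then archive the originals."""
    PUBLISH.mkdir(parents=True, exist_ok=True)
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for tiff in STAGING.glob("*.tif"):
        jpeg = PUBLISH / (tiff.stem + ".jpg")
        with Image.open(tiff) as img:
            img.convert("RGB").save(jpeg, "JPEG", quality=90)
        move(str(tiff), str(ARCHIVE / tiff.name))  # original TIFF leaves the staging area

if __name__ == "__main__":
    process_new_tiffs()  # run e.g. as a scheduled background task
```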
3.4.3 After ETL Workflows
The Appendix "After ETL workflows" lists all the after-ETL workflows we found. They take place mostly in the CMS, the image archive, and the backup storage. They relate to data backup, data linkage, and data enrichment. The actions in these workflows are manual, semi-automated or automated; some of them still require manual work.
4. BEST PRACTICES
This section lists the Best Practice (BP) recommendations we have been able to determine from the review of digitisation workflows done in Chapter 3. The template for the recommendations is given in Table 2.
Table 2: Template of Best Practices
Id | EXAMPLE1 |
Level | BASIC | ADVANCED | STATE-OF-ART |
Use case | As xxx I want to xxx so that I can xxx |
Best practice recommendation | Procedure to follow/task to accomplish that fulfils the use case |
Discussion | Rationale behind the recommendation |
Implementation example | One or a few references/examples of how the recommendation has been implemented in practice, if applicable |
References | Link, Ref |
The explanations of the items in the BP template are in the Table 3.
Table 3: Explanations of the Template of Best Practices
Id | To make it easier to communicate about an individual recommendation |
Level | How demanding the recommendation is. BASIC: a fundamental goal that everyone doing digitisation should try to fulfil. ADVANCED: next steps in automating and improving performance. STATE-OF-ART: new, upcoming techniques that should perhaps not be attempted at first |
Use case | A use case which acts as the motivation for the recommendation |
Best practice recommendation | Procedure to follow/task to accomplish that fulfils the use case |
Discussion | Discussion of the rationale behind the recommendation |
Implementation example | One or a few references/examples of how the recommendation has been implemented in practice, if applicable |
References | Links to external documentation or publication which is the source of the use case or recommendation and/or to implementation example |
4.1 Infrastructure recommendations
Id | INFRA1 |
Level | BASIC (+STATE-OF-ART) |
Use case | As a digitisation manager I want no significant data loss to occur and a reliable system so that the digitisation process is not delayed |
Best practice recommendation | Your digitisation/ETL/publishing/CMS infrastructure should generally have the following components: · Local storage at the digitisation site, to which the digitisation line/other hardware connects - typically a local machine (not a server) · Staging area, to which raw digitised material is transferred from the local machine for processing - NOT meant to store data for longer periods of time - server based · Image archive, to which large original TIFF (etc.) files are stored - cloud based · Publishing platform file storage (image server), to which ready material is transferred so that it is accessible from the web - cloud based · CMS/data repository (relational or other database), to which specimen data, image metadata etc. are stored - cloud based · Backup storage, to which resources from the image archive and the publishing platform are periodically backed up (see recommendation INFRA2) · STATE-OF-ART: Long-term archive, to which all data is eventually replicated to be stored "forever" (see recommendation INFRA3) |
Discussion | · Local storage: Data is not meant to stay for a long time at the local digitisation station; it should be moved forward daily or at least weekly. Loss of these stations does not incur significant data loss, but setting up the environment again may take a long time and mean delays in the digitisation process. A Docker (etc.) image-based environment is recommended so that it can quickly be set up on any new local computer (this may not always be possible because of software licenses etc.). · Staging area: ETL procedures may require computing power, which is best provided by a server / computing cluster rather than the local machine; the procedures are automated and software driven, so ease of deploying new versions is a benefit. A state-of-the-art environment would, for example, be a Kubernetes container cluster to which the different ETL process steps are deployed as individual services/pods that co-operate to provide the ETL procedure. A test environment should exist where software is tested before putting it into production. · Image archive: The image archive should be cloud based to prevent data loss. Hard disk failures are common, which can be mitigated by running a RAID disk server. However, we do not recommend that institutions run their own disk servers or any other servers, as cloud-based services are more cost efficient, professionally managed, and data loss is almost impossible (except as a result of human error - so backups are still needed). It is a good idea to separate the live publishing server data storage (containing smaller JPGs etc.) from the original raw data (TIFF etc.). This allows, for example, using faster disks for publishing. Furthermore, as data in the image archive is not needed often, it does not need to be accessible from the internet. It can, for example, be object storage instead of a conventional file system (a minimal object-storage upload sketch follows this table). · Publishing platform file storage: Here uptime and performance are important, as is prevention of data loss (which causes downtime). We recommend a cloud-based service for these reasons. · CMS/data repository: Data loss in your CMS database would be catastrophic. It needs to be professionally administered and backed up; a cloud-based solution is a must. Databases contain text and do not typically take much space. Regular backups should be done in a professional manner. · Backup storage: Even if the original data is located on cloud-based servers, data loss can occur as a result of human error. It can be problematic to find a second place large enough for your biggest data: finding a suitable place for the image archive alone can be difficult, and the backup should be in a second location, as having the data twice on the same service does not quite fit the need. If no other solution can be found, the image archive and backup storage can reside in the same service, which at least helps in case of accidental deletion by a human. · Long-term archive: The LTA would be a third place where your data resides. It does not always fulfil the function of backup storage, as data is stored in the LTA in formats that are designed to be ever-lasting and may be modified as a result. It might not be easy to recover data from the LTA, as getting large amounts of data out is not typically what LTAs are designed for. An LTA is almost impossible to implement by your own institution, so you should look for research infrastructures that can provide the service for you. We have marked the LTA as "STATE-OF-ART" (very demanding) on this BPD's three-level scale; it is not something you should try to set up first. |
Implementation example | Luomus: · Local storage: Helsinki University IT Centre provides local workstations and administrates security, network, user accounts etc. · Staging area: Finnish IT Centre for Science (CSC) provides virtual servers (cPouta; OpenStack based) · Image archive: CSC research data storage service (IDA) - for even larger 3D scans in the future, CSC object storage (Allas), providing space in petabytes and not based on a conventional file system · Publishing platform file storage: CSC virtual server mounted disk (cPouta; OpenStack based) · CMS/data repository: Helsinki University IT Centre provided Oracle database (running on their OpenStack based virtual server environment) · Backup storage: For publishing platform images: Helsinki University provided disk; for image archive: none so far · Long-term archive: Not yet implemented; will be at the CSC provided national service (Digital Preservation Service (DPS)) |
References | TODO |
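As noted in the discussion above, the image archive does not need to be a conventional file system and can be object storage instead. The sketch below shows how an original TIFF could be pushed to an S3-compatible object store using boto3; the endpoint URL, bucket name and key layout are assumptions, and credentials are expected to come from the environment.

```python
# Sketch of archiving an original TIFF to an S3-compatible object store.
# Endpoint, bucket and key layout are illustrative assumptions; credentials
# are read from the environment/AWS config by boto3.
import boto3  # pip install boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.org")

def archive_original(local_path: str, specimen_id: str) -> None:
    # Key the object by the specimen identifier so it can be located later.
    s3.upload_file(local_path, "image-archive", f"originals/{specimen_id}.tif")

# Example: archive_original("/srv/staging/E00123456.tif", "E00123456")
```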
Id | INFRA2 |
Level | BASIC |
Use case | TODO |
Best practice recommendation | TODO: Backups |
Discussion | TODO |
Implementation example | TODO |
References | TODO |
Id | INFRA3 |
Level | BASIC |
Use case | TODO: Long-term archiving (LTA) |
Best practice recommendation | TODO |
Discussion | TODO |
Implementation example | TODO |
References | TODO |
Id | INFRA4 |
Level | BASIC |
Use case | TODO |
Best practice recommendation | Do not use SSD disks on local servers / staging areas |
Discussion | At digitisation stations, high volumes of data are constantly coming in and then being deleted. SSDs can sustain only a limited number of write/erase cycles. Use traditional hard disks. |
Implementation example | TODO |
References | TODO |
TODO: information security recommendations...
TODO: clean-up of digitisation stations
4.2 Organisational recommendations
Id | O1, O2 |
Level | BASIC |
Use case | As a museum director I want to use limited monetary resources efficiently so that I can provide the best value to society. |
Best practice recommendation | O1: Automate recurrent routine tasks as much as possible as part of the ETL process. O2: Employ/acquire one or a few software developers instead of adding more digitisation staff to speed up digitisation. |
Discussion | Software development is expensive, but spending development resources on automating tasks will eventually save money by reducing staff costs (or allowing those staff to be used more efficiently). |
Implementation example | Instead of having staff manually create thumbnails with an image editor, develop an image service that does the job; use existing image libraries (such as ImageMagick). A sketch follows this table. |
References | All2019; TODO |
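As a concrete illustration of O1 and the implementation example above, a thumbnail step can be a thin wrapper around an existing image tool. The sketch below shells out to ImageMagick and assumes it is installed and available on the PATH ("magick" on ImageMagick 7, "convert" on older installations); the sizes and paths are arbitrary examples.

```python
# Sketch of automated thumbnail creation using ImageMagick (O1/O2).
# Assumes the "magick" command (ImageMagick 7) is on the PATH; older
# installations use "convert" instead.
import subprocess
from pathlib import Path

def make_thumbnail(source: str, target_dir: str, size: str = "256x256") -> Path:
    target = Path(target_dir) / (Path(source).stem + "_thumb.jpg")
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["magick", source, "-thumbnail", size, str(target)],
        check=True,  # raise if ImageMagick reports an error
    )
    return target

# Example: make_thumbnail("E00123456.tif", "/srv/publish/thumbs")
```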
Id | O3 (TODO - subjective - needs discussion) |
Level | BASIC |
Use case | As a digitisation manager I want to prioritise digitisation efforts based on scientific criteria rather than existing procedures so that I can provide the information that is most needed by research |
Best practice recommendation | O3: Maintain sufficient in-house skills in IT (software development and server administration) |
Discussion | TODO: More subjective than most recommendations. Digitisation is a continuously changing field where advances are made all the time. New kinds of digitisation projects start and end as digitisation moves to different types of collections. This means that changes to existing processes are often needed. Buying services from an outsourced party may not be flexible enough. On the other hand, should an excellent partner exist, they may be able to keep up with technological advancement better than a small institution's own IT staff. Ordering software and services from an outside partner also requires IT skills and knowledge, so that you can specify and explain to the technical people exactly what service you need. TODO: Weigh the pros and cons of the approaches |
Implementation example | TODO |
References | TODO |
4.3 Identifier recommendations
PLACEHOLDER
ID1: OpenDS identifier minting
TODO: multiple specimens - NHM8 has a workflow
4.4 Image transformation recommendations
PLACEHOLDER
jpeg originals
thumbnails (jpeg or png)
zoomify files
raw data is archived (tiffs etc)
4.5 Specimen data recommendations
Id | DD1, DD2 |
Level | ADVANCED |
Use case | As a researcher I want to know whether data is reliable/complete so that I can determine whether it can be included in my research. |
Best practice recommendation | DD1: When data is extracted from the digitisation platform to the CMS, make sure the distinction between fields marked as empty/missing and fields not yet databased is not lost in transition or mixed up. DD2: If OCR is applied during the ETL process, the CMS should support marking a data field as "automatically filled" and the ETL process should make sure to record this information. |
Discussion | A data field value can have one of the following statuses (TODO: needs more work; a sketch of these statuses as an enumeration follows this table): 1. absent: the information was not documented at the time of the collection event and cannot be resolved later 2. unknown: the information is documented but is not yet databased 3. unknown:missing: the information would have been databased but is absent 4. unknown:indecipherable: the information appears to be present but failed to be captured 5. automatically filled: the information has been databased using automated methods (OCR) but not yet cleaned/verified by a human 6. default: the information is present and has no known problems 7. erroneous: the information is present but contains errors / has been marked as unreliable by a human 8. unknown:withheld: the information is databased but has been withheld by the provider (Note: not a factor for ETL processes; this is a data publishing problem) |
Implementation example | Not known to be fully implemented? TODO |
References |
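The field-value statuses listed in the discussion can be carried through the ETL process as an explicit flag next to each value, so that "empty", "not yet databased" and "automatically filled" are never conflated (DD1, DD2). The sketch below encodes the statuses from the table as a Python enumeration; the record structure is an illustrative assumption.

```python
# Sketch of carrying the DD1/DD2 field-value status through the ETL process.
# Status names follow the list in the discussion above; the record layout
# is an illustrative assumption.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FieldStatus(Enum):
    ABSENT = "absent"                                  # never documented at collection time
    UNKNOWN = "unknown"                                # documented but not yet databased
    UNKNOWN_MISSING = "unknown:missing"
    UNKNOWN_INDECIPHERABLE = "unknown:indecipherable"
    AUTOMATICALLY_FILLED = "automatically filled"      # e.g. OCR output, unverified (DD2)
    DEFAULT = "default"
    ERRONEOUS = "erroneous"
    UNKNOWN_WITHHELD = "unknown:withheld"

@dataclass
class FieldValue:
    value: Optional[str]
    status: FieldStatus

record = {
    "collector": FieldValue("A. Collector", FieldStatus.DEFAULT),
    "locality": FieldValue("Helsinki", FieldStatus.AUTOMATICALLY_FILLED),  # filled by OCR
    "coordinates": FieldValue(None, FieldStatus.UNKNOWN),  # on the label, not yet databased
}
```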
PLACEHOLDER
4.6 Quality control recommendations
PLACEHOLDER
4.7 Media metadata recommendations
PLACEHOLDER
What fields SHOULD be present in the Metadata
What fields SHOULD NOT be present (if any — for occurrence images it is important to remove coordinate information for sensitive species, but for digitisation stations there is no information that could not be shared?)
4.8 OCR recommendations
PLACEHOLDER
4.9 Crowdsourcing recommendations
PLACEHOLDER
4.10 CT scans / 3D model recommendations
PLACEHOLDER
4.11 Analytical/chemical/molecular data recommendations
PLACEHOLDER
REFERENCES
All2019: Allan L et al. (2019) Digitisation using Automated File Renaming and Processing. Microscope Slides. (TODO PUBLISHED?)
Alw2015: Alwazae M., Perjons E, & Johannesson P (2015) Applying a Template for Best Practice Documentation. Procedia Computer Science 72 (2015) 252 – 260. https://doi.org/10.1016/j.procs.2015.12.138
Dil2019: Dillen M, Groom Q, & Hardisty A. (2019). Interoperability of Collection Management Systems. Zenodo. https://doi.org/10.5281/zenodo.3361598
Dri2014: Drinkwater R, Cubey R, Haston E (2014) The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels. PhytoKeys 38: 15-30. https://doi.org/10.3897/phytokeys.38.7168
Gro2019: Groom Q et al. (2019) Improved standardization of transcribed digital specimen data. Database, Volume 2019, 2019, baz129. https://doi.org/10.1093/database/baz129
Has2012a: Haston E, Cubey R, Pullan M, Atkins H, Harris D (2012) Developing integrated workflows for the digitisation of herbarium specimens using a modular and scalable approach. ZooKeys 209: 93-102. https://doi.org/10.3897/zookeys.209.3121
Has2012b: Haston E, Cubey R, & Harris D J (2012) Data concepts and their relevance for data capture in large scale digitisation of biological collections. IJHAC, Volume 6, Issue 1-2. https://doi.org/10.3366/ijhac.2012.0042
Har2020: Hardisty A, Saarenmaa H, Casino A, Dillen M, Gödderz K, Groom Q, Hardy H, Koureas D, Nieva de la Hidalga A, Paul DL, Runnel V, Vermeersch X, van Walsum M, Willemse L (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure - DELIVERABLE D8.1. Research Ideas and Outcomes 6: e54280. https://doi.org/10.3897/rio.6.e54280
Hid2020: Nieva de la Hidalga A, Rosin PL, Sun X, Bogaerts A, De Meeter N, De Smedt S, Strack van Schijndel M, Van Wambeke P, Groom Q (2020) Designing an Herbarium Digitisation Workflow with Built-In Image Quality Management. Biodiversity Data Journal 8: e47051. https://doi.org/10.3897/BDJ.8.e47051
Sco2019: Scott B, Baker, E, Woodburn M, Vincent S, Hardy H, Smith V S (2019) The Natural History Museum Data Portal, Database, Volume 2019, 2019, baz038, https://doi.org/10.1093/database/baz038
APPENDIX: REVISION HISTORY
TODO: Flatten initial version history to first published version after release
Version / date | Author(s) | Description |
Version 0.1 / 2022-02-07 | Esko Piirainen | Initial template, compilation of different workflows, skeletal for best practices |
APPENDIX: REVIEW HISTORY
Review date | Reviewed version | Reviewer | Notes |
APPENDIX: IMPLEMENTATION DEMONSTRATIONS
Template for reporting how the BP was applied in an organisation
Implementing organization | Name of org, contact person, contact info |
Implementation time | Start date, end date |
Implementation cost | How many person-months the implementation took |
Experiences and feedback | |
Measurements | See Appendix: Measurement. Report measurable improvements in performance |
The BP has not currently been demonstrated in practice.
APPENDIX: MEASUREMENT
PLACEHOLDER: Indicators for measuring the quality and performance of the BP
APPENDIX: Workflows / documentation provided for this WP
Link | Organisation | Desc | Ref | Done | Notes |
Luomus | Workflow for insect-line mass digitisation process Workflow for non-mass digitisation processes | Luo1 | x | ||
Luomus | Plans on how CT scan/3d model workflow will happen | Luo2 | x | ||
LISI Inst de Agronomia - Univ de Lisboa | LISI Herbarium Digitization Workflow | LIS1 | x | TODO: Contains list of image metadata fields for recommendation | |
RBGE Royal Botanic Garden Edinburgh | RBGE Digitisation Workflows | RBGE1 | x | TODO: Contains OCR workflow for OCR best practices |
RBGE Royal Botanic Garden Edinburgh | RBGE ETL Processes | RBGE2 | x | TODO: Contains list of metadata fields for recommendation |
RBGE Royal Botanic Garden Edinburgh | Developing integrated workflows for the digitisation of herbarium specimens using a modular and scalable approach | Has2012a |||
RBGE Royal Botanic Garden Edinburgh | Data concepts and their relevance for data capture in large scale digitisation of biological collections | Has2012b |||
RBGE Royal Botanic Garden Edinburgh | The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels | Dri2014 |||
NHM, London | Summary of other doc + specimen data to CMS | NHM1 | x | ||
NHM, London | Slide Digitisation - End of day checklist | NHM2 | x | ||
NHM, London | eMesozoic workflow diagram | NHM3 | ( x ) | Needs clarifying, see notes in workflow below |
NHM, London | ALICE Workflow | NHM5 | Understanding this workflow would require some hand-holding ||
NHM, London | Microscope slides digitisation - article | All2019 | ( x ) | Still may have some workflow steps not covered | |
NHM, London | Airless workflow diagram | NHM7 | Needs explanation | ||
NHM, London | Bee types digitisation workflow | NHM8 | ( x ) | TODO: has a workflow for multiple specimens for recommendation reference. May still have some steps that were not understood |
NHM | The Natural History Museum Data Portal | Sco2019 | |||
Meise Botanic Garden | Designing an Herbarium Digitisation Workflow with Built-In Image Quality Management | Hid2020 | |||
Meise Botanic Garden | Image processing, storage diagram | MEISE2 | (Diagram texts are brief but some info can be extracted) | ||
Meise Botanic Garden | Botanical Collections Data Portal - publishing pipelines | MEISE3 | x ? | If understood correctly, this is out of scope for the digitisation process; deals with data publication to the national portal + GBIF |
Museum für Naturkunde Berlin | Workflow diagrams | MfN1 | (Diagram texts are brief but some info can be extracted) | ||
Museum für Naturkunde Berlin | MfN workflow ETL summary | MfN2 | x ||
APPENDIX: Literature with digitisation workflows
Link | Organisation | Name | Ref | Done | Notes |
ICEDIG | Interoperability of Collection Management Systems | Dil2019 | |||
ICEDIG | Quality Management Methodologies for Digitisation Operations | ||||
ICEDIG | Mass-imaging of microscopic and other slides | ||||
ICEDIG | Best practice guidelines for imaging of herbarium specimens | ||||
ICEDIG | State of the art and perspectives on mass imaging of pinned insects | ||||
ICEDIG | State of the art and perspectives on mass imaging of liquid samples | ||||
ICEDIG | State of the art and perspectives on mass imaging of skins and other vertebrate material | ||||
ICEDIG | Methods for Automated Text Digitisation | ||||
ICEDIG | Conceptual design blueprint for the DiSSCo digitization infrastructure | Har2020 | |||
NCSU | Results and insights from the NCSU Insect Museum GigaPan project | ||||
NHM | No specimen left behind: industrial scale digitization of natural history collections | ||||
INHS | InvertNet: a new paradigm for digital access to invertebrate collections | ||||
Swiss Aca of Sci | Handbook on natural history collections management – A collaborative Swiss perspective | ||||
Improved standardization of transcribed digital specimen data | Gro2019 | ||||
Uni Coimbra | A Strategy to digitise natural history collections with limited resources | ||||
Back to the future: A refined single-user photostation for massively scaling herbarium digitization | |||||
NHM | Georeferencing the Natural History Museum's Chinese type collection: of plateaus, pagodas and plants |
APPENDIX: Before ETL workflows
Infra | Step | Action type | Type | Ref | Notes |
Digi station | Fully qualified URI Identifier of the specimen (globally unique persistent identifier) is present as QR-Code on the imaged specimen | Manual (repeated for each specimen) | Identifier | Luo1 | Doc says "barcode" but barcode != qr-code; Best practice is to have the full URI as QR-Code. Luomus has QR-codes. |
Digi station | Barcode is created / scanned | Manual (repeated for each specimen) | Identifier | LIS1 | Most likely internal catalogue number (not fully qualified URI identifier) based on rest of the workflow |
Digi station | Apply barcodes / scan barcodes | Manual (repeated for each specimen) | Identifier | RBGE1 | Unclear if fully qualified URI identifier or internal catalogue number |
Digi station | Before capturing image, specimen data is entered Camera operator enters their details into an online | Manual + Semi-automated (repeated for each specimen) | Specimen data | RBGE1 RBGE2 | |
Digi station | Images are taken in RAW format using CaptureOne software. The barcode on the specimen is scanned by the camera operator and used as the filename for the | Manual (repeated for each specimen) | Image + Identifier | RBGE2 | |
Digi station | The operator selects image(s) and these are processed to TIF format by CaptureOne software. As part of this conversion process the image is cropped to the | Manual (repeated for each specimen) | Image (Transformations) | RBGE2 | |
Digi station | After imaging the dorsal and lateral views, the images have to be rendered (Helicon Focus) and renamed (BardecodeFiler). Then we need to generate a filelist for the dorsal images (command prompt) in order to associate each UID with the correct PTN. After this is done, we remove the PTN from the name of the image (Bulk Rename Utility) and we crop the images to remove the IRN tags and the dead space (Lightroom) 8. Leave to run overnight | Semi-automated (daily / overnight) | Image (Transformations) | NHM8 | |
Digi station | Something called "Syrup" is done after image capture Rename image file with concatenation of scanned specimen barcode and drawer barcode, plus incremental suffix for more than one image of | Automated (presumed) (on-the-fly) | Image + Identifier | NHM3 | Level of automation? |
Digi station | Before capturing image, "System quality control" is done | Automated (on-the-fly?) | Specimen data? (Quality control) | RBGE1 | What is controlled? |
Digi station → CMS | System creates a record for each new barcode and populates the record with data | Automated (on-the-fly?) | Specimen data | RBGE1 |
Digi station → CMS? | The metadata are managed in a MySQL image data management database, and in the image file exif data. The metadata for the original image files are held in one table, and comprise information copied from | Automated? | Image metadata | RBGE2 | Is the mySQL a temporary image metadata repository or also the final one? See doc for exact metadata fields |
Digi station | The camera operator checks to see that there is a pair of images (a RAW & TIF) for each barcode. If either file is missing the images cannot be processed, as the image processing service is expecting both. | Manual (repeated for each specimen) | Image (Quality control) | RBGE2 |
Digi station | Manual checks are done: Look through the list of file names in the final folder - Common errors to check: | Manual (end of day) | Image (Quality control) | NHM2 | |
Digi station | We perform the quality checks; Check that all images look alright | Manual | Image (Quality control) | NHM8 | |
Digi station → Staging area | Image is captured and transferred to dropbox Both the RAW and TIF files are saved onto a network share drive. The folder structure for this includes the camera the operator is using and the operator's username. This is a temporary storage location. | Manual (repeated for each specimen) | Image | RBGE1 RBGE2 | |
Digi station → Staging area | Files manually moved to different folders copy the date folder (with the “images” folder within) in final to: | Manual (end of day) | Image | NHM2 | (Destination seems to be a network drive) |
Digi station → Image archive | Copy the date folder (with the “images” folder within) in final to: Emu-import-dcp_digitisation (This is our back-up area) | Manual (end of day) | Image (Backup) | NHM2 | |
Digi station → ? | Copy the date folder (with the “images” folder within) in final to: DCP-1 - EXTERNAL HARD DRIVE (NOTE: It’s going to be tricky for everyone to save to the hard drive if you’re all leaving at the same time, so you can do this | Manual (end of day) | Image (Backup) | NHM2 | Reason for the external hard drive? |
Digi station | Metadata such as digitiser name/operator is generated and stored at the digitisation station as text file | Automated (on-the-fly) | Image metadata | Luo1 | |
Digi station | Metadata of all images are generated using XnView and a .ipt-template | Semi automated (once a day) | Image metadata | LIS1 | x1 - difference between this and x2 is not clear |
Digi station | Metadata of all images is generated using Limbs digitization software | Semi automated (once a day) | Image metadata | LIS1 | x2 - difference between this and x1 is not clear See doc for exact metadata fields |
Digi station → Backup storage? | Copy images+metadata in current day folder to external drive | Manual (once a day) | Image, Image metadata | LIS1 | Is the external drive for backup purposes? |
Digi station → Staging area | Copy images+metadata to staging area using FileZilla program | Manual (once a day) | Image, Image metadata | LIS1 | |
Digi station → Staging area | Images loaded onto EMu server | ? | Image | NHM3 | Needs more info |
Digi station | Post-Processing can include color corrections and rendering of scale bars | Manual | Image | MfN2 | |
Digi station → Staging area | Manual workflow for 2D imaging on demand: DNG and PNG files are stored in structured file system Manual upload to DAM system after quality check | Manual | Image (+Quality control) | MfN2 | |
Digi station | In case of multi-focus imaging: image acquisition and rendering of multi-focus images are separated steps. | Manual | Image | MfN2 | |
Digi station | Backups are not done at the digitisation station | -- | Infra (Backup) | Luo1 |
Digi station | Specimen data is entered to Excel spreadsheet | Manual (repeated for each specimen) | Specimen data | Luo1 | |
Digi station | Object related metadata: - Mostly metadata are acquired with Excel spreadsheets, which are designed for enabling bulk uploads into the CMS | Manual (repeated for each specimen) | Specimen data? | MfN2 | "Metadata" == data? Not image metadata? |
Digi station → CMS → Image publishing platform | Images are captured and uploaded straight to CMS using Web UI; thumbnails etc are generated by image API; metadata is created and stored; images are moved to image publishing platform | Semi automated (repeated for each image) | Image, Image metadata | Luo1 | The "ETL" parts are done automated and instantaneously without a specific ETL part in the workflow |
Digi station | Raw scans are done using CT Scanner | Manual (repeated for each specimen) | 3d/CT scan | Luo2 | |
Digi station | 3d model is generated from raw CT scans | Manual (repeated for each specimen) | 3d/CT scan | Luo2 | |
Digi station | A smaller scale 3d model is discretized from the model | Manual (repeated for each scan) | 3d/CT scan | Luo2 | |
Digi station → CMS, publishing platform | Small scale 3d model is uploaded straight to CMS using Web UI; thumbnails etc are generated by image API; metadata is created and stored; images and 3d scans are moved to image publishing platform | Semi automated (repeated for each model) | 3d/CT scan | Luo2 | The "ETL" parts are done automated and instantaneously without a specific ETL part in the workflow |
Digi station | Mostly CT images from scientific projects. Processing is mostly done by requesters and/or student helpers. Raw and processed files are stored in the file system and managed by the lab technicians. Upload routines for long-term-archiving and publication are not established yet | Manual (repeated for each specimen) | 3d/CT scan | MfN2 | |
Digi station | Multiple specimens · Give one barcode per specimen. · Make sure it is clear which barcode corresponds to each specimen (written on the barcode, examples: male/female, a/b/c, type etc.) · Image as many times as the specimens, each time with only one UID visible (the other ones reversed) | Manual | Identifier, Image (multi specimen) | NHM8 | |
Digi station → Staging area | Automated workflows for data acquisition with mobile devices (vertebrate collections and assessments): We use the app ODK Collect. Data are uploaded to a central ODK server. | Automated | MfN2 |
APPENDIX: ETL workflows
Infra | Step | Action type | Type | Ref | Notes |
Staging area | System polls dropboxes; starts to execute if new files found | Automated (running background task) | Image | RBGE1 | |
Staging area | Quality control is done Checks include: · Filename - the file name is checked for format and length. It should be the letter E followed by 8 numbers. Any additional images for a particular barcode should be suffixed using _. If a filename does not pass this is returned to an Errors folder which is manually checked by a Digitisation Officer. · Filesize - the size of the file is checked, if it falls outside of the set parameters the file is returned to an Errors folder which is manually checked by a Digitisation Officer. · Image pair - whilst a manual check has been performed by the camera | Automated + Manual | Image (Quality control) | RBGE1 RBGE2 | |
Digi station → staging area | Images and metadata are fetched in real-time or in batches to staging area | Automated (on-the-fly OR daily) | Image, Image metadata | Luo1 | |
Staging area? | This script takes individual images with metadata encoded in the filename and creates a specimen record with appropriate attachments to the taxonomy and location modules. Metadata encoded in format: “UIDBarcode_LocationIRN_TaxonIRN.jpg” | Semi-automated ? | Identifier, Image, Image metadata | All2019 | |
Staging area | Specimen identifier URI is detected and extracted from specimen image and image is named to match the ID and image metadata is updated to contain the specimen ID | Automated (running background task) | Identifier, Image, Image metadata | Luo1 | |
Staging area? | Systems perform all processing steps and deliver two image files (Tiff/Raw and Png). All technical and administrative Metadata related to the images are delivered with a json sidecar file (XML in METS format for library and archival material) | Automated (when?) | Image (Transf) Image metadata | MfN2 | |
Staging area? → CMS | Object related metadata are acquired in different ways. In one case they are delivered together with the images in the json sidecar file and parsed by the database management team (this process is not yet fully established). In most cases object related metadata are acquired in Excel spreadsheets and imported to the respective CMS | Automated Semi-Automated (when?) | Specimen data or Image metadata? | MfN2 | |
Staging area → CMS | Images are attached to records in Specify (CMS) based on catalog numbers in file names | Semi-automated (how often?) | Image | LIS1 | |
Staging area? → Image publishing platform | Automated import of image files and related metadata to the digital asset management system | Automated (when?) | Image | MfN2 | |
Staging area → Image publishing platform | Specify script creates copies of original large image files and creates thumbnails (PNG) to Specify Attachment Server and renames based on UUID; original filename and location is kept in attachment metadata | Semi-automated (how often?) | Image, Specimen data (Transformations) | LIS1 | Originals are tiff files, about 5574x7370 px,8-bit sRGB |
Image publishing platform | Original TIFF files are converted to JPEGs running a Python script that uses ImageMagick library. TIFF images are kept. | Semi-automated (how often?) | Image (Transformations) | LIS1 | |
CMS | Run SQL UPDATE to modify attached TIF files links to point to generated JPEG links instead | Manual (how often?) | Specimen data | LIS1 | |
Staging area | EMu eMesozoic Batch Operation script | ? | ? | NHM3 | Needs more info |
Staging area | Each TIFF is processed through OCR software; the OCR output is recorded as unstructured text to CMS as separate record (not to primary specimen data) A copy is made of the TIF file which is submitted to an OCR pipeline. | Automated | Specimen data (OCR) | RBGE1, RBGE2 | (to what level automated?) |
Staging area → Digi station? → CMS | Batches of records with shared collectors or geography were then transcribed by digitisation staff, using a record set of the data records in the CMS along with an identical set of images presented in the same order in image-viewing software | Semi-automated? (what intervals?) | Specimen data (OCR) | RBGE1 | |
Staging area → Image publishing platform → CMS | Automated workflows for data acquisition with mobile devices (vertebrate collections and assessments): Data are uploaded to a central ODK server. The process for integration into the media repository and the CMS is also automated | Automated? (what intervals?) | Specimen data Image Image metadata? | MfN2 | |
Staging area | Original sized JPG and smaller thumbnails are generated | Automated (running background task) | Image (Transformations) | Luo1 | |
Staging area | Creation of JPG and zoomify files · A high resolution JPG is produced. This is stored in an online accessible · A tiled image is created. This is stored in an online accessible repository and can be viewed on the online catalogue. | Automated (running background task) | Image (Transformations) | RBGE1 RBGE2 | |
Staging area? | Image rotation and cropping using XnConvert | Semi-automated? Automated? (when?) | Image (Transformations) | All2019 | |
Staging area | The metadata for the transformed files produced by the ETL processes are also managed in the MySQL image data management database. Each will have a record of the original file from which it was derived, along with ... | Automated ? | Image metadata | RBGE2 | See doc for exact metadata fields |
Staging area → Image publishing platform | JPG and zoomify files are moved to Image streaming online service | Automated (running background task) | Image | RBGE1 | |
Staging area → Image publishing platform → CMS | Images loaded into EMu multimedia Load images from source folder into EMu Multimedia. For each unique specimen barcode number in the image file name, spawn a new eMesozoic barcode stub record and attach the image(s) to it. At least: location id attached via location barcode; media id via specimen barcode (???) | Automated (presumed) (on-the-fly?) | Image + Image metadata?? | NHM3 | Needs more info on level of automation and details |
Staging area → CMS | Script takes individual images and attaches them to an existing record by matching the UID (NHMUK barcode) in the filename with an existing record in EMu. Metadata encoded in format: “UIDBarcode_suffix.jpg” | Semi-automated? | Data linking | All2019 | |
Staging area | "Sapphire script" - copy image and location to specimen record Search for the specimen number entered (applying search filters). On | Automated (presumed) (on-the-fly?) | Image + Specimen Data | NHM3 | Needs more info on level of automation |
Staging area → Image archive | Original TIFF images are moved to image archive and deleted from staging area | Semi-automated (couple times a week) | Image | Luo1 | Done using command line tools but could (should!) be automated in the future |
Staging area → Image archive | Archive raw and TIFF files | Automated (what intervals?) | Image | RBGE1 | |
Staging area → Image publishing platform | Generated JPG images including thumbnails are moved to image publishing platform | Semi-automated (couple times a week) | Image | Luo1 | Done using command line tools but could (should!) be automated in the future |
Staging area → Image publishing platform | URL of published images and other image metadata is stored to image metadata database | Semi-automated (couple times a week) | Image metadata | Luo1 | Done running a Python script |
Staging area | Backups are not done at staging area | -- | Infra (Backup) | Luo1 | |
Staging area → Image publishing platform | The automated pipeline moves images for publication and download. | Automated | Image, Image metadata | RBGE1 | |
APPENDIX: After ETL workflows
Infra | Step | Action type | Type | Ref | Notes |
Staging area → Backup storage | "At a later stage" images will be copied to INCD cloud service for backup archiving | ? | Image | LIS1 | Possibly semi-automated? |
Image archive → Long-Term Archive | Images are moved from image archive to long-term archive | TBD | Image | Luo1 | Future feature |
CMS | CMS starts to show specimen images once images are in publishing platform and the URLs of the images are in image metadata service | Automated | Data linking | Luo1 | |
CMS | A SOLR index is used to link the image files to the data records for display on our online catalogue | Automated | Data linking | RBGE2 | |
Staging area | The camera operators perform a second check once all of the images should have been processed, to ensure that this has been successful. This uses the same online form as they used prior to processing the images. If any barcodes are showing as unprocessed, the camera operator can resubmit them for processing, or pass the issue on to a Digitisation Officer to see if they can identify the reason for the failure. | Manual | Image | RBGE2 |
CMS | Specimen data is uploaded to the CMS using an Excel spreadsheet | Manual | Specimen data | Luo1 |
CMS | Object related metadata: - Mostly metadata are acquired with Excel spreadsheets, which are designed for enabling bulk uploads into the CMS | Manual | Specimen data? | MfN2 | "Metadata" == data? Not image metadata? |
CMS | Georeferencing, validations etc are done by CMS | Automated | Specimen data (Quality control) | Luo1 | |
→ Backup storage | Images, 3d models are automatically backed up to different cloud server environment Databases (specimen, image metadata) are backed up nightly to tape | Automated | Infra (Backup) | Luo1, Luo2 | |
Image archive | The archive folders are included in regular nightly backups. These are written to tape and taken offline, this is a manual process. They are also manually copied onto external hard drives to provide a backup and an easily accessible version of the data once it has been taken offline. | Manual | Infra (Backup) | RBGE2 | |
CMS | OCR raw data was used as an aid to enhancing minimally databased records, see Dri2014 | TODO | Specimen data (OCR) | RBGE2 TODO: Dri2014 |
CMS | OCR raw data is picked up as part of the SOLR index and is displayed on our online | Automated | Specimen data (OCR) | RBGE2 | |
Digi station | Manual clean-up: Once all the images have been saved and backed up you can empty the “Processing” folders: Empty the following folders | Manual (end of day) | Infra (Clean up) | NHM2 |
Digi station | Scripts, developed in-house for the 2015 pilot, are also currently used for a series of processes known as Flows: (1) bulk transfer of image files from the imaging PC to the data managers, and (2) after ingest into EMu the deletion of the original image files on the imaging PC i.e. clear-down process (Flows; Workflow 3) | Semi-automated | Infra (Clean up) | All2019 | |
? | Lu to import “drawer locations” spreadsheet (Locations module) Lu to import “specimen locations” spreadsheet (Catalogue module) Lu to import condition spreadsheet (Condition module) Lu to import treatment/storage spreadsheet (Processes module) Curators to resolve flagged merges | ? (Every three months) | ? | NHM7 | ??? |
-- | At present completely decentralised, following the (niche-)standards of the respective community | -- | Specimen data - Analytical | MfN2 | |
-- | The RBGE uses the CETAF stable identifiers to track material coming from the Herbarium and the Living collection via a molecular collection management system called EDNA. This is an in-house developed system that is under review. | -- | Specimen data - Analytical/chemical/molecular data | RBGE2 |