DiSSCo Prepare 3.2.2 Best practice standardised Extract, Transform and Load (ETL) procedures
Version 0.0.1 / 2022-02-04
PLACEHOLDER: Authors
PLACEHOLDER: Author affiliation
This document is a working draft and is not going to be published in its current format. Final format will be a report and a living best practices documentation on an open platform that allows people to contribute to the document.
Abstract
DiSSCo (Distributed System of Scientific Collections) is a pan-European Research Infrastructure (RI) that among other things aims to create a digitisation infrastructure for natural science collections.
Table of Contents
3.4 Identified Workflow Procedures
Table 2: Template of Best Practices
Table 3: Explanations of the Template of Best Practices
4.1 Infrastructure recommendations
4.2 Organisational recommendations
4.3 Identifier recommendations
4.4 Image transformation recommendations
4.5 Specimen data recommendations
4.6 Quality control recommendations
4.7 Media metadata recommendations
4.8 OCR recommendations
4.9 Crowdsourcing recommendations
4.10 CT scans / 3D model recommendations
4.11 Analytical/chemical/molecular data recommendations
APPENDIX: REVISION HISTORY
APPENDIX: IMPLEMENTATION DEMONSTRATIONS
APPENDIX: Workflows / documentation provided for this WP
APPENDIX: Literature with digitisation workflows
APPENDIX: Before ETL workflows
APPENDIX: After ETL workflows
1. Introduction
DiSSCo (Distributed System of Scientific Collections, https://www.dissco.eu/) is a pan-European Research Infrastructure (RI) that, among other things, aims to create a digitisation infrastructure for natural science collections. An overview of the infrastructure is described in its conceptual design blueprint (Har2020). DiSSCo is a new world-class research infrastructure for natural science collections. It is assumed that up to 40 million specimens may need to be digitised each year so that a significant part of the important public natural history collections can be digitised in a foreseeable time. Each digitised specimen can generate up to hundreds of megabytes of data, produced at distributed digitisation facilities across Europe. The large amount of data generated at the digitisation stations goes through various Extract, Transform, Load (ETL) procedures before it is shown in a Collection Management System (CMS) and/or data sharing/publication portals. ETL procedures are critical to the digitisation process. Therefore, it is necessary to provide best practices on standardised ETL procedures to facilitate and optimise the digitisation process at DiSSCo institutions.
This project report was written as a formal Milestone (M3.6) of the DiSSCo Prepare Project (https://www.dissco.eu/dissco-prepare/). The following text is the formal description (Subtask 3.2.2) from the DiSSCo Prepare project’s Description of the Action (workplan):
Subtask 3.2.2 Standardised Extract Transform and Load (ETL) procedures. Handling metadata and images during digitisation involves many transformations, as information is modified and held in various temporary (staging) environments, before reaching the institutional collection management Systems (CMS) and being made accessible through public portals.
This subtask will document best practices for these processes, where necessary including the computational workflows required to support data transformations.
This Best Practice Document (BPD) will help enhance natural science collection specimen digitisation capacity across DiSSCo partners and the DiSSCo national nodes. Together with other deliverables of WP3.2, this document will form a Community Digitisation Manual. This work was carried out by the following task partners:
- Natural History Museum, London (NHM)
- Finnish Museum of Natural History (Luomus)
- Meise Botanic Garden (MeiseBG)
- Museum für Naturkunde, Berlin (MfN)
- Royal Botanic Garden, Edinburgh (RGBE)
- Universidade de Lisboa (ULISBOA)
In this work, we first give an overview of the goal and scope of this Best Practice Document (BPD). Second, digitisation workflows from partner institutions and other related work were reviewed to identify the potential ETL procedures in each part of the workflow. Third, we compiled a list of best practice recommendations. The work ends with a discussion chapter.
2. Overview of the Work
2.1 Scope
Extract, Transform, Load (ETL) is an umbrella concept that can mean moving any data from place A to place B. The term ETL is most commonly used in the context of moving data from multiple databases into a single data warehouse for analytics. In the context of a natural history collection digitisation workflow, the ETL processes can be considered to start from moving data forward from the digitisation stations and to end at the point where the data is in the Collection Management System (CMS) and/or data sharing/publication portals. After the ETL process, there may be many further steps (handled, for example, by the CMS) concerning data validation, cleaning, annotation, crowdsourcing, using AI-based methods for enriching the data and finally sharing it to third-party platforms such as the Global Biodiversity Information Facility (GBIF), to name only a few. The aforementioned steps are not covered by this Best Practice Document (BPD), except for AI-based methods, which may be applied during data transformation in the ETL procedures. Long-term archiving may happen as part of the initial ETL process or after it; we have included it in this BPD as it is an important data transformation. Figure 1 shows the generalised data flow and the scope of the BPD on the ETL procedures in the digitisation process.
Figure 1: Generalised view of how information from physical specimens can reach data users as a result of digitisation. (OCR=Optical Character Recognition; AI=Artificial Intelligence; ETL=Extract-Transform-Load procedures; CMS=Collection Management System; API=Application Programming Interface; GBIF=Global Biodiversity Information Facility; IIIF=International Image Interoperability Framework; Long-term archiving=medium where data is stored "forever")
The level of specimen digitisation varies between institutions. The variation can stem from:
- Collection types (insects, herbarium sheets, mosses, microscope slides, fossils, rocks, ...)
- Collection sizes (from a few specimens to millions of individual specimens)
- Digitisation media (textual data, images, CT scans/3D models, DNA barcodes, ...)
- Organisational maturity (from "disorganised" to higher levels of institutional organisation)
- Organisation size (many teams vs. individual people)
- Technical advancement (from manual to semi-automated to almost fully automated, and from human labour to AI- and robotics-based methods)
Different digitisation levels demand different approaches. For example, for massive insect collections a high level of automation is needed, but setting up such an infrastructure for a small rock collection would not be ideal. Thus, it is not possible to propose a single standardised procedure. Instead, this BPD lists a number of recommendations. Each party can use the recommendations to evaluate whether they apply to their particular digitisation projects.
Furthermore, since institutions operate inside very different infrastructures, this BPD does not recommend any particular software, service providers or other concrete methods for implementing the recommendations. The recommendations should be considered goals that institutions starting to set up or improve their digitisation infrastructure should try to meet.
2.2 Audience
The main target audience of the Community Digitisation Manual has been agreed to be institutions that are at the beginning of building up their digitisation process. The recommendations in this BPD are categorised according to their level of advancement, from very basic/must-have recommendations to more advanced recommendations. To be able to implement the recommendations, the organisation must have a certain level of technological capacity and resources (servers, infrastructure): for example, a single individual cannot successfully implement the recommendations of this BPD. More details of the agreed target audience and the ALA Digitisation Maturity Model are in DiSSCo Prepare Milestone 3.5 TODO LINK.
The initial target audience for the Community Digitisation Manual was agreed to be organisations at ALA Digitisation Maturity Levels 1 and 2. At this stage, Maturity Level 0 was considered out of scope, because organisations at that level would require detailed guidance and individualised support.
2.3 BPD Template
To assess the best practices in this work, we use a template defined by Alwazae et al. (Alw2015) to formalise the goals and maturity of this BPD, as shown in Table 1.
Table 1. Overview of this BPD based on the template defined by Alwazae et al. (Alw2015).
Summary | This BP helps implement Extract-Transform-Load (ETL) procedures for data from digitisation stations to their final publishing platforms and describes how the ETL procedures should fit into the overall DiSSCo infrastructure. |
Goal | Applying the BP ensures data is not lost, improves efficiency and makes the data more interoperable. |
Means | Successfully applying this BP requires IT specialists with skills in (1) programming, (2) server administration, (3) image processing AND availability of cloud based servers and sizeable storage capacity. |
Cost | Technology costs: Institutions may have access to "free" services provided by academic research infrastructures nationally or internationally, or their parent university or other organisation may be able to provide the needed technology. However, someone ultimately has to pay for the needed resources. In terms of computation power, a minimal viable digitisation ETL process does not incur great costs. Use of advanced methods like OCR or AI techniques requires more computation resources. Storing and long-term archiving of digital material (images, 3D models) and moving the material in and out of storage can be very costly, up to tens of thousands of EUR/year. TODO: Ask partners about their figures. Personnel costs: TODO |
Barriers | TODO: Obstacles or problems that may occur before, during, and after applying the BP. |
Barrier Management | TODO: Procedures to follow if certain obstacles or problems are encountered. |
Acceptability | This document has not yet been reviewed by domain experts and it has not yet been assessed whether it helps to resolve the problem addressed by the BP. / 2022-03 |
Usability | TODO: Ask for feedback to which degree this BP is easy to use. |
Comprehensiveness | This document does not describe a comprehensive BP. Instead, various individual recommendations are listed. We have selected the most important recommendations for institutions that are at the beginning of building up their digitisation process. |
Justification | This BP has not yet been used by any institution. / 2022-03 |
Prescriptiveness | This BP offers concrete proposals for solving the problems; however, the underlying infrastructures vary so much that only examples of actual implementations can be provided. |
Coherence | This BP does not form a coherent unit: certain parts only apply to certain types of digitisation efforts. |
Consistency | This BP is (TODO WILL BE) consistent with existing knowledge and the vocabulary used in the natural science collections digitisation sector and knowledge domain, as leading experts in the field have participated in and reviewed the living document / Not yet done at the moment 2022-02 |
Granularity | TODO: The degree to which the BPD is appropriately detailed |
Adaptability | TODO: The degree to which the BP can be easily modified and adapted to other situations |
Integration | TODO: The degree to which the BP is integrated with other BPs and KM components |
Demonstration of Success | See end of this document / None so far / 2022-02 |
3. Review of Digitisation Workflows
To identify the ETL procedures in the digitisation processes, we reviewed digitisation workflows from available publications and reports, and from project partners' documents. An extensive list of the related literature is in the Appendix. We extracted steps/procedures from those workflows and sorted them into three categories in relation to the ETL procedures (before, within, and after). These lists can be seen in the Appendix.
The resulting lists of digitisation steps are not recommended workflows, or even functional workflows; rather, they are a union of all potential workflows in the digitisation of natural history collections. They are a tool for mapping the landscape so that BP recommendations on ETL can be created from the different steps/processes/procedures.
3.1 Infrastructure
Infrastructure is one of the most fundamental parts of the digitisation process. Setting it up is the first step of building the process and requires careful planning, as it affects all subsequent steps of the data flow. Therefore, good practice should be followed when setting up the digitisation infrastructure. Within the scope of the ETL processes, the infrastructure involves the local digitisation stations, remote servers, the CMS, the data backup system, etc. Generally, there are the following components (a minimal configuration sketch follows the list):
- Local storage at the digitisation site, to which the digitisation line/other hardware connects - typically a local machine (not a server)
- Staging area, to which raw digitised material is transferred from the local machine for processing - NOT meant to store data for longer periods of time - server based
- Image archive, to which large original TIFF (etc.) files are stored - cloud based
- Publishing platform file storage (image server), to which ready material is transferred so that it is accessible from the web - cloud based
- CMS/data repository (relational or other database), to which specimen data, image metadata etc. are stored - cloud based
- Backup storage, to which resources from the image archive and the publishing platform are periodically backed up
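The components above can be captured in a simple configuration that ETL scripts or deployment tooling can read. The following is a minimal sketch in Python; all component names, paths and endpoints are hypothetical examples and every institution will differ.

```python
# Minimal sketch of the storage components as configuration.
# All paths and endpoints below are hypothetical examples.
from dataclasses import dataclass

@dataclass
class StorageComponent:
    name: str
    location: str      # path, mount point or service URL
    long_term: bool    # is data meant to stay here permanently?
    backed_up: bool    # is this component covered by backups?

INFRASTRUCTURE = [
    StorageComponent("local_storage",  "D:/digitisation/incoming",                 False, False),
    StorageComponent("staging_area",   "/srv/staging",                             False, False),
    StorageComponent("image_archive",  "https://object-store.example.org/archive", True,  True),
    StorageComponent("publishing",     "https://images.example.org",               True,  True),
    StorageComponent("cms_database",   "postgresql://cms.example.org/specimens",   True,  True),
    StorageComponent("backup_storage", "https://backup.example.org",               True,  False),
]
```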
3.2 Organisational Models
Different institutions have different organisational structures for digitisation. Some may have their own in-house IT development staff; some may use outsourced digitisation contractors. For example, at the Finnish Museum of Natural History (Luomus), the digitisation workflow involves the digitisation team and three IT staff: a digitisation software developer/manager, a server administrator, and a data manager (Luo1). At Herbarium LISU there are digitisers and one IT manager (LIS1). At the Natural History Museum, London, digitisers and a data management team are involved in the digitisation workflows. Each model has its advantages and disadvantages, which should be carefully considered, based on the institution's own situation, before starting digitisation activities. We give BP recommendations on this in the next chapter.
3.3 Data Models
Data models vary between the different workflows.
Luo1: Specimen data in the CMS does not contain information about which media exist for a specimen (this information is available in the search engine (Elasticsearch) and the FinBIF national data warehouse); media are queried in real time from the Image-API using the specimen id.
LIS1: Links to media are stored with the specimen data.
TODO: Include NHM London work (Sco2019) and expand the text
3.4 Identified Workflow Procedures
3.4.1 Before ETL Workflows
The before-ETL workflows take place mostly at the digitisation stations, as shown in the Appendix "Before ETL workflows". Most of these workflows relate to barcoding of the specimens, imaging, quality control, image processing, metadata generation and upload. They also involve data transmission from the digitisation station to the staging area, CMS, image publishing platform, and backup storage. The objects handled are identifiers, images, image metadata, and specimen data. The actions in these workflows are manual, semi-automated or automated; some of them still require manual work.
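Many of the before-ETL steps produce image metadata (digitiser name, station, timestamp) alongside the images, for example as text or sidecar files (Luo1, LIS1, MfN2). The sketch below shows one possible way to write such a sidecar file in Python; the field names are illustrative assumptions, not a standard.

```python
# Sketch of writing an image metadata sidecar file at the digitisation
# station. Field names are illustrative assumptions, not a standard.
import json
from datetime import datetime, timezone
from pathlib import Path

def write_sidecar(image_path: str, digitiser: str, station_id: str) -> Path:
    image = Path(image_path)
    metadata = {
        "image_file": image.name,
        "digitiser": digitiser,
        "station_id": station_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = image.with_suffix(".json")  # e.g. E00123456.tif -> E00123456.json
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar

# Example: write_sidecar("E00123456.tif", digitiser="J. Doe", station_id="herbarium-line-1")
```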
3.4.2 ETL Workflows
The Appendix "ETL workflows" lists examples of ETL workflows. ETL workflows take place mostly in the staging area. They perform image polling, data quality control, file renaming, data export and publishing, image conversion, and data backup. Depending on the workflow, the data goes to the CMS, the image publishing platform or the image archive. Most of the workflows here are semi-automated or fully automated. Automating these workflows as far as possible increases the efficiency of the digitisation process and decreases the risk of human mistakes.
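As an illustration, a single staging-area step of this kind could poll a folder for newly arrived TIFFs, derive a publishing JPEG and move the original towards the archive. The sketch below uses local directories to stand in for the staging area, publishing platform and image archive, and uses the Pillow library; it is a minimal sketch, not any partner's actual pipeline.

```python
# Minimal sketch of one automated staging-area ETL step. The directories
# are placeholders for the staging area, publishing platform and archive.
from pathlib import Path
from shutil import move

from PIL import Image  # pip install Pillow

STAGING = Path("/srv/staging/incoming")
PUBLISH = Path("/srv/publish/jpg")
ARCHIVE = Path("/srv/archive/tiff")

def process_new_tiffs() -> None:
    """Convert newly arrived TIFFs to JPEG for publishing, then archive the originals."""
    PUBLISH.mkdir(parents=True, exist_ok=True)
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for tiff in STAGING.glob("*.tif"):
        jpeg = PUBLISH / (tiff.stem + ".jpg")
        with Image.open(tiff) as img:
            img.convert("RGB").save(jpeg, "JPEG", quality=90)
        move(str(tiff), str(ARCHIVE / tiff.name))  # original TIFF leaves the staging area

if __name__ == "__main__":
    process_new_tiffs()  # run e.g. as a scheduled background task
```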
3.4.3 After ETL Workflows
The Appendix "After ETL workflows" lists all the after-ETL workflows we found. They take place mostly in the CMS, the image archive, and the backup storage. They relate to data backup, data linkage, and data enrichment. The actions in these workflows are manual, semi-automated or automated; some of them still require manual work.
4. BEST PRACTICES
This section lists the Best Practice (BP) recommendations we have been able to determine from the review of digitisation workflows done in Chapter 3. The template for the recommendations is given in Table 2.
Table 2: Template of Best Practices
Id | EXAMPLE1 |
Level | BASIC | ADVANCED | STATE-OF-ART |
Use case | As xxx I want to xxx so that I can xxx |
Best practice recommendation | Procedure to follow/task to accomplish that fulfils the use case |
Discussion | Rationale behind the recommendation |
Implementation example | One or a few references/examples of how the recommendation has been implemented in practice, if applicable |
References | Link, Ref |
The explanations of the items in the BP template are in the Table 3.
Table 3: Explanations of the Template of Best Practices
Id | To make it easier to communicate about an individual recommendation |
Level | How demanding the recommendation is. BASIC: a fundamental goal that everyone doing digitisation should try to fulfil. ADVANCED: next steps in automating and improving performance. STATE-OF-ART: new, upcoming techniques that should perhaps not be attempted at first |
Use case | A use case which acts as the motivation for the recommendation |
Best practice recommendation | Procedure to follow/task to accomplish that fulfils the use case |
Discussion | Discussion of the rationale behind the recommendation |
Implementation example | One or a few references/examples of how the recommendation has been implemented in practice, if applicable |
References | Links to external documentation or publication which is the source of the use case or recommendation and/or to implementation example |
4.1 Infrastructure recommendations
Id | INFRA1 |
Level | BASIC (+STATE-OF-ART) |
Use case | As a digitisation manager I want no significant data loss to occur and a reliable system so that the digitisation process is not delayed |
Best practice recommendation | Your digitisation/ETL/publishing/CMS infrastructure should generally have the following components: · Local storage at the digitisation site, to which the digitisation line/other hardware connects - typically a local machine (not a server) · Staging area, to which raw digitised material is transferred from the local machine for processing - NOT meant to store data for longer periods of time - server based · Image archive, to which large original TIFF (etc.) files are stored - cloud based · Publishing platform file storage (image server), to which ready material is transferred so that it is accessible from the web - cloud based · CMS/data repository (relational or other database), to which specimen data, image metadata etc. are stored - cloud based · Backup storage, to which resources from the image archive and the publishing platform are periodically backed up (see recommendation INFRA2) · STATE-OF-ART: Long-term archive, to which all data is eventually replicated to be stored "forever" (see recommendation INFRA3) |
Discussion | · Local storage: Data is not meant to stay for a long time at the local digitisation station; it should be moved forward daily or at least weekly. Loss of these stations does not incur significant data loss, but setting up the environment again may take a long time and mean delays in the digitisation process. A Docker (etc.) image-based environment is recommended so that it can quickly be set up on any new local computer (this may not always be possible because of software licenses etc.). · Staging area: ETL procedures may require computing power, which is best provided by a server / computing cluster rather than the local machine; the procedures are automated and software driven, so ease of deploying new versions is a benefit. A state-of-the-art environment would, for example, be a Kubernetes container cluster to which the different ETL process steps are deployed as individual services/pods that co-operate to provide the ETL procedure. A test environment should exist where software is tested before putting it into production. · Image archive: The image archive should be cloud based to prevent data loss. Hard disk failures are common, which can be mitigated by running a RAID disk server. However, we do not recommend that institutions run their own disk servers or any other servers, as cloud-based services are more cost efficient, professionally managed, and data loss is almost impossible (except as a result of human error - so backups are still needed). It is a good idea to separate the live publishing server data storage (containing smaller JPGs etc.) from the original raw data (TIFF etc.). This allows, for example, using faster disks for publishing. Furthermore, as data in the image archive is not needed often, it does not need to be accessible from the internet. It can, for example, be object storage instead of a conventional file system (a minimal object-storage upload sketch follows this table). · Publishing platform file storage: Here uptime and performance are important, as is prevention of data loss (which causes downtime). We recommend a cloud-based service for these reasons. · CMS/data repository: Data loss in your CMS database would be catastrophic. It needs to be professionally administered and backed up; a cloud-based solution is a must. Databases contain text and do not typically take much space. Regular backups should be done in a professional manner. · Backup storage: Even if the original data is located on cloud-based servers, data loss can occur as a result of human error. It can be problematic to find a second place large enough for your biggest data: finding a suitable place for the image archive alone can be difficult, and the backup should be in a second location, as having the data twice on the same service does not quite fit the need. If no other solution can be found, the image archive and backup storage can reside in the same service, which at least helps in case of accidental deletion by a human. · Long-term archive: The LTA would be a third place where your data resides. It does not always fulfil the function of backup storage, as data is stored in the LTA in formats that are designed to be ever-lasting and may be modified as a result. It might not be easy to recover data from the LTA, as getting large amounts of data out is not typically what LTAs are designed for. An LTA is almost impossible to implement by your own institution, so you should look for research infrastructures that can provide the service for you. We have marked the LTA as "STATE-OF-ART" (very demanding) on this BPD's three-level scale; it is not something you should try to set up first. |
Implementation example | Luomus: · Local storage: Helsinki University IT Centre provides local workstations and administrates security, network, user accounts etc. · Staging area: Finnish IT Centre for Science (CSC) provides virtual servers (cPouta; OpenStack based) · Image archive: CSC research data storage service (IDA) - for even larger 3D scans in the future, CSC object storage (Allas), providing space in petabytes and not based on a conventional file system · Publishing platform file storage: CSC virtual server mounted disk (cPouta; OpenStack based) · CMS/data repository: Helsinki University IT Centre provided Oracle database (running on their OpenStack based virtual server environment) · Backup storage: For publishing platform images: Helsinki University provided disk; for image archive: none so far · Long-term archive: Not yet implemented; will be at the CSC provided national service (Digital Preservation Service (DPS)) |
References | TODO |
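As noted in the discussion above, the image archive does not need to be a conventional file system and can be object storage instead. The sketch below shows how an original TIFF could be pushed to an S3-compatible object store using boto3; the endpoint URL, bucket name and key layout are assumptions, and credentials are expected to come from the environment.

```python
# Sketch of archiving an original TIFF to an S3-compatible object store.
# Endpoint, bucket and key layout are illustrative assumptions; credentials
# are read from the environment/AWS config by boto3.
import boto3  # pip install boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.org")

def archive_original(local_path: str, specimen_id: str) -> None:
    # Key the object by the specimen identifier so it can be located later.
    s3.upload_file(local_path, "image-archive", f"originals/{specimen_id}.tif")

# Example: archive_original("/srv/staging/E00123456.tif", "E00123456")
```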
Id | INFRA2 |
Level | BASIC |
Use case | TODO |
Best practice recommendation | TODO: Backups |
Discussion | TODO |
Implementation example | TODO |
References | TODO |
Id | INFRA3 |
Level | BASIC |
Use case | TODO: Long-term archiving (LTA) |
Best practice recommendation | TODO |
Discussion | TODO |
Implementation example | TODO |
References | TODO |
Id | INFRA4 |
Level | BASIC |
Use case | TODO |
Best practice recommendation | Do not use SSD disks on local servers / staging areas |
Discussion | At digitisation stations, high volumes of data are constantly coming in and then being deleted. SSDs can sustain only a limited number of write/erase cycles. Use traditional hard disks. |
Implementation example | TODO |
References | TODO |
TODO: information security recommendations...
TODO: clean-up of digitisation stations
4.2 Organisational recommendations
Id | O1, O2 |
Level | BASIC |
Use case | As a museum director I want to use limited monetary resources efficiently so that I can provide the best value to society. |
Best practice recommendation | O1: Automate recurrent routine tasks as much as possible as part of the ETL process. O2: Employ/acquire one or a few software developers instead of adding more digitisation staff to speed up digitisation. |
Discussion | Software development is expensive, but spending development resources on automating tasks will eventually save money by reducing staff costs (or allowing those staff to be used more efficiently). |
Implementation example | Instead of having staff manually create thumbnails with an image editor, develop an image service that does the job; use existing image libraries (such as ImageMagick). A sketch follows this table. |
References | All2019; TODO |
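As a concrete illustration of O1 and the implementation example above, a thumbnail step can be a thin wrapper around an existing image tool. The sketch below shells out to ImageMagick and assumes it is installed and available on the PATH ("magick" on ImageMagick 7, "convert" on older installations); the sizes and paths are arbitrary examples.

```python
# Sketch of automated thumbnail creation using ImageMagick (O1/O2).
# Assumes the "magick" command (ImageMagick 7) is on the PATH; older
# installations use "convert" instead.
import subprocess
from pathlib import Path

def make_thumbnail(source: str, target_dir: str, size: str = "256x256") -> Path:
    target = Path(target_dir) / (Path(source).stem + "_thumb.jpg")
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["magick", source, "-thumbnail", size, str(target)],
        check=True,  # raise if ImageMagick reports an error
    )
    return target

# Example: make_thumbnail("E00123456.tif", "/srv/publish/thumbs")
```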
Id | O3 (TODO - subjective - needs discussion) |
Level | BASIC |
Use case | As a digitisation manager I want to prioritise digitisation efforts based on scientific criteria rather than existing procedures so that I can provide the information that is most needed by research |
Best practice recommendation | O3: Maintain sufficient in-house skills in IT (software development and server administration) |
Discussion | TODO: More subjective than most recommendations. Digitisation is a continuously changing field where advances are made all the time. New kinds of digitisation projects start and end as digitisation moves to different types of collections. This means that changes to existing processes are often needed. Buying services from an outsourced party may not be flexible enough. On the other hand, should an excellent partner exist, they may be able to keep up with technological advancement better than a small institution's own IT staff. Ordering software and services from an outside partner also requires IT skills and knowledge, so that you can specify and explain to the technical people exactly what service you need. TODO: Weigh the pros and cons of the approaches |
Implementation example | TODO |
References | TODO |
4.3 Identifier recommendations
PLACEHOLDER
ID1: OpenDS identifier minting
TODO: multiple specimens - NHM8 has a workflow
4.4 Image transformation recommendations
PLACEHOLDER
jpeg originals
thumbnails (jpeg or png)
zoomify files
raw data is archived (tiffs etc)
4.5 Specimen data recommendations
Id | DD1, DD2 |
Level | ADVANCED |
Use case | As a researcher I want to know whether data is reliable/complete so that I can determine whether it can be included in my research. |
Best practice recommendation | DD1: When data is extracted from the digitisation platform to the CMS, make sure the distinction between fields marked as empty/missing and fields not yet databased is not lost in transition or mixed up. DD2: If OCR is applied during the ETL process, the CMS should support marking a data field as "automatically filled" and the ETL process should make sure to record this information. |
Discussion | A data field value can have one of the following statuses (TODO: needs more work; a sketch of these statuses as an enumeration follows this table): 1. absent: the information was not documented at the time of the collection event and cannot be resolved later 2. unknown: the information is documented but is not yet databased 3. unknown:missing: the information would have been databased but is absent 4. unknown:indecipherable: the information appears to be present but failed to be captured 5. automatically filled: the information has been databased using automated methods (OCR) but not yet cleaned/verified by a human 6. default: the information is present and has no known problems 7. erroneous: the information is present but contains errors / has been marked as unreliable by a human 8. unknown:withheld: the information is databased but has been withheld by the provider (Note: not a factor for ETL processes; this is a data publishing problem) |
Implementation example | Not known to be fully implemented? TODO |
References |
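The field-value statuses listed in the discussion can be carried through the ETL process as an explicit flag next to each value, so that "empty", "not yet databased" and "automatically filled" are never conflated (DD1, DD2). The sketch below encodes the statuses from the table as a Python enumeration; the record structure is an illustrative assumption.

```python
# Sketch of carrying the DD1/DD2 field-value status through the ETL process.
# Status names follow the list in the discussion above; the record layout
# is an illustrative assumption.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FieldStatus(Enum):
    ABSENT = "absent"                                  # never documented at collection time
    UNKNOWN = "unknown"                                # documented but not yet databased
    UNKNOWN_MISSING = "unknown:missing"
    UNKNOWN_INDECIPHERABLE = "unknown:indecipherable"
    AUTOMATICALLY_FILLED = "automatically filled"      # e.g. OCR output, unverified (DD2)
    DEFAULT = "default"
    ERRONEOUS = "erroneous"
    UNKNOWN_WITHHELD = "unknown:withheld"

@dataclass
class FieldValue:
    value: Optional[str]
    status: FieldStatus

record = {
    "collector": FieldValue("A. Collector", FieldStatus.DEFAULT),
    "locality": FieldValue("Helsinki", FieldStatus.AUTOMATICALLY_FILLED),  # filled by OCR
    "coordinates": FieldValue(None, FieldStatus.UNKNOWN),  # on the label, not yet databased
}
```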
PLACEHOLDER
4.6 Quality control recommendations
PLACEHOLDER
4.7 Media metadata recommendations
PLACEHOLDER
What fields SHOULD be present in the Metadata
What fields SHOULD NOT be present (if any — for occurrence images it is important to remove coordinate information for sensitive species, but for digitisation stations there is no information that could not be shared?)
4.8 OCR recommendations
PLACEHOLDER
4.9 Crowdsourcing recommendations
PLACEHOLDER
4.10 CT scans / 3D model recommendations
PLACEHOLDER
4.11 Analytical/chemical/molecular data recommendations
PLACEHOLDER
REFERENCES
All2019: Allan L et al. (2019) Digitisation using Automated File Renaming and Processing. Microscope Slides. (TODO PUBLISHED?)
Alw2015: Alwazae M., Perjons E, & Johannesson P (2015) Applying a Template for Best Practice Documentation. Procedia Computer Science 72 (2015) 252 – 260. https://doi.org/10.1016/j.procs.2015.12.138
Dil2019: Dillen M, Groom Q, & Hardisty A. (2019). Interoperability of Collection Management Systems. Zenodo. https://doi.org/10.5281/zenodo.3361598
Dri2014: Drinkwater R, Cubey R, Haston E (2014) The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels. PhytoKeys 38: 15-30. https://doi.org/10.3897/phytokeys.38.7168
Gro2019: Groom Q et al. (2019) Improved standardization of transcribed digital specimen data. Database, Volume 2019, 2019, baz129. https://doi.org/10.1093/database/baz129
Has2012a: Haston E, Cubey R, Pullan M, Atkins H, Harris D (2012) Developing integrated workflows for the digitisation of herbarium specimens using a modular and scalable approach. ZooKeys 209: 93-102. https://doi.org/10.3897/zookeys.209.3121
Has2012b: Haston E, Cubey R, & Harris D J (2012) Data concepts and their relevance for data capture in large scale digitisation of biological collections. IJHAC, Volume 6, Issue 1-2. https://doi.org/10.3366/ijhac.2012.0042
Har2020: Hardisty A, Saarenmaa H, Casino A, Dillen M, Gödderz K, Groom Q, Hardy H, Koureas D, Nieva de la Hidalga A, Paul DL, Runnel V, Vermeersch X, van Walsum M, Willemse L (2020) Conceptual design blueprint for the DiSSCo digitization infrastructure - DELIVERABLE D8.1. Research Ideas and Outcomes 6: e54280. https://doi.org/10.3897/rio.6.e54280
Hid2020: Nieva de la Hidalga A, Rosin PL, Sun X, Bogaerts A, De Meeter N, De Smedt S, Strack van Schijndel M, Van Wambeke P, Groom Q (2020) Designing an Herbarium Digitisation Workflow with Built-In Image Quality Management. Biodiversity Data Journal 8: e47051. https://doi.org/10.3897/BDJ.8.e47051
Sco2019: Scott B, Baker, E, Woodburn M, Vincent S, Hardy H, Smith V S (2019) The Natural History Museum Data Portal, Database, Volume 2019, 2019, baz038, https://doi.org/10.1093/database/baz038
APPENDIX: REVISION HISTORY
TODO: Flatten initial version history to first published version after release
Version / date | Author(s) | Description |
Version 0.1 / 2022-02-07 | Esko Piirainen | Initial template, compilation of different workflows, skeletal for best practices |
APPENDIX: REVIEW HISTORY
Review date | Reviewed version | Reviewer | Notes |
APPENDIX: IMPLEMENTATION DEMONSTRATIONS
Template for reporting how the BP was applied in an organisation
Implementing organization | Name of org, contact person, contact info |
Implementation time | Start date, end date |
Implementation cost | How many person-months the implementation took |
Experiences and feedback | |
Measurements | See Appendix: Measurement. Report measurable improvements in performance |
The BP has not currently been demonstrated in practice.
APPENDIX: MEASUREMENT
PLACEHOLDER: Indicators for measuring the quality and performance of the BP
APPENDIX: Workflows / documentation provided for this WP
Link | Organisation | Desc | Ref | Done | Notes |
Luomus | Workflow for insect-line mass digitisation process Workflow for non-mass digitisation processes | Luo1 | x | ||
Luomus | Plans on how CT scan/3d model workflow will happen | Luo2 | x | ||
LISI Inst de Agronomia - Univ de Lisboa | LISI Herbarium Digitization Workflow | LIS1 | x | TODO: Contains list of image metadata fields for recommendation | |
RBGE Royal Botanic Garden Edinburgh | RBGE Digitisation Workflows | RBGE1 | x | TODO: Contains OCR workflow for OCR best practices |
RBGE Royal Botanic Garden Edinburgh | RBGE ETL Processes | RBGE2 | x | TODO: Contains list of metadata fields for recommendation |
RBGE Royal Botanic Garden Edinburgh | Developing integrated workflows for the digitisation of herbarium specimens using a modular and scalable approach | Has2012a |||
RBGE Royal Botanic Garden Edinburgh | Data concepts and their relevance for data capture in large scale digitisation of biological collections | Has2012b |||
RBGE Royal Botanic Garden Edinburgh | The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels | Dri2014 |||
NHM, London | Summary of other doc + specimen data to CMS | NHM1 | x | ||
NHM, London | Slide Digitisation - End of day checklist | NHM2 | x | ||
NHM, London | eMesozoic workflow diagram | NHM3 | ( x ) | Needs clarifying, see notes in workflow below |
NHM, London | ALICE Workflow | NHM5 | Understanding this workflow would require some hand-holding ||
NHM, London | Microscope slides digitisation - article | All2019 | ( x ) | Still may have some workflow steps not covered | |
NHM, London | Airless workflow diagram | NHM7 | Needs explanation | ||
NHM, London | Bee types digitisation workflow | NHM8 | ( x ) | TODO: has a workflow for multiple specimens for recommendation reference. May still have some steps that were not understood |
NHM | The Natural History Museum Data Portal | Sco2019 | |||
Meise Botanic Garden | Designing an Herbarium Digitisation Workflow with Built-In Image Quality Management | Hid2020 | |||
Meise Botanic Garden | Image processing, storage diagram | MEISE2 | (Diagram texts are brief but some info can be extracted) | ||
Meise Botanic Garden | Botanical Collections Data Portal - publishing pipelines | MEISE3 | x ? | If understood correctly, this is out of scope for the digitisation process; deals with data publication to the national portal + GBIF |
Museum für Naturkunde Berlin | Workflow diagrams | MfN1 | (Diagram texts are brief but some info can be extracted) | ||
Museum für Naturkunde Berlin | MfN workflow ETL summary | MfN2 | x ||
APPENDIX: Literature with digitisation workflows
Link | Organisation | Name | Ref | Done | Notes |
ICEDIG | Interoperability of Collection Management Systems | Dil2019 | |||
ICEDIG | Quality Management Methodologies for Digitisation Operations | ||||
ICEDIG | Mass-imaging of microscopic and other slides | ||||
ICEDIG | Best practice guidelines for imaging of herbarium specimens | ||||
ICEDIG | State of the art and perspectives on mass imaging of pinned insects | ||||
ICEDIG | State of the art and perspectives on mass imaging of liquid samples | ||||
ICEDIG | State of the art and perspectives on mass imaging of skins and other vertebrate material | ||||
ICEDIG | Methods for Automated Text Digitisation | ||||
ICEDIG | Conceptual design blueprint for the DiSSCo digitization infrastructure | Har2020 | |||
NCSU | Results and insights from the NCSU Insect Museum GigaPan project | ||||
NHM | No specimen left behind: industrial scale digitization of natural history collections | ||||
INHS | InvertNet: a new paradigm for digital access to invertebrate collections | ||||
Swiss Aca of Sci | Handbook on natural history collections management – A collaborative Swiss perspective | ||||
Improved standardization of transcribed digital specimen data | Gro2019 | ||||
Uni Coimbra | A Strategy to digitise natural history collections with limited resources | ||||
Back to the future: A refined single-user photostation for massively scaling herbarium digitization | |||||
NHM | Georeferencing the Natural History Museum's Chinese type collection: of plateaus, pagodas and plants |
APPENDIX: Before ETL workflows
Infra | Step | Action type | Type | Ref | Notes |
Digi station | Fully qualified URI Identifier of the specimen (globally unique persistent identifier) is present as QR-Code on the imaged specimen | Manual (repeated for each specimen) | Identifier | Luo1 | Doc says "barcode" but barcode != qr-code; Best practice is to have the full URI as QR-Code. Luomus has QR-codes. |
Digi station | Barcode is created / scanned | Manual (repeated for each specimen) | Identifier | LIS1 | Most likely internal catalogue number (not fully qualified URI identifier) based on rest of the workflow |
Digi station | Apply barcodes / scan barcodes | Manual (repeated for each specimen) | Identifier | RBGE1 | Unclear if fully qualified URI identifier or internal catalogue number |
Digi station | Before capturing image, specimen data is entered Camera operator enters their details into an online | Manual + Semi-automated (repeated for each specimen) | Specimen data | RBGE1 RBGE2 | |
Digi station | Images are taken in RAW format using CaptureOne software. The barcode on the specimen is scanned by the camera operator and used as the filename for the | Manual (repeated for each specimen) | Image + Identifier | RBGE2 | |
Digi station | The operator selects image(s) and these are processed to TIF format by CaptureOne software. As part of this conversion process the image is cropped to the | Manual (repeated for each specimen) | Image (Transformations) | RBGE2 | |
Digi station | After imaging the dorsal and lateral views, the images have to be rendered (Helicon Focus) and renamed (BardecodeFiler). Then we need to generate a filelist for the dorsal images (command prompt) in order to associate each UID with the correct PTN. After this is done, we remove the PTN from the name of the image (Bulk Rename Utility) and we crop the images to remove the IRN tags and the dead space (Lightroom) 8. Leave to run overnight | Semi-automated (daily / overnight) | Image (Transformations) | NHM8 | |
Digi station | Something called "Syrup" is done after image capture Rename image file with concatenation of scanned specimen barcode and drawer barcode, plus incremental suffix for more than one image of | Automated (presumed) (on-the-fly) | Image + Identifier | NHM3 | Level of automation? |
Digi station | Before capturing image, "System quality control" is done | Automated (on-the-fly?) | Specimen data? (Quality control) | RBGE1 | What is controlled? |
Digi station → CMS | System creates a record for each new barcode and populates the record with data | Automated (on-the-fly?) | Specimen data | RBGE1 |
Digi station → CMS? | The metadata are managed in a MySQL image data management database, and in the image file exif data. The metadata for the original image files are held in one table, and comprise information copied from | Automated? | Image metadata | RBGE2 | Is the mySQL a temporary image metadata repository or also the final one? See doc for exact metadata fields |
Digi station | The camera operator checks to see that there is a pair of images (a RAW & TIF) for each barcode. If either file is missing the images cannot be processed, as the image processing service is expecting both. | Manual (repeated for each specimen) | Image (Quality control) | RBGE2 |
Digi station | Manual checks are done: Look through the list of file names in the final folder - Common errors to check: | Manual (end of day) | Image (Quality control) | NHM2 | |
Digi station | We perform the quality checks; Check that all images look alright | Manual | Image (Quality control) | NHM8 | |
Digi station → Staging area | Image is captured and transferred to dropbox Both the RAW and TIF files are saved onto a network share drive. The folder structure for this includes the camera the operator is using and the operator's username. This is a temporary storage location. | Manual (repeated for each specimen) | Image | RBGE1 RBGE2 | |
Digi station → Staging area | Files manually moved to different folders copy the date folder (with the “images” folder within) in final to: | Manual (end of day) | Image | NHM2 | (Destination seems to be a network drive) |
Digi station → Image archive | Copy the date folder (with the “images” folder within) in final to: Emu-import-dcp_digitisation (This is our back-up area) | Manual (end of day) | Image (Backup) | NHM2 | |
Digi station → ? | Copy the date folder (with the “images” folder within) in final to: DCP-1 - EXTERNAL HARD DRIVE (NOTE: It’s going to be tricky for everyone to save to the hard drive if you’re all leaving at the same time, so you can do this | Manual (end of day) | Image (Backup) | NHM2 | Reason for the external hard drive? |
Digi station | Metadata such as digitiser name/operator is generated and stored at the digitisation station as text file | Automated (on-the-fly) | Image metadata | Luo1 | |
Digi station | Metadata of all images are generated using XnView and a .ipt-template | Semi automated (once a day) | Image metadata | LIS1 | x1 - difference between this and x2 is not clear |
Digi station | Metadata of all images is generated using Limbs digitization software | Semi automated (once a day) | Image metadata | LIS1 | x2 - difference between this and x1 is not clear See doc for exact metadata fields |
Digi station → Backup storage? | Copy images+metadata in current day folder to external drive | Manual (once a day) | Image, Image metadata | LIS1 | Is the external drive for backup purposes? |
Digi station → Staging area | Copy images+metadata to staging area using FileZilla program | Manual (once a day) | Image, Image metadata | LIS1 | |
Digi station → Staging area | Images loaded onto EMu server | ? | Image | NHM3 | Needs more info |
Digi station | Post-Processing can include color corrections and rendering of scale bars | Manual | Image | MfN2 | |
Digi station → Staging area | Manual workflow for 2D imaging on demand: DNG and PNG files are stored in structured file system Manual upload to DAM system after quality check | Manual | Image (+Quality control) | MfN2 | |
Digi station | In case of multi-focus imaging: image acquisition and rendering of multi-focus images are separated steps. | Manual | Image | MfN2 | |
Digi station | Backups are not done at the digitisation station | -- | Infra (Backup) | Luo1 |
Digi station | Specimen data is entered to Excel spreadsheet | Manual (repeated for each specimen) | Specimen data | Luo1 | |
Digi station | Object related metadata: - Mostly metadata are acquired with Excel spreadsheets, which are designed for enabling bulk uploads into the CMS | Manual (repeated for each specimen) | Specimen data? | MfN2 | "Metadata" == data? Not image metadata? |
Digi station → CMS → Image publishing platform | Images are captured and uploaded straight to CMS using Web UI; thumbnails etc are generated by image API; metadata is created and stored; images are moved to image publishing platform | Semi automated (repeated for each image) | Image, Image metadata | Luo1 | The "ETL" parts are done automated and instantaneously without a specific ETL part in the workflow |
Digi station | Raw scans are done using CT Scanner | Manual (repeated for each specimen) | 3d/CT scan | Luo2 | |
Digi station | 3d model is generated from raw CT scans | Manual (repeated for each specimen) | 3d/CT scan | Luo2 | |
Digi station | A smaller scale 3d model is discretized from the model | Manual (repeated for each scan) | 3d/CT scan | Luo2 | |
Digi station → CMS, publishing platform | Small scale 3d model is uploaded straight to CMS using Web UI; thumbnails etc are generated by image API; metadata is created and stored; images and 3d scans are moved to image publishing platform | Semi automated (repeated for each model) | 3d/CT scan | Luo2 | The "ETL" parts are done automated and instantaneously without a specific ETL part in the workflow |
Digi station | Mostly CT images from scientific projects. Processing is mostly done by requesters and/or student helpers. Raw and processed files are stored in the file system and managed by the lab technicians. Upload routines for long-term-archiving and publication are not established yet | Manual (repeated for each specimen) | 3d/CT scan | MfN2 | |
Digi station | Multiple specimens · Give one barcode per specimen. · Make sure it is clear which barcode corresponds to each specimen (written on the barcode, examples: male/female, a/b/c, type etc.) · Image as many times as the specimens, each time with only one UID visible (the other ones reversed) | Manual | Identifier, Image (multi specimen) | NHM8 | |
Digi station → Staging area | Automated workflows for data acquisition with mobile devices (vertebrate collections and assessments): We use the app ODK Collect. Data are uploaded to a central ODK server. | Automated | MfN2 |
APPENDIX: ETL workflows
Infra | Step | Action type | Type | Ref | Notes |
Staging area | System polls dropboxes; starts to execute if new files found | Automated (running background task) | Image | RBGE1 | |
Staging area | Quality control is done Checks include: · Filename - the file name is checked for format and length. It should be the letter E followed by 8 numbers. Any additional images for a particular barcode should be suffixed using _. If a filename does not pass this is returned to an Errors folder which is manually checked by a Digitisation Officer. · Filesize - the size of the file is checked, if it falls outside of the set parameters the file is returned to an Errors folder which is manually checked by a Digitisation Officer. · Image pair - whilst a manual check has been performed by the camera | Automated + Manual | Image (Quality control) | RBGE1 RBGE2 | |
Digi station → staging area | Images and metadata are fetched in real-time or in batches to staging area | Automated (on-the-fly OR daily) | Image, Image metadata | Luo1 | |
Staging area? | This script takes individual images with metadata encoded in the filename and creates a specimen record with appropriate attachments to the taxonomy and location modules. Metadata encoded in format: “UIDBarcode_LocationIRN_TaxonIRN.jpg” | Semi-automated ? | Identifier, Image, Image metadata | All2019 | |
Staging area | Specimen identifier URI is detected and extracted from specimen image and image is named to match the ID and image metadata is updated to contain the specimen ID | Automated (running background task) | Identifier, Image, Image metadata | Luo1 | |
Staging area? | Systems perform all processing steps and deliver two image files (Tiff/Raw and Png). All technical and administrative Metadata related to the images are delivered with a json sidecar file (XML in METS format for library and archival material) | Automated (when?) | Image (Transf) Image metadata | MfN2 | |
Staging area? → CMS | Object related metadata are acquired in different ways. In one case they are delivered together with the images in the json sidecar file and parsed by the database management team (this process is not yet fully established). In most cases object related metadata are acquired in Excel spreadsheets and imported to the respective CMS | Automated Semi-Automated (when?) | Specimen data or Image metadata? | MfN2 | |
Staging area → CMS | Images are attached to records in Specify (CMS) based on catalog numbers in file names | Semi-automated (how often?) | Image | LIS1 | |
Staging area? → Image publishing platform | Automated import of image files and related metadata to the digital asset management system | Automated (when?) | Image | MfN2 | |
Staging area → Image publishing platform | Specify script creates copies of original large image files and creates thumbnails (PNG) to Specify Attachment Server and renames based on UUID; original filename and location is kept in attachment metadata | Semi-automated (how often?) | Image, Specimen data (Transformations) | LIS1 | Originals are tiff files, about 5574x7370 px,8-bit sRGB |
Image publishing platform | Original TIFF files are converted to JPEGs running a Python script that uses ImageMagick library. TIFF images are kept. | Semi-automated (how often?) | Image (Transformations) | LIS1 | |
CMS | Run SQL UPDATE to modify attached TIF files links to point to generated JPEG links instead | Manual (how often?) | Specimen data | LIS1 | |
Staging area | EMu eMesozoic Batch Operation script | ? | ? | NHM3 | Needs more info |
Staging area | Each TIFF is processed through OCR software; the OCR output is recorded as unstructured text to CMS as separate record (not to primary specimen data) A copy is made of the TIF file which is submitted to an OCR pipeline. | Automated | Specimen data (OCR) | RBGE1, RBGE2 | (to what level automated?) |
Staging area → Digi station? → CMS | Batches of records with shared collectors or geography were then transcribed by digitisation staff, using a record set of the data records in the CMS along with an identical set of images presented in the same order in image-viewing software | Semi-automated? (what intervals?) | Specimen data (OCR) | RBGE1 | |
Staging area → Image publishing platform → CMS | Automated workflows for data acquisition with mobile devices (vertebrate collections and assessments): Data are uploaded to a central ODK server. The process for integration into the media repository and the CMS is also automated | Automated? (what intervals?) | Specimen data Image Image metadata? | MfN2 | |
Staging area | Original sized JPG and smaller thumbnails are generated | Automated (running background task) | Image (Transformations) | Luo1 | |
Staging area | Creation of JPG and zoomify files · A high resolution JPG is produced. This is stored in an online accessible · A tiled image is created. This is stored in an online accessible repository and can be viewed on the online catalogue. | Automated (running background task) | Image (Transformations) | RBGE1 RBGE2 | |
Staging area? | Image rotation and cropping using XnConvert | Semi-automated? Automated? (when?) | Image (Transformations) | All2019 | |
Staging area | The metadata for the transformed files produced by the ETL processes are also managed in the MySQL image data management database. Each will have a record of the original file from which it was derived, along with ... | Automated ? | Image metadata | RBGE2 | See doc for exact metadata fields |
Staging area → Image publishing platform | JPG and zoomify files are moved to Image streaming online service | Automated (running background task) | Image | RBGE1 | |
Staging area → Image publishing platform → CMS | Images loaded into EMu multimedia Load images from source folder into EMu Multimedia. For each unique specimen barcode number in the image file name, spawn a new eMesozoic barcode stub record and attach the image(s) to it. At least: location id attached via location barcode; media id via specimen barcode (???) | Automated (presumed) (on-the-fly?) | Image + Image metadata?? | NHM3 | Needs more info on level of automation and details |
Staging area → CMS | Script takes individual images and attaches them to an existing record by matching the UID (NHMUK barcode) in the filename with an existing record in EMu. Metadata encoded in format: “UIDBarcode_suffix.jpg” | Semi-automated? | Data linking | All2019 | |
Staging area | "Sapphire script" - copy image and location to specimen record Search for the specimen number entered (applying search filters). On | Automated (presumed) (on-the-fly?) | Image + Specimen Data | NHM3 | Needs more info on level of automation |
Staging area → Image archive | Original TIFF images are moved to image archive and deleted from staging area | Semi-automated (couple times a week) | Image | Luo1 | Done using command line tools but could (should!) be automated in the future |
Staging area → Image archive | Archive raw and TIFF files | Automated (what intervals?) | Image | RBGE1 | |
Staging area → Image publishing platform | Generated JPG images including thumbnails are moved to image publishing platform | Semi-automated (couple times a week) | Image | Luo1 | Done using command line tools but could (should!) be automated in the future |
Staging area → Image publishing platform | URL of published images and other image metadata is stored to image metadata database | Semi-automated (couple times a week) | Image metadata | Luo1 | Done running a Python script |
Staging area | Backups are not done at staging area | -- | Infra (Backup) | Luo1 | |
Staging area → Image publishing platform | The automated pipeline moves images for publication and download. | Automated | Image, Image metadata | RBGE1 | |
APPENDIX: After ETL workflows
Infra | Step | Action type | Type | Ref | Notes |
Staging area → Backup storage | "At a later stage" images will be copied to INCD cloud service for backup archiving | ? | Image | LIS1 | Possibly semi-automated? |
Image archive → Long-Term Archive | Images are moved from image archive to long-term archive | TBD | Image | Luo1 | Future feature |
CMS | CMS starts to show specimen images once images are in publishing platform and the URLs of the images are in image metadata service | Automated | Data linking | Luo1 | |
CMS | A SOLR index is used to link the image files to the data records for display on our online catalogue | Automated | Data linking | RBGE2 | |
Staging area | The camera operators perform a second check once all of the images should have been processed, to ensure that this has been successful. This uses the same online form as they used prior to processing the images. If any barcodes are showing as unprocessed, the camera operator can resubmit them for processing, or pass the issue on to a Digitisation Officer to see if they can identify the reason for the failure. | Manual | Image | RBGE2 |
CMS | Specimen data is uploaded to the CMS using an Excel spreadsheet | Manual | Specimen data | Luo1 |
CMS | Object related metadata: - Mostly metadata are acquired with Excel spreadsheets, which are designed for enabling bulk uploads into the CMS | Manual | Specimen data? | MfN2 | "Metadata" == data? Not image metadata? |
CMS | Georeferencing, validations etc are done by CMS | Automated | Specimen data (Quality control) | Luo1 | |
→ Backup storage | Images, 3d models are automatically backed up to different cloud server environment Databases (specimen, image metadata) are backed up nightly to tape | Automated | Infra (Backup) | Luo1, Luo2 | |
Image archive | The archive folders are included in regular nightly backups. These are written to tape and taken offline, this is a manual process. They are also manually copied onto external hard drives to provide a backup and an easily accessible version of the data once it has been taken offline. | Manual | Infra (Backup) | RBGE2 | |
CMS | OCR raw data was used as an aid to enhancing minimally databased records, see Dri2014 | TODO | Specimen data (OCR) | RBGE2 TODO: Dri2014 |
CMS | OCR raw data is picked up as part of the SOLR index and is displayed on our online | Automated | Specimen data (OCR) | RBGE2 | |
Digi station | Manual clean-up: Once all the images have been saved and backed up you can empty the “Processing” folders: Empty the following folders | Manual (end of day) | Infra (Clean up) | NHM2 |
Digi station | Scripts, developed in-house for the 2015 pilot, are also currently used for a series of processes known as Flows: (1) bulk transfer of image files from the imaging PC to the data managers, and (2) after ingest into EMu the deletion of the original image files on the imaging PC i.e. clear-down process (Flows; Workflow 3) | Semi-automated | Infra (Clean up) | All2019 | |
? | Lu to import “drawer locations” spreadsheet (Locations module) Lu to import “specimen locations” spreadsheet (Catalogue module) Lu to import condition spreadsheet (Condition module) Lu to import treatment/storage spreadsheet (Processes module) Curators to resolve flagged merges | ? (Every three months) | ? | NHM7 | ??? |
-- | At present completely decentralised, following the (niche-)standards of the respective community | -- | Specimen data - Analytical | MfN2 | |
-- | The RBGE uses the CETAF stable identifiers to track material coming from the Herbarium and the Living collection via a molecular collection management system called EDNA. This is an in-house developed system that is under review. | -- | Specimen data - Analytical/chemical/molecular data | RBGE2 |