Data formats for textual data

Last modified by 14zunde on 2024/02/13 07:41

Only some basic information about digital formats for textual data, and their pros and cons, can be provided here. The formats currently used most often for text data are:

plain text (file suffix .txt): The text of each witness in the tradition may be stored as a separate plain text file, or the entire tradition may be kept in one file with one witness per line. Plain text files cannot encode formatting such as italics, bold, or font size. Every plain text file uses a character encoding, such as a Unicode encoding (e.g., UTF-8, UTF-16), an ISO encoding (e.g., ISO 8859-1), or a platform-specific encoding from MS Windows or Mac. Opening a text document with the wrong encoding results in garbled content. UTF-8 is the de facto standard today for European alphabets.
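The garbling caused by a wrong encoding can be reproduced in a few lines. The following is a minimal Python sketch; the sample string is invented for illustration, and Latin-1 (ISO 8859-1) merely stands in for "a wrong encoding".

```python
# Illustrative sample text (an assumption, not taken from any real witness).
text = "édition critique"

# Store the text as UTF-8 bytes, as a .txt file on disk would.
data = text.encode("utf-8")

# Reading the bytes back with the same encoding restores the text.
right = data.decode("utf-8")

# Reading them as Latin-1 (ISO 8859-1) instead garbles the accented letter,
# because the two UTF-8 bytes of "é" are shown as two separate characters.
wrong = data.decode("latin-1")

print(right)
print(wrong)
```

When opening a file in Python, passing the encoding explicitly, as in open(path, encoding="utf-8"), avoids depending on the platform default.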

markup text (.html and .xml): Tags make it possible to add formatting to plain text files. They come either as a pair of opening and closing tags that applies an action (like printing in italics) to the text in between, <tag> .... </tag>, or as a single "self-closing" tag like <tag/>. Internet pages are of this kind (html), whereas .xml, especially when it follows the tagging rules defined by the Text Encoding Initiative (TEI), is now the de facto standard for storage and interchange of written documents. These formats can easily be edited in a plain text editor on any platform, in contrast to the following formats, which require specialised software:
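To illustrate how paired and self-closing tags behave, here is a minimal Python sketch using the standard library XML parser. The snippet borrows the TEI element names hi and lb, but it is an invented fragment, not a complete or valid TEI document.

```python
import xml.etree.ElementTree as ET

# Invented fragment: <hi rend="italic"> is a paired tag around one word,
# <lb/> is a self-closing tag marking a line break.
snippet = (
    '<p>In principio <hi rend="italic">erat</hi> verbum<lb/>'
    " et verbum erat</p>"
)

root = ET.fromstring(snippet)

# Strip all markup and keep only the running text.
plain = "".join(root.itertext())

# Collect the words that were marked as highlighted.
highlighted = [el.text for el in root.iter("hi")]

print(plain)
print(highlighted)
```

Because the markup is itself plain text, the same fragment can be inspected or corrected in any text editor before it is parsed.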

OpenDocument text document (.odt) is the open standard format used in open-source word processing programs (like OpenOffice and LibreOffice). It uses its own markup for the text and saves it together with all additional data (like pictures) in a compressed file (using the zip algorithm). Its advantage over xml is that it can easily be displayed in most word processing software (now also including MS Word), and it allows many types of markup for text layout (page sizes, margins, etc.) and styles (headings, italics, fonts and font sizes, etc.).
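Since an .odt file is a zip archive containing markup, its text can in principle be reached without a word processor. The sketch below is one possible approach in Python; the function name odt_paragraphs is hypothetical, and the demonstration archive is a tiny stand-in built in memory, not a complete OpenDocument file (a real one contains further members and a different root element). Headings and other non-paragraph elements are ignored here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used for text elements in OpenDocument content.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(source):
    """Return the plain text of each paragraph (hypothetical helper).

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        content = archive.read("content.xml")
    root = ET.fromstring(content)
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


# Stand-in archive for demonstration only (assumption: simplified content).
demo_xml = (
    f'<doc xmlns:text="{TEXT_NS}">'
    "<text:p>Witness A: in principio</text:p>"
    "<text:p>Witness B: in <text:span>principio</text:span> erat</text:p>"
    "</doc>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", demo_xml)
buffer.seek(0)

paragraphs = odt_paragraphs(buffer)
print(paragraphs)
```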

Microsoft Word (.doc or .docx; including many different sub-versions): these formats are owned by Microsoft and require Word-compatible word processing software to access the content. Their uses and functionality are similar to those of odt. Standard word processing programs (like OpenOffice or Word) can export data as plain text or markup text (html, some versions of xml); this may be a necessary step to make the data readable for specialised stemmatological software.

Spreadsheets, that is tables with rows and columns that may contain numbers, formulas or text, use their own formats:

Comma-separated values (.csv) is a plain text (cf. above) format for spreadsheet data. It can represent only the textual content, not formatting information. The values can be separated by commas (','), tabulators, spaces, or other special characters. They can also be enclosed in delimiting characters such as quotation marks.
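The effect of the separator and of enclosing characters can be seen with Python's standard csv module. This is a minimal sketch with invented sample data; the column names siglum and text are assumptions, and double quotes are used as the enclosing character.

```python
import csv
import io

# Same table twice: once comma-separated, once tab-separated.
# In the comma version, the field containing a comma is enclosed in
# double quotes so that the comma is not read as a separator.
comma_data = 'siglum,text\nA,"in principio, erat verbum"\n'
tab_data = "siglum\ttext\nA\tin principio, erat verbum\n"

rows_comma = list(csv.reader(io.StringIO(comma_data)))
rows_tab = list(csv.reader(io.StringIO(tab_data), delimiter="\t"))

print(rows_comma)
print(rows_tab)
```

Both calls yield the same rows, which shows that the separator is a property of the file that the reading software has to be told about.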

OpenDocument spreadsheet document (.ods) is the open standard format for spreadsheet documents. A textual tradition can be organised so that each text occupies a single row and each word occupies a single column from left to right. Adding gaps in suitable places in each of the texts, so as to ensure that the same or comparable words in each of the texts are placed in the same column, is called alignment. This can be done either manually in a spreadsheet program or automatically using alignment tools. Alternatively, each text may be placed in a single column in the document so that each word takes a single row, starting from the top, and different witnesses can be put into different columns.
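The two layouts described above are transpositions of each other. The following minimal Python sketch uses three invented witnesses that have already been aligned by hand, with "-" as an assumed gap symbol; it does not perform the alignment itself.

```python
GAP = "-"  # assumed gap symbol; any otherwise unused marker would do

# Layout 1: one witness per row, one word position per column.
aligned = {
    "A": ["in", "principio", "erat", "verbum"],
    "B": ["in", "principio", GAP, "verbum"],
    "C": ["in", "initio", "erat", "verbum"],
}

# Layout 2: one word position per row, one witness per column.
columns = list(zip(*aligned.values()))

# Positions where the witnesses disagree (including gaps).
variant_positions = [
    i for i, readings in enumerate(columns) if len(set(readings)) > 1
]

for readings in columns:
    print("\t".join(readings))
print(variant_positions)
```

The transposed layout makes it easy to read off the places of variation row by row, which is the kind of information stemmatological software typically works from.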

Microsoft Excel (.xls or .xlsx; including many different sub-versions) belongs to Microsoft and requires Excel-compatible spreadsheet software to access the content. It can be used as an alternative to OpenDocument. Again, standard spreadsheet programs can export spreadsheet data as .csv (which is itself plain text); this may be a necessary step to make the data readable for specialised stemmatological software.