6. Documents

Last modified by Salla-M Laaksonen on 2024/09/16 19:13

Tools for extracting text and data from document files such as Word documents, PowerPoint presentations, and PDFs.

Tools that do not require coding skills:

Coding-based solutions:

  • Textract - Python library to extract text from Word documents, PowerPoint presentations, PDFs, etc.
  • PDFMiner - Python library to extract text from PDF files.
  • tm - R package for various text mining applications, including text extraction.
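
As a rough illustration of the coding-based route, the sketch below collects the text of every PDF in a folder into a dictionary. It assumes the maintained pdfminer.six fork of PDFMiner is installed (pip install pdfminer.six) and uses its extract_text function. The helper names extract_pdf_text and collect_texts, and the folder name in the usage note, are hypothetical and chosen only for this example.

```python
from pathlib import Path


def extract_pdf_text(path):
    """Return the text of one PDF as a string.

    Assumes pdfminer.six is installed. The import is done here so the
    rest of the script can be loaded even without the library.
    """
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


def collect_texts(folder, extractor=extract_pdf_text, pattern="*.pdf"):
    """Apply `extractor` to every file in `folder` that matches `pattern`.

    Returns a dict mapping file name to extracted text. Both helper
    names are hypothetical, not part of any library.
    """
    texts = {}
    for path in sorted(Path(folder).glob(pattern)):
        texts[path.name] = extractor(path)
    return texts
```

Calling collect_texts("reports") would read every PDF in a folder named reports. Scanned PDFs that contain only images will typically yield little or no text this way and need OCR instead. Another extractor can be passed in through the extractor argument, for example a small wrapper around textract.process for Word or PowerPoint files, together with a matching pattern.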