8. Web pages


Web pages - scraping

Web scraping refers to the automated extraction of content from web pages into a structured format.

Tools that do not require coding skills:

  • Extracting text and full websites:
    • A1 Website Download crawls and archives a full website (free mode / 30-day trial)
    • A1 Website crawler scrapes data from websites into CSV and SQL
    • Text Ripper by DMI - an easy-to-use tool to harvest the text from a web page
    • Dataminer.io is a Google Chrome and Edge browser extension that helps you crawl and scrape data from web pages into a spreadsheet. Requires a Google login.
    • Voyant Tools can be used to scrape the text of a web page and show word frequencies and concordances
    • Octoparse is a powerful commercial scraping tool that offers a 14-day free trial.
    • Conifer is a web archiving service that creates an interactive copy of any web page that you browse
    • Google Sheets can be used to scrape and clean web page data in small quantities (see the formula example after this list). Some tutorials:
  • Extracting links or structures:
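
For the Google Sheets option above, the built-in IMPORTXML function is usually enough for small jobs. For example, a formula such as =IMPORTXML("https://example.com", "//h2") pulls every second-level heading from the page into a column; the URL and XPath query here are only illustrative and need to be adapted to the page you are scraping.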

Coding-based solutions:

  • BeautifulSoup is a Python library for pulling data out of HTML and XML files (a minimal usage sketch follows this list).
  • Scrapy - an open-source and collaborative framework for extracting the data you need from websites.
  • rvest is an R package that helps you scrape (or harvest) data from web pages
  • Selenium is a framework for browser automation; it can also be used to scrape dynamic, JavaScript-rendered sites
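
As a rough illustration of the coding route, the sketch below uses the requests library together with BeautifulSoup to list the link texts and target addresses on a page. The URL is a placeholder; real pages usually need site-specific selectors and polite rate limiting.

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL; replace with the page you want to scrape
    url = "https://example.com"
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse the HTML and collect every link's text and target address
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.find_all("a", href=True):
        print(link.get_text(strip=True), link["href"])

Selenium differs in that it drives a real browser, so content generated by JavaScript is present before parsing; the resulting page source can then be handed to BeautifulSoup in the same way.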

Command line tools:

  • Headless Chrome is intended as a testing tool for web developers, but it can also be used to print HTML pages to PDF in batches, optionally using your own browsing profile, including logins and cookies (example invocations follow this list).
  • wkhtmltopdf and wkhtmltoimage are open-source command-line tools that render HTML into PDF and various image formats using the Qt WebKit rendering engine
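
Typical invocations look roughly like the following. The URLs and output file names are placeholders, the browser binary may be named google-chrome or chromium on your system, and the exact Chrome flags vary somewhat between versions.

    # Print a page to PDF with Headless Chrome
    chrome --headless --print-to-pdf=page.pdf https://example.com

    # Render the same page to PDF and to PNG with the wkhtmltopdf tools
    wkhtmltopdf https://example.com page.pdf
    wkhtmltoimage https://example.com page.png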