
5. Web pages - scraping

Web scraping refers to the automated extraction of structured web content. 

Tools that do not require coding skills:

  • Extracting text and full websites:
    • A1 Website Download crawls and archives a full website (free mode / 30-day trial)
    • A1 Website crawler will scrape data from websites into CSV & SQL
    • Text Ripper by DMI - an easy-to-use tool to harvest the text on a web page
    • Dataminer.io is a Google Chrome and Edge browser extension that helps you crawl and scrape data from web pages into a spreadsheet. Requires a Google login.
    • Voyant Tools can be used to scrape the text and show word frequencies and concordances
    • Octoparse is a powerful commercial scraping tool that offers a 14-day free trial.
    • Conifer is a web archiving service that creates an interactive copy of any web page that you browse
    • Google Sheets can be used to scrape and clean webpage data in small quantities. Some tutorials:
  • Extracting links or structures 

Coding-based solutions:

  • BeautifulSoup is a Python library for pulling data out of HTML and XML files. 
  • Scrapy - an open source and collaborative framework for extracting the data you need from websites.
  • rvest is an R package that helps you scrape (or harvest) data from web pages 
  • Selenium is a framework for browser automation that can also be used for scraping dynamic sites
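
To give a sense of what the coding-based tools above do, here is a minimal sketch in Python that pulls link texts and URLs out of an HTML page. It deliberately uses only the standard library module html.parser rather than any of the libraries listed, so it runs without installing anything; BeautifulSoup, Scrapy, or rvest would express the same idea more compactly. The names LinkExtractor, extract_links, and SAMPLE_HTML, as well as the example.org URLs, are made up for illustration, and the sample page is an inline string rather than a live website.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (link text, URL) tuples found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, standing in for a downloaded website.
SAMPLE_HTML = """
<html><body>
  <h1>Tools</h1>
  <ul>
    <li><a href="https://example.org/a">First tool</a></li>
    <li><a href="https://example.org/b">Second <b>tool</b></a></li>
    <li>No link here</li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    # Print one tab-separated row per link, ready to paste into a spreadsheet.
    for text, href in extract_links(SAMPLE_HTML):
        print(f"{text}\t{href}")
```

For a real page, the HTML string would first have to be downloaded (for example with urllib.request from the standard library), and sites that build their content with JavaScript usually need a browser-automation tool such as Selenium instead.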
     

