5. Web pages - scraping
Web scraping refers to the automated extraction of structured web content.
Tools that do not require coding skills:
- Extracting text and full websites:
  - A1 Website Download crawls and archives a full website (free mode / 30-day trial)
  - A1 Website crawler scrapes data from websites into CSV and SQL
  - Text Ripper by DMI - an easy-to-use tool to harvest the text on a web page
  - Dataminer.io is a Google Chrome and Edge browser extension that helps you crawl web pages and scrape their data into a spreadsheet. Requires a Google login.
  - Voyant Tools can be used to scrape the text of a page and show word frequencies and concordances
  - Octoparse is a powerful commercial scraping tool that offers a 14-day free trial
  - Conifer is a web archiving service that creates an interactive copy of any web page that you browse
  - Google Sheets can be used to scrape and clean webpage data in small quantities. Some tutorials:
- Extracting links or structures:
  - Link Ripper by DMI - an easy-to-use tool that extracts all the links on a page
  - Harvester by DMI - an easy-to-use tool that extracts links from a given HTML page
Coding-based solutions:
- BeautifulSoup is a Python library for pulling data out of HTML and XML files.
- Scrapy is an open-source, collaborative framework for extracting the data you need from websites.
- rvest is an R package that helps you scrape (or harvest) data from web pages
- Selenium is a framework for browser automation that can also be used to scrape dynamic sites
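
To give a sense of what a coding-based approach involves, here is a minimal sketch of link extraction in Python. It deliberately uses only the standard library (html.parser and urllib.parse) rather than any of the libraries listed above, so it runs without installing anything. The names LinkExtractor, extract_links and SAMPLE_HTML, and the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content; a real script would download this first
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/data.csv">Data</a>
  <a name="anchor-only">No link here</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "https://example.com/"):
        print(link)
```

In a real project the HTML would be fetched from the live site, and a library such as BeautifulSoup or rvest would usually replace the hand-written parser class with a few higher-level calls. Before scraping any site, check its terms of use and robots.txt.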