Association of Research Libraries 21 Dupont Circle NW, Suite 800, Washington, DC 20036 (202) 296-2296 | ARL.org

1. Introduction to Webscraper.io

1.1. Installing Webscraper.io

Type ‘webscraper.io’ into your URL bar to navigate to the scraper’s website. The site has a wealth of documentation as well as a very active user forum. Webscraper.io regularly updates both of these sections with information that can help resolve specific issues that arise. To install the extension itself, click on the blue ‘Download Free on Chrome Store’ button.

1.2. Navigating to Webscraper.io

Figure 4 - ‘Web Scraper’ panel

The first window that appears when navigating to Webscraper.io is the Sitemap panel (See Creating a Sitemap). A sitemap organizes all the information required for scraping a particular website. It will be blank at install, but once you create sitemaps, they will appear here. The first column lists the ID, or name, of each sitemap. The second column is the URL or web address for the first page of that sitemap.

Figure 5 - Sitemap menu

Sitemaps organize all the information about scraping a particular website in one location. They house the various selectors (See Creating a Selector) and tell the web scraper the title and starting URL of the site to scrape. To create a sitemap, click on the ‘Create new Sitemap’ button. Then you can either import a previously built sitemap or create a blank sitemap.

A user who has already created a Webscraper.io sitemap has the option to export and share that sitemap with other users to import in their own web scrapers (See Exporting Sitemaps). The ‘Import Sitemap’ button creates a sitemap, which can then be manipulated. Importing a sitemap requires the JavaScript Object Notation (JSON) that another user’s instance of Webscraper.io generated. Clicking on the ‘Import sitemap’ button brings up two text entry fields. The user copies and pastes the JSON, which is formatted in a particular way, into the larger of the two fields. In the second box, the user can give the sitemap a name distinct from the one in the imported JSON code to ensure that there is no duplicate sitemap within Webscraper.io. For group projects, it also helps to keep track of the date a sitemap was imported, or who worked on it, by adding the relevant information to the end of the title.
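The renaming step can also be done on the JSON itself before pasting it in. The following Python sketch is illustrative only: it assumes the exported JSON stores the sitemap name under an `_id` key alongside `startUrl` and `selectors`, and the sample sitemap, the URL, and the `rename_sitemap` helper are all hypothetical.

```python
import json
from datetime import date


def rename_sitemap(sitemap_json: str, suffix: str) -> str:
    """Return the sitemap JSON with `suffix` appended to its ID."""
    sitemap = json.loads(sitemap_json)
    # Assumption: "_id" is the key that holds the sitemap name.
    sitemap["_id"] = f'{sitemap["_id"]}-{suffix}'
    return json.dumps(sitemap)


# Hypothetical sitemap exported by a colleague.
exported = (
    '{"_id": "library-news", '
    '"startUrl": ["https://example.org/news"], '
    '"selectors": []}'
)

# Append today's date so the imported copy does not clash with an existing one.
print(rename_sitemap(exported, date.today().isoformat()))
```

Typing the new name into the second text box of the import window achieves the same result without any code; the sketch only shows what changes in the JSON.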

Figure 7 - Import a sitemap

Once the user enters the title and URL, he or she should click on the ‘Create Sitemap’ button to add it to the web scraper.

Figure 10 - Metadata fields

4. Creating a Selector

The link selector also tells Webscraper.io that there is information on the linked page, which allows users to add selectors to new pages for scraping. Other types that we did not test are the HTML selector and the element selector.

The selector field is the most important field other than the ID. It is where users select the elements on the web page that they want to use.

Type in the name of the selector in the ‘Selector ID’ text box. There are no rules regarding the ID. If the web scraping is part of a larger project outside of the scraping itself, we suggest using the vocabulary set up for that larger project.

Clicking on the ‘Select’ button makes a green highlight appear around the different HTML elements on the web page. It highlights both text elements, like h (heading) or p (paragraph) tags, and container elements, like div tags. Clicking on an element adds its code to the selector bar. Each green highlight turns red when you select it. The checkbox in the select bar allows users to select multiple types of tags, which is useful for keeping a title with its description. The ‘Done selecting’ button adds the selections to the selector text box. Users can also edit the text, if necessary.

The Multiple checkbox tells Webscraper.io to extract more than one of the selected elements. This is helpful when a page has lists or navigation links that repeat the same tag several times.

During this project we did not test the Regex box, which uses regular expressions to manipulate the data that goes into the export file. We also did not test the delay, which tells the scraper to let the web page load for a given amount of time before running the selector.
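For readers unfamiliar with regular expressions, the sketch below shows the general idea in Python rather than in Webscraper.io. Because we did not test the Regex box, this is not a description of its exact behavior; the `first_match` helper, the pattern, and the sample text are made up for illustration.

```python
import re


def first_match(pattern: str, text: str) -> str:
    """Return the first part of `text` that matches `pattern`, or '' if none does."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


# Hypothetical scraped text: keep only the date.
scraped = "Posted on 2019-03-14 by staff"
print(first_match(r"\d{4}-\d{2}-\d{2}", scraped))  # prints 2019-03-14
```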

There is no solution for selectors that cannot be found; this is a bug in the software, and Webscraper.io is aware of it. A mislabeled selector will still appear, but it must be found manually and then edited to point to the correct Parent selector. The best way to find this type of error is to use the Selector Graph to show all selectors, minus the instance just discussed. Once found, users can navigate to the Parent selector in the selector menu and change it accordingly.
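A check on the exported sitemap JSON can complement the Selector Graph. The Python sketch below is a rough aid, not a Webscraper.io feature: it assumes each entry in `selectors` has an `id` and a `parentSelectors` list and that `_root` names the top level, and it only catches parents that do not exist at all, not parents that exist but are the wrong choice. The `missing_parents` helper and the sample sitemap are hypothetical.

```python
def missing_parents(sitemap: dict) -> dict:
    """Map each selector ID to the parent IDs that are not defined in the sitemap."""
    # Assumption: "_root" is the built-in name for the top level of a sitemap.
    known = {"_root"} | {sel["id"] for sel in sitemap["selectors"]}
    problems = {}
    for sel in sitemap["selectors"]:
        missing = [p for p in sel.get("parentSelectors", []) if p not in known]
        if missing:
            problems[sel["id"]] = missing
    return problems


# Hypothetical sitemap: "author" points at a misspelled parent.
sitemap = {
    "_id": "library-news",
    "startUrl": ["https://example.org/news"],
    "selectors": [
        {"id": "article", "type": "SelectorLink", "parentSelectors": ["_root"]},
        {"id": "title", "type": "SelectorText", "parentSelectors": ["article"]},
        {"id": "author", "type": "SelectorText", "parentSelectors": ["artcle"]},
    ],
}
print(missing_parents(sitemap))  # prints {'author': ['artcle']}
```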

5. Scraping a Website

Figure 16 - ‘Scrape’ button

Users have the option to add either a request interval or a page load delay to the entire scraping process (See Creating a Selector). Both options give a page extra time to load before the scraper begins extracting information, which helps when a page holds a lot of information or has elements that load slowly. The time delay is in milliseconds, with a default of 2000. Anything shorter than this may mean that the page has not finished loading the information to be scraped. Once the preferred time is entered, click the ‘Start scraping’ button.
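The effect of the two timing options can be sketched in Python. This is not how Webscraper.io is implemented; it assumes the request interval is a pause between page requests and the page load delay is a pause after each page is fetched, and it uses the 2000 millisecond default for both as an assumption. The `scrape_with_delays` function, the `fetch` stand-in, and the URLs are hypothetical.

```python
import time


def scrape_with_delays(urls, fetch, request_interval_ms=2000,
                       page_load_delay_ms=2000, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and after each load."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Request interval: wait before asking for the next page.
            sleep(request_interval_ms / 1000)
        page = fetch(url)
        # Page load delay: give the page time to finish loading.
        sleep(page_load_delay_ms / 1000)
        results.append((url, page))
    return results


# Demonstration with a stand-in fetch function and a recorded (not real) sleep.
pauses = []
pages = scrape_with_delays(
    ["https://example.org/a", "https://example.org/b"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=pauses.append,
)
print(pauses)  # prints [2.0, 2.0, 2.0]
```

Passing `sleep` as a parameter keeps the demonstration instant; leaving it at its default makes the function actually wait.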

Figure 18 - Scrape window

The web scraper automatically switches to the Browse panel when it is finished. Clicking on the ‘Refresh’ button shows the data preview.

Figure 20 - Browse data

If no data appears, click ‘Refresh.’

Figure 22 - Preview of data

Users can then copy the code by either pressing CTRL+C or right-clicking their mouse and selecting ‘Copy.’ The code can then be pasted into a text file or word processor to save a copy, or into an email for sharing. Any later changes to the sitemap in Webscraper.io will also change this export, so previously saved copies will no longer be accurate.

A downloadable file generates as soon as a user enters this panel, with a blue ‘Download now’ link appearing when the file is ready for download. Once clicked, the file downloads to the
