Setting up the environment for web scraping

My first project as part of the AiCore fellowship is to collect data scattered across the web and build datasets from it. For this I use a web scraper: an automated bot that crawls web pages and extracts data.

The most popular Python libraries for this are Beautiful Soup, Scrapy, and Selenium. I am using Selenium for this project.

Why Selenium?

Selenium is an automated testing tool for web applications. It is not the most efficient tool for collecting data, but it makes it easy to interact with DOM elements and extract data from dynamic pages on the target site.

How to install

The process might differ depending on the environment, but in this post, I am using the environment listed below.

Linux (Ubuntu)
Miniconda3 (docs.conda.io/en/latest/miniconda.html)
Visual Studio Code (code.visualstudio.com/download)

Step 1. Prepare web drivers

According to the official Selenium website, WebDriver drives a browser natively, as a user would, either locally or on a remote machine using the Selenium server. In simple terms, it imitates user actions as instructed by the Python code. To use Selenium, you need to download the web driver for the browser you want to control.

1) Firefox

Download geckodriver from the official releases page (github.com/mozilla/geckodriver/releases).
Extract the file and remember its location.

2) Chrome

Download ChromeDriver from the official website (chromedriver.chromium.org/downloads).
Extract the file and remember its location.

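When you download a driver, make sure it is compatible with the version of your browser. For Chrome, the ChromeDriver version should match the browser version. geckodriver uses its own version numbers, so check its release notes for the Firefox versions it supports. You can find your browser version here: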

Firefox version: Settings > General > Firefox Updates
Chrome version: Settings > About Chrome
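
As a rough illustration of the version check for Chrome, the sketch below compares the major version numbers from two version strings. The helper names (major_version, versions_match) and the example strings are assumptions made for this post. The strings only mimic the general shape of what `google-chrome --version` and `chromedriver --version` print.

```python
import re


def major_version(version_text):
    """Return the first number followed by a dot as an int, or None."""
    match = re.search(r"(\d+)\.", version_text)
    return int(match.group(1)) if match else None


def versions_match(browser_text, driver_text):
    """True when both strings report the same major version."""
    browser = major_version(browser_text)
    driver = major_version(driver_text)
    return browser is not None and browser == driver


# Made-up example strings; paste in the output from your own machine.
print(versions_match("Google Chrome 114.0.5735.90",
                     "ChromeDriver 114.0.5735.90 (abc123)"))
```

This check only makes sense for Chrome and ChromeDriver, because geckodriver version numbers are unrelated to Firefox version numbers.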

Step 2. Move the drivers to the right location

Now it is time to move the extracted driver files to the right location so our Python code can find them when we run the code.

To find the location, run the command below in the VS Code terminal.

echo $PATH

$PATH is an environment variable that holds a colon-separated list of directories the shell searches for executables. If you put an executable in any one of these directories, you can run it by name as a command without specifying its full path.
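
If you prefer to look at the same list from Python, a minimal sketch could look like this. The function name path_directories is just an illustration. It splits the PATH value on the platform's separator, which is a colon on Linux.

```python
import os


def path_directories(path_value=None):
    """Split a PATH-style string into directories, skipping empty entries."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return [entry for entry in path_value.split(os.pathsep) if entry]


# Print one directory per line, in the order they are searched.
for directory in path_directories():
    print(directory)
```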

My driver files were in the Downloads folder, so I moved them to one of the folders on the list (/usr/bin). Writing to /usr/bin normally requires root privileges, hence sudo.

$ cd /usr/bin
$ sudo mv /home/yoojin/Downloads/geckodriver .
$ sudo mv /home/yoojin/Downloads/chromedriver .
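
To confirm that the drivers can now be found by name, something like the sketch below may help. It uses shutil.which from the standard library, which searches $PATH in much the same way the shell does. The function name find_drivers is my own.

```python
import shutil


def find_drivers(names=("geckodriver", "chromedriver")):
    """Map each driver name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


for name, location in find_drivers().items():
    print(f"{name}: {location or 'not found on PATH'}")
```

If a driver shows up as not found, check that the file was moved to a directory on the list and that it is executable.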

Step 3. Selenium library setup

The next step is to install the Selenium library. As I am using a Miniconda environment, I used the conda command.

conda install selenium
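
A quick way to check that the install worked, without opening a browser, is to ask the active interpreter whether it can locate the module. This is only a sketch, and is_installed is a name I made up for it.

```python
import importlib.util


def is_installed(module_name):
    """True if the active interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


print("selenium installed:", is_installed("selenium"))
```

If this prints False right after a successful install, the code is probably running in a different environment from the one conda installed into.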

Step 4. Run the sample code

Everything is now ready to run the sample code. Copy the code below and run it in VS Code.

Firefox

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.google.co.uk')

Chrome

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.google.co.uk')
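
The samples above leave the browser open when the script ends. One possible pattern, sketched here with a hypothetical helper called fetch_title, is to wrap the work in try/finally so the browser and the driver process are always closed, even if loading the page fails.

```python
def fetch_title(driver, url):
    """Load url, return the page title, and always close the browser."""
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # closes the browser and stops the driver process


# Usage with a real browser (requires the driver from Step 1):
#   from selenium import webdriver
#   print(fetch_title(webdriver.Firefox(), 'https://www.google.co.uk'))
```

The helper takes an already created driver, so the same function works with either webdriver.Firefox() or webdriver.Chrome().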

Troubleshooting

ModuleNotFoundError

Traceback (most recent call last):
  File "/home/yoojin/Documents/Pokemon_scraper/test.py", line 1, in <module>
    from selenium import webdriver
ModuleNotFoundError: No module named 'selenium'

Sometimes when you run the code, an error says the library cannot be found, even though you have definitely installed it.

In this case, it is worth checking that you are in the right environment. In Visual Studio Code, you can change the Python interpreter by clicking the interpreter name in the status bar at the bottom of the window, or by using the Python: Select Interpreter command from the Command Palette (Ctrl+Shift+P).
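
To see which interpreter is actually running your script, you could print its path from the script itself. This is a small sketch, and the function name interpreter_info is my own.

```python
import sys


def interpreter_info():
    """Return the path of the running interpreter and its environment prefix."""
    return {"executable": sys.executable, "prefix": sys.prefix}


info = interpreter_info()
print("Python executable:", info["executable"])
print("Environment prefix:", info["prefix"])
```

If the printed path does not point inside the Miniconda environment where Selenium was installed, the script is using a different interpreter.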

Display Error

selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status: 1

This error can come up when you try to run the browser in non-headless mode on a machine that has no display. There are two ways to resolve it.

The first method is to run the driver in headless mode. This means the browser window is not shown, so you cannot watch the web driver's actions.

Firefox

from selenium import webdriver
from selenium.webdriver import FirefoxOptions

firefox_options = FirefoxOptions()
firefox_options.add_argument('--headless')
driver = webdriver.Firefox(options=firefox_options)
driver.get('https://www.google.co.uk')

Chrome

from selenium import webdriver
from selenium.webdriver import ChromeOptions

chrome_options = ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.google.co.uk')

But this is not helpful when you want to watch how the web driver behaves and check the code in real time. In that case, the second method is to set the DISPLAY environment variable in the terminal.

$ export DISPLAY=:0.0

The variable is needed because you can have multiple X servers running locally, or you may wish to use a remote display. If DISPLAY is not set, X11 applications have no way of knowing which display to open on.
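
The two methods can also be combined: a script could check whether DISPLAY is set and use headless mode only when it is not. The sketch below shows just that decision. The function name should_run_headless is an assumption of mine, and the check only covers X11 on Linux.

```python
import os


def should_run_headless(environ=None):
    """True when DISPLAY is missing or empty, so no X server is available."""
    if environ is None:
        environ = os.environ
    return not environ.get("DISPLAY")


print("Run headless:", should_run_headless())
```

When the function returns True, you would add the '--headless' argument to the browser options before creating the driver.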

Next Step

Now everything I need for the project is properly set up. The next step is to build a demo scraper.
