Selenium Web Scraping





Selenium is a widely used tool for web automation. It comes in handy for automating website tests or helping with web scraping, especially for sites that require JavaScript to be executed. In this article, I will show you how to get up to speed with Selenium using Python.

What is Selenium?

Selenium’s mission is simple: automate web browsers. If you need to execute the same task on a website over and over, it can be automated with Selenium. This is especially useful for routine web administration tasks, but also when you need to test a website.

With this simple goal, Selenium can be used for many different purposes, for instance web scraping. Many websites run client-side scripts to present data asynchronously. This causes issues when the data you need is rendered through JavaScript. Selenium comes to the rescue here by automating the browser to visit the site and run the client-side scripts, giving you the fully rendered HTML. If you simply used the Python requests package to get HTML from a site that runs client-side code, the rendered HTML wouldn't be complete.


There are many other cases for using Selenium. In the meantime let’s get to using Selenium with Python.

Installing Selenium in Python

Before you begin, you need to download the driver for your particular browser. This article uses Chrome, so download the ChromeDriver build that matches the version of Chrome you are running from the official ChromeDriver downloads page.

The next step is to install the Selenium Python package in your environment. This can be done with the following pip command:
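The install command referenced above:

```shell
pip install selenium
```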

Selenium 101

To begin using Selenium, you need to instantiate a Selenium webdriver. This class controls the web browser, and through it you can take various actions as if you were the one navigating the browser, such as going to a URL or clicking a button. Let's see how to do that using Python.

First, import the necessary modules and instantiate a selenium webdriver. You need to provide the path to the chromedriver.exe you downloaded earlier.
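A minimal sketch of that setup; the path is a placeholder for wherever you saved the executable. (This uses the Selenium 3-style `executable_path` argument; newer Selenium versions pass the path through a `Service` object instead.)

```python
from selenium import webdriver

# Point Selenium at the chromedriver executable downloaded earlier
driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")
```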

After executing the command, a new browser window will open up specifying that it is being controlled by automated testing software.

In some cases, Chrome opens with an error message, and you need to disable extensions to get rid of it. To pass options to Chrome when starting it, use the following code.
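A sketch of starting Chrome with extensions disabled; the driver path is again a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--disable-extensions")  # launch Chrome with extensions disabled

driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe",
                          options=options)
```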

Now, let’s navigate to a specific URL, in our case Google’s homepage, by executing the get function.
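Continuing with the driver created above:

```python
# Navigate the automated browser to Google's homepage
driver.get("https://www.google.com")
```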

Locate, Enter a Value to TextBox

What do you do on Google? You search! Let’s use Selenium to perform an automated search on Google. First, you need to learn how to locate items.


Selenium provides many options to do so. You can find web elements by ID, name, text, and more. See the Selenium documentation for the full list of locator strategies.

We will be locating the textbox by name. Google’s input textbox has a name of q. Let’s find this element with Selenium.

Once this element is found, enter your search term into it by executing the following method.

Lastly, send an “Enter” command as you would from your keyboard.
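The three steps together, continuing with the driver already on google.com; the search term here is a placeholder, and `find_element_by_name` is the Selenium 3-style call (newer versions use `find_element(By.NAME, "q")`):

```python
from selenium.webdriver.common.keys import Keys

search_box = driver.find_element_by_name("q")  # Google's search box has name="q"
search_box.send_keys("selenium python")        # type the search term (placeholder)
search_box.send_keys(Keys.RETURN)              # press Enter to submit the search
```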

Wait for an Element to Load

As mentioned earlier, the page you browse to often doesn't load completely at first; instead, it executes client-side code that takes longer to render, and you need to wait for it to finish before continuing. Selenium provides functionality for this through the WebDriverWait class. Let's see how to use it.

TipRanks.com is a site that lets you see the track record and measured performance of any analyst or blogger you come across. We will browse to Apple’s analysis page which upon accessing runs javascript to generate the charts. Our code will wait until these are generated before continuing.

First, we need to import additional modules for our sample, such as By, expected_conditions, and the WebDriverWait class. The expected_conditions module provides ready-made conditions that are frequently used when automating web browsers, for example checking the visibility of an element.

After accessing the page, we will wait for a maximum of 10 seconds until a specific CSS class becomes visible. We are looking for span.fs-13, which becomes visible once the charts finish loading.
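A sketch of that wait, continuing with the driver from before; the exact TipRanks URL is an assumption based on the text:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.get("https://www.tipranks.com/stocks/aapl/stock-analysis")

# Block for up to 10 seconds until the chart's span.fs-13 element is visible;
# raises TimeoutException if it never appears
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "span.fs-13"))
)
```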

Get Page HTML

Once the driver has loaded a page and it has rendered completely, either by waiting for elements to load or simply by navigating to it, you can extract the page's rendered HTML quite easily with Selenium. The result can then be processed with BeautifulSoup or other packages to pull information out of it.

Run the following command to get the page HTML.
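Continuing with the driver from before:

```python
html = driver.page_source  # the fully rendered HTML as a string

# The HTML can then be parsed with BeautifulSoup, for example:
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(html, "html.parser")
```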

Conclusion

Selenium makes web automation very easy allowing you to perform advanced tasks by automating your web browser. We learned how to get Selenium ready to use with Python and its most important tasks such as navigating to a site, locating elements, entering information and waiting for items to load. Hope this article was helpful and stay tuned for more!

  • What this is for: Scraping web pages to collect review data and storing the data into a CSV
  • Requirements: Python Anaconda distribution, Basic knowledge of HTML structure and Chrome Inspector tool
  • Concepts covered: Selenium, Error exception handling
  • Download the entire Python file

In an earlier blog post, I wrote a brief tutorial on web scraping with BeautifulSoup. This is a great tool but has some limitations, particularly if you need to scrape a page with content loaded via AJAX.

Enter Selenium. Through its Python bindings, Selenium is capable of scraping AJAX-generated content. Before we continue, it is important to note that Selenium is technically a testing tool, not a scraper.

That said, Selenium is simple to use and can get the job done. In this tutorial, we’ll set up a code similar to what you would need to scrape review data from a website and store it in a CSV file.

Install Selenium library

First, we’ll install the Selenium library in Anaconda.

Click on your Start menu and search for Anaconda Prompt. Open a new Anaconda Prompt window.

Change the directory to where you have Anaconda installed. For example:
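The path below is a placeholder; adjust it to your own Anaconda install location:

```shell
cd C:\Users\yourname\Anaconda3
```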

Next, type:
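The install command itself is not shown in the text; the usual way to install Selenium into an Anaconda environment is via conda-forge (plain `pip install selenium` also works):

```shell
conda install -c conda-forge selenium
```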

It will take a moment to load and will ask you to confirm the installation. Once installed, open Anaconda Navigator, go to the Environment tab, and search the installed packages to make sure Selenium is there.

We’ll also need to install Chromedriver for the code to work. This essentially lets the code take control of a Chrome browser window.

ChromeDriver is available for download from the official ChromeDriver site. Extract the ZIP file and save the .EXE somewhere on your computer.


Getting started in Python

First we’ll import our libraries and establish our CSV and Pandas dataframe.

Next we’ll define the URLs we want to scrape as an array. We’ll also define the location of our web driver EXE file.
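For example, with placeholder URLs and a placeholder driver location:

```python
# Pages to scrape (hypothetical URLs) and the ChromeDriver .EXE location
urls = [
    "https://www.example.com/business-1",
    "https://www.example.com/business-2",
]
driver_path = r"C:\webdrivers\chromedriver.exe"
```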

Because we’re scraping multiple pages, we’ll create a for loop to repeat our data gathering steps for each site.

Selenium can grab elements by their ID, class, tag, or other properties. To find the ID, class, tag, or other property you want to scrape, right-click within the Chrome browser and select Inspect (or press F12 to open the Inspector window).

In this case we’ll start by collecting the H1 data. This is simple with the find_element_by_tag_name method.
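Continuing from the setup above (the `urls` list and `driver_path` are the placeholders defined earlier); this uses the Selenium 3-style `find_element_by_*` calls named in the text:

```python
driver = webdriver.Chrome(executable_path=driver_path)

for url in urls:
    driver.get(url)

    # The page's main heading, e.g. the business name
    name = driver.find_element_by_tag_name("h1").text
```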

Cleaning strings

Next, we’ll collect the type of business. For this example, the site I was scraping needed this data cleaned a little bit because of how the data was stored. You may run into a similar situation, so let’s do some basic text cleaning.

When I looked at the section markup with Chrome Inspector, it looked something like this:

In order to send clean data to the CSV, we’ll need to remove the “Categories:” text and replace line breaks with a pipe character to store data like this: “Type1|Type2”. This is how we can accomplish that:
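A minimal sketch of that cleaning step, assuming the element's text starts with "Categories:" followed by one category per line, as described above:

```python
# Text as Selenium might return it for the hypothetical categories div
raw = "Categories:\nType1\nType2"

# Drop the label, trim whitespace, and join the categories with a pipe
business_type = raw.replace("Categories:", "").strip().replace("\n", "|")
print(business_type)  # Type1|Type2
```

In the real script, `raw` would come from something like `driver.find_element_by_class_name("categories").text` (the class name is hypothetical).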

Scraping other elements

For the other elements, we’ll use Selenium’s other methods to capture by class.
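For example, continuing inside the loop; the class names here are hypothetical, so check the real ones with Chrome Inspector on the site you are scraping:

```python
# Grab elements by their CSS class (Selenium 3-style calls)
stars = driver.find_element_by_class_name("starrating").text
reviews = driver.find_element_by_class_name("reviewcount").text
```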


Now, let’s piece all the data together and add it to our dataframe. Using the variables we created, we’ll populate a new row to the dataframe.
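A sketch of that step, with placeholder values standing in for what the loop collects:

```python
import pandas as pd

# Dataframe set up earlier in the script
df = pd.DataFrame(columns=["name", "type", "stars", "reviews"])

# Values gathered during the loop (placeholders for illustration)
name, business_type, stars, reviews = "Acme Cafe", "Type1|Type2", "4.5", "120"

# Append one row per page scraped
df.loc[len(df)] = [name, business_type, stars, reviews]

# After the loop finishes, write everything out to the CSV
df.to_csv("reviews.csv", index=False)
```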

Handling errors

One error you may encounter is missing data. For example, if a business doesn't have any reviews or comments, the site may not render the div that contains this info into the page.


If you attempt to scrape a div that doesn’t exist, you’ll get an error. But Python lets you handle errors with a try block.


So let’s assume our business may not have a star rating. In the try: block we’ll write the code for what to do if the “starrating” class exists. In the except: block, we’ll write code for what to do if the try: block returns an error.


A word of caution: if you are planning to do statistical analysis of the data, be careful how you replace error data in the except block. For example, if your code cannot find the number of stars, entering this data as “0” will skew the analysis, because there is a difference between having a 0-star rating and not having a star rating at all. So for this example, data that returns an error will produce a “-” in the dataframe and CSV file instead of a 0.
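A sketch of that try/except, using the hypothetical "starrating" class from before; catching Selenium's specific NoSuchElementException (rather than a bare except) keeps other errors visible:

```python
from selenium.common.exceptions import NoSuchElementException

try:
    stars = driver.find_element_by_class_name("starrating").text
except NoSuchElementException:
    stars = "-"  # "-" rather than 0, so a missing rating doesn't skew analysis
```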

