Web Scraping Using Selenium And Beautifulsoup Python

In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️ ) in a step-by-step tutorial.

Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

Learn how to scrape the web with Selenium and Python in this step-by-step tutorial. We will use Selenium to automate a Hacker News login. Kevin Sahin. Updated: 11 February 2021. 8 min read. Selenium is a free-to-use framework for automated testing of web applications. While it was originally developed to test websites and web apps, the Selenium WebDriver with Python can also be used to scrape websites.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.

At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!

Selenium is useful when you have to perform an action on a website such as:

Clicking on buttons
Filling forms
Scrolling
Taking a screenshot

It is also useful for executing JavaScript code. Let's say that you want to scrape a Single Page Application, and you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.

Installation

We will use Chrome in our example, so make sure you have the following on your local machine:

Chrome
A Chromedriver binary that matches your Chrome version
selenium package

To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then run pip install selenium.

Quickstart

Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
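A minimal sketch of that first launch could look like the following. The DRIVER_PATH value and the start_chrome name are placeholders of mine, and the Service-based constructor assumes Selenium 4. The Selenium imports sit inside the function only so the snippet loads on a machine where Selenium is not installed yet.

```python
DRIVER_PATH = "/path/to/chromedriver"  # placeholder: point this at your Chromedriver binary


def start_chrome(url, driver_path=DRIVER_PATH):
    """Open a regular (non-headless) Chrome window and load `url`."""
    # Third-party imports (pip install selenium), assuming Selenium 4.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    driver.get(url)
    return driver
```

You would call it with something like driver = start_chrome("https://news.ycombinator.com/") and finish with driver.quit() to close the browser.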

This will launch Chrome in headful mode (like regular Chrome, but controlled by your Python code). You should see a message stating that the browser is controlled by automated software.

You can also run Chrome in headless mode (without any graphical user interface), which is what you need on a server. See the following example:
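Here is a hedged sketch of a headless launch. The function name and the window size are my own choices, and the code assumes Selenium 4 with a Chromedriver that Selenium can find on its own (for example on your PATH).

```python
def start_headless_chrome(url, window_size="1920,1200"):
    """Start Chrome without a GUI, load `url` and return the driver."""
    # Third-party imports (pip install selenium), assuming Selenium 4.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # no visible browser window
    options.add_argument(f"--window-size={window_size}")

    driver = webdriver.Chrome(options=options)
    driver.get(url)
    return driver
```

After driver = start_headless_chrome("https://news.ycombinator.com/"), you can print driver.page_source and then call driver.quit().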

The driver.page_source property returns the full page HTML code.

Here are two other interesting WebDriver properties:

driver.title gets the page's title
driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)

Locating Elements

Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).

There are many methods available in the Selenium API to select elements on the page. You can use:

Tag name
Class name
IDs
XPath
CSS selectors

We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C (or Cmd + Shift + C on macOS) instead of having to right click + inspect each time:

find_element

There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag in this HTML:
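As an illustration, assume the page contains the small piece of markup shown in the comment below (the class and ID values are made up for this sketch). With the Selenium 4 locator API, the same h1 can then be reached in several ways:

```python
# Hypothetical markup assumed for this sketch:
#   <html><body>
#     <h1 class="someclass" id="greatID">Super title</h1>
#   </body></html>


def locate_h1(driver):
    """Return the same <h1> element found through four different locators."""
    # Third-party import (pip install selenium), assuming Selenium 4.
    from selenium.webdriver.common.by import By

    return {
        "tag name": driver.find_element(By.TAG_NAME, "h1"),
        "class name": driver.find_element(By.CLASS_NAME, "someclass"),
        "id": driver.find_element(By.ID, "greatID"),
        "xpath": driver.find_element(By.XPATH, "//h1"),
    }
```

Older tutorials use helpers such as find_element_by_id; those were removed in Selenium 4.3, which is why the sketch passes a By constant instead.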

All these methods also have find_elements (note the plural) to return a list of elements.

For example, to get all anchors on a page, use the following:
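A short sketch of the plural form, assuming Selenium 4 (the get_all_links name is mine):

```python
def get_all_links(driver):
    """Return the href of every <a> element on the current page."""
    # Third-party import (pip install selenium), assuming Selenium 4.
    from selenium.webdriver.common.by import By

    anchors = driver.find_elements(By.TAG_NAME, "a")  # a list, possibly empty
    return [anchor.get_attribute("href") for anchor in anchors]
```

Unlike find_element, find_elements does not raise when nothing matches; it simply returns an empty list.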

Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).

XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.

WebElement

A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those HTML elements. Here are the most useful:

Accessing the text of the element with the property element.text
Clicking on the element with element.click()
Accessing an attribute with element.get_attribute('class')
Sending text to an input with: element.send_keys('mypassword')

There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.

This is useful for avoiding honeypots (such as hidden inputs that only a bot would fill in).

Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:

This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.

That's a classic honeypot.
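One way to stay clear of such traps is to type only into inputs that is_displayed() reports as visible. The helper below is a sketch of that idea; its name is hypothetical and it works on any list of WebElement objects, for example the result of find_elements.

```python
# Example of a hidden input that a careful bot should leave alone
# (hypothetical markup): <input type="hidden" name="custId" value="">


def fill_visible_inputs(inputs, value):
    """Type `value` into each visible input and return how many were filled."""
    filled = 0
    for element in inputs:
        if not element.is_displayed():
            continue  # hidden from real users, so possibly a honeypot
        element.send_keys(value)
        filled += 1
    return filled
```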

Full example

Here is a full example using Selenium API methods we just covered.

We are going to log into Hacker News:

In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.

In order to authenticate we need to:

Go to the login page using driver.get()
Select the username input using driver.find_element_by_* and then element.send_keys() to send text to the input
Follow the same process with the password input
Click on the login button using element.click()

Should be easy right? Let's see the code:
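The steps above can be sketched as follows. The XPath expressions are my assumptions about the markup of the Hacker News login page, so check them in the dev tools before relying on them. The sketch uses the Selenium 4 find_element(By...) form rather than the older find_element_by_* helpers.

```python
LOGIN_URL = "https://news.ycombinator.com/login"


def log_in(driver, username, password):
    """Fill in the login form and submit it."""
    # Third-party import (pip install selenium), assuming Selenium 4.
    from selenium.webdriver.common.by import By

    driver.get(LOGIN_URL)
    # find_element returns the first match on the page, assumed here to be
    # the login form rather than any other form with similar inputs.
    driver.find_element(By.XPATH, "//input[@type='text']").send_keys(username)
    driver.find_element(By.XPATH, "//input[@type='password']").send_keys(password)
    driver.find_element(By.XPATH, "//input[@value='login']").click()
```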

Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?

We could try a couple of things:

Check for an error message (like “Wrong password”)
Check for one element on the page that is only displayed once logged in.

So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!

We can't just check if the element is None because all of the find_element_by_* methods raise an exception if the element is not found in the DOM. So we have to use a try/except block and catch the NoSuchElementException exception:
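A minimal version of that check, assuming Selenium 4 and the “logout” ID mentioned above (the is_logged_in name is mine):

```python
def is_logged_in(driver):
    """Return True if the logout link is present on the current page."""
    # Third-party imports (pip install selenium), assuming Selenium 4.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    try:
        driver.find_element(By.ID, "logout")
    except NoSuchElementException:
        return False
    return True
```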

Taking a screenshot

We could easily take a screenshot using:
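For instance, a small helper along these lines would do it. The function name, file name and window size are arbitrary choices for this sketch; set_window_size and save_screenshot are regular WebDriver methods.

```python
def take_screenshot(driver, path="screenshot.png", width=1280, height=1024):
    """Resize the window, save a PNG screenshot and return True on success."""
    driver.set_window_size(width, height)
    return driver.save_screenshot(path)
```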

Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then, you need to make sure that every asynchronous HTTP call made by the frontend JavaScript code has finished, and that the page is fully rendered.

In our Hacker News case it's simple and we don't have to worry about these issues.

Waiting for an element to be present

Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.

If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:

Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
Use a WebDriverWait object.

If you use time.sleep() you will probably pick an arbitrary value. The problem is that you end up waiting either too long or not long enough. Also, the website can load slowly on your local Wi-Fi connection but be 10 times faster on your cloud server. With a WebDriverWait object you wait only as long as it takes for your element/data to load, up to a timeout you choose.
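A sketch of such a wait, assuming Selenium 4 (the wait_for_element name is mine, and “mySuperId” is just an example ID):

```python
def wait_for_element(driver, element_id="mySuperId", timeout=5):
    """Block until the element is present in the DOM, then return it.

    Raises selenium.common.exceptions.TimeoutException if the element
    has not appeared after `timeout` seconds.
    """
    # Third-party imports (pip install selenium), assuming Selenium 4.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, element_id))
    )
```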

This will wait up to five seconds for an element located by the ID “mySuperId” to be loaded. There are many other interesting expected conditions like:

element_to_be_clickable
text_to_be_present_in_element

You can find more information about this in the Selenium documentation.

Executing Javascript

Sometimes, you may need to execute some JavaScript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it. You can easily do this with Selenium:
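For example, scrolling to the bottom of the page comes down to a single execute_script call. The helper name is mine; the JavaScript snippet is plain DOM scripting.

```python
def scroll_to_bottom(driver):
    """Scroll the current page all the way down using JavaScript."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
```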

Conclusion

I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don't hesitate to take a look at our general Python web scraping guide.

Selenium is often necessary to extract data from websites using lots of JavaScript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.

Selenium is also an excellent tool to automate almost anything on the web.

If you perform repetitive tasks like filling forms or checking information behind a login form where the website doesn't have an API, it may be a good idea to automate them with Selenium. Just don't forget this xkcd:

This article was also featured at Towards Data Science here.

Motivation

I’m currently looking for a new data science role, but have found it frustrating that there are so many different websites, which list different jobs and at different times. It was becoming laborious to continually check each website to see what new roles had been posted.

But then I remembered; I’m a data scientist. There must be an easier way to automate this process. So I decided to create a pipeline, which involved the following steps, and to automate part of the process using Python:

1. Extract all new job postings at a regular interval

I decided to write some Python code to web-scrape jobs from the websites I was checking the most. This is what I’ll outline in this post.

2. Check the new job postings against my skills and interests

I could have tried to automate this, but I didn’t want to risk disregarding a job that may be of interest because of the criteria that I put into the code. I decided to manually review the postings that my web-scraping returned.

3. Spend one session a week applying for the new jobs that made the cut.

I have heard about people automating this stage. However, I believe chances will be better if you take the time and effort to make tailored applications, and somewhat disagree with this approach on principle.

NOTE: all code discussed in this post is available here. I’ve also shared my favourite resources for learning everything mentioned in this post at the bottom.

I decided to use BeautifulSoup and, if needed, Selenium so my import statements were the following:

One of the main sites I was checking for data science jobs was Indeed.co.uk.

(1) Extracting the initial HTML data

I was pleased to see that they had a standardised format for URL, which would make the web scraping easier. The URL was ‘indeed.co.uk/jobs?’ followed by ‘q=job title’ & ‘l=location’ — as below:

This made tailoring the job title and location pretty easy. I decided to create a function that would take in job title and location as arguments, so that anybody could tailor the search:

Using the urlencode function from urllib enabled me to slot the arguments in to create the full url. I included ‘fromage=list’ and ‘sort=date’ within the URL, so that only the most recent jobs were displayed.

Then, with the handy help of BeautifulSoup, I could extract the HTML and parse it appropriately.

Finally, I wanted to find the appropriate element that contained all of the job listings. I found this by opening the URL (indeed.co.uk/jobs?q=data+scientist&l=london) and using the ‘Inspect’ element. Using this, I could see that <td id=“resultsCol”> contained all of the job listings, so I used soup.find(id=“resultsCol”) to select all of these jobs.
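Putting those three steps together, a sketch of the loading function could look like this. The function names are illustrative, the query values come from the description above, and fetching the page with urllib.request is my assumption (any HTTP client would do). BeautifulSoup is imported inside the function only so the URL helper can be used without it installed.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://www.indeed.co.uk/jobs?"


def build_indeed_url(job_title, location):
    """Build the search URL, most recent jobs first."""
    query = {"q": job_title, "l": location, "fromage": "list", "sort": "date"}
    return BASE_URL + urllib.parse.urlencode(query)


def load_indeed_jobs_div(job_title, location):
    """Download the results page and return the element holding all job cards."""
    # Third-party import: pip install beautifulsoup4
    from bs4 import BeautifulSoup

    url = build_indeed_url(job_title, location)
    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id="resultsCol")
```

Note that urlencode turns the space in "data scientist" into a plus sign, which matches the URL format shown earlier.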

(2) Extracting job details

Now that I had the ‘soup’ of HTML containing all the job listings, the next step was to extract the information I wanted, which was:

The job titles
The companies
The link to the full job profile
The date it was listed

For each of these, I again used Inspect to identify the appropriate section, and used the .find() function to identify them, as follows:
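The four extractors can be sketched like this. The tag and class names are assumptions about Indeed's markup at the time and will need checking with Inspect, and the function names are illustrative. Each function takes one job card, that is, a BeautifulSoup element.

```python
def extract_job_title(job_elem):
    """Return the job title text of one job card."""
    return job_elem.find("h2", class_="title").text.strip()


def extract_company(job_elem):
    """Return the company name of one job card."""
    return job_elem.find("span", class_="company").text.strip()


def extract_link(job_elem):
    """Return the absolute link to the full job profile."""
    # The href on the card is assumed to be relative to the site root.
    return "https://www.indeed.co.uk" + job_elem.find("a")["href"]


def extract_date(job_elem):
    """Return the 'posted' text of one job card, e.g. how many days ago."""
    return job_elem.find("span", class_="date").text.strip()
```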

(3) Iterating over each job listing

Using ‘Inspect’ I saw that each job card was contained within a div with the class ‘jobsearch-SerpJobCard’, so I used BeautifulSoup’s .find_all function as follows:

Then, for each card I wanted to extract the 4 pieces of key information listed above and save them in a list.

I wanted to make my function generalisable, so that people could choose which characteristics they wanted to search for (out of job titles, companies, link and date listed), so I created a list ‘desired_characs’ to specify this.

For each characteristic, I looped over and added them to a list as follows:

Finally, I brought all of these into a jobs_list, which could then be exported to the chosen format — an Excel, DataFrame or otherwise:

Using ‘cols’ enabled me to specify the titles for each key for the jobs_list dictionary, based on the characteristics that were extracted.

‘extracted_info’ was a list of lists; each list containing, for example, all the job titles or all the companies.

Using these data structures made compiling the final jobs_list dictionary much easier.
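The iteration and assembly steps described above can be sketched in one function. The extractors argument is a hypothetical addition of mine: a dictionary mapping each characteristic name to a function that pulls that value out of a single job card. The cols, extracted_info and jobs_list names follow the text.

```python
def extract_job_information(job_soup, desired_characs, extractors):
    """Build the jobs_list dictionary from the results element.

    Returns (jobs_list, number_of_job_cards).
    """
    job_elems = job_soup.find_all("div", class_="jobsearch-SerpJobCard")

    cols = []            # one column title per requested characteristic
    extracted_info = []  # one list of values per column, in the same order
    for charac in desired_characs:
        cols.append(charac)
        extracted_info.append([extractors[charac](elem) for elem in job_elems])

    jobs_list = dict(zip(cols, extracted_info))
    return jobs_list, len(job_elems)
```

A call could then look like extract_job_information(job_soup, ["titles", "companies"], {"titles": some_title_function, "companies": some_company_function}).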

(4) Storing and saving jobs

I converted the ‘jobs_list’ dictionary into a DataFrame and then exported it to the filename and filetype that the user selects with the following function:
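A possible shape for that function is shown below. Choosing the format from the file extension is my own simplification, and the save_jobs name is illustrative; pandas needs an extra engine such as openpyxl to write .xlsx files.

```python
def save_jobs(jobs_list, filename):
    """Write the jobs dictionary to `filename` as CSV or Excel."""
    # Third-party import: pip install pandas
    import pandas as pd

    jobs = pd.DataFrame(jobs_list)
    if filename.lower().endswith(".csv"):
        jobs.to_csv(filename, index=False)
    else:
        jobs.to_excel(filename, index=False)
```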

(5) Integrating into a single function call

Finally, I wanted users to be able to do all of the above with a single function call. I did so as follows:

Satisfyingly, this produced a final result which could be called pretty easily, as follows:

The next step was to generalise my script to also take in job listings from other websites. Another site I’ve been searching a lot was CWjobs. However, adding this proved more of a challenge.

When I inspected the URL, I noticed that there wasn’t a consistent pattern with the keyword arguments.

Therefore, I decided I would use a Selenium Webdriver to interact with the website — to enter the job title and location specified, and to retrieve the search results.

(1) Downloading and initiating the driver

I use Google Chrome, so I downloaded the appropriate web driver from here and added it to my working directory. I then created a function to initiate the driver as follows:
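A hedged sketch of such a function, assuming Selenium 4 and the usual binary names (chromedriver for Chrome, geckodriver for Firefox). The function and parameter names are illustrative rather than taken from my repository.

```python
import os

# Driver binary expected in the given folder for each supported browser.
DRIVER_BINARIES = {"chrome": "chromedriver", "firefox": "geckodriver"}


def driver_binary_path(location_of_driver, browser):
    """Return the expected path of the driver binary for `browser`."""
    if browser not in DRIVER_BINARIES:
        raise ValueError(f"Unsupported browser: {browser!r}")
    return os.path.join(location_of_driver, DRIVER_BINARIES[browser])


def initiate_driver(location_of_driver=".", browser="chrome"):
    """Start the chosen browser using the driver binary in the given folder."""
    path = driver_binary_path(location_of_driver, browser)

    # Third-party imports (pip install selenium), assuming Selenium 4.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service as ChromeService
    from selenium.webdriver.firefox.service import Service as FirefoxService

    if browser == "chrome":
        return webdriver.Chrome(service=ChromeService(executable_path=path))
    return webdriver.Firefox(service=FirefoxService(executable_path=path))
```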

(if using an alternative browser, you would need to download the relevant driver, and ensure it has the name and location as specified above)

(2) Using the driver to extract the job information HTML ‘soup’

(3) Extracting the job information

The next step was the same as Steps 2 and 3 from the Indeed job scraping above, only tailored to CWjobs’ DOM structure, so I won’t go over it again here.

Once I finished the code, it was pretty cool seeing my web browser being controlled, without me touching the mouse or keyboard:

I was pretty happy with my functioning code, and decided to stop there (I did actually have to apply for the jobs after all).

A quick search on GitHub revealed that people had already made similar job scrapers for LinkedIn Jobs and a few other platforms.

I uploaded the code to a GitHub repository with a README in case anybody else wanted to scrape jobs using this code.

Then I sat back and felt pretty happy with myself. Now, I just run the scripts once a week, and then pick out the jobs that I want to apply for.

If I’m honest, the time and effort gains from using the script rather than doing things manually are fairly marginal. However, I had a lot of fun writing the code and got a bit more practice with BeautifulSoup and Selenium.

As a potential next step, I may set this up on AWS Lambda, so that it runs the new job search automatically once a week without me needing to trigger it. So there’s a possible future post coming on this if I do end up doing it.

I hope this was helpful/interesting!

https://realpython.com/beautiful-soup-web-scraper-python/ — a great overview / refresher on web scraping with Beautiful Soup

https://towardsdatascience.com/looking-for-a-house-build-a-web-scraper-to-help-you-5ab25badc83e — another interesting use case for a web scraper: to identify potential houses to buy

https://towardsdatascience.com/controlling-the-web-with-python-6fceb22c5f08 — a great introduction of Selenium

https://www.youtube.com/watch?v=--vqRAkcWoM — a great demo of selenium in use

https://github.com/kirkhunter/linkedin-jobs-scraper — a LinkedIn scraper, which demonstrates both selenium and beautiful soup