How to Webscrape with Requests, Selenium, and Beautifulsoup in Python

Photo by Chris Ried on Unsplash

This article is mainly for beginners at web scraping, and the example below should help you think through how to scrape something specific off a website. The best way to learn how to grab specific HTML tags is to pick a website you visit often and try to automate something with the text you can grab from it. It's common advice, but it holds: read the official docs when trying to pinpoint an HTML tag, then google carefully.

I typically keep separate test files for specific HTML tag targets that are hard to grab, plus a main file with a rough draft of the entire scraper. Once the rough draft is done and the scrapes output the format I want, I start to abstract the code. Once it works, keep refactoring as time allows.

sudo pip install requests

Requests is fast: a single GET or POST, one request and one response. You typically want to set headers so you aren't flagged as a bot; otherwise you might get a bad response, or worse, a wall of CAPTCHA HTML. One thing to note is that Requests cannot execute JavaScript on a website. So if the data you've extracted looks different from the HTML you see when inspecting in your browser, check whether the page runs JavaScript when you load it in the browser.
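A minimal sketch of a GET request with browser-like headers. The User-Agent string and the `fetch` helper name are illustrative choices, not from the original article:

```python
import requests

# A browser-like User-Agent makes the request less likely to be flagged
# as a bot. (This particular string is just an example.)
headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) "
        "Gecko/20100101 Firefox/115.0"
    )
}

def fetch(url):
    """Return the raw HTML of a page, raising on a bad status code."""
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return resp.text

# Usage: html = fetch("https://example.com")
```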

sudo pip install selenium

Selenium is great for mimicking an actual browser, because it actually drives one. It can execute JavaScript after a page load, and it is hard to detect as a bot because its traffic looks more "natural" to bot detection, especially if you add sleeps in between actions. Selenium is typically used for browser automation, since it is very good at pinpointing HTML tags to click, but it is also good at grabbing the full HTML of pages that require JavaScript. Headless mode is great for automating in the background: it runs all of your code without bringing up a browser window at all, which helps with speed if you are doing something repetitive. Be sure to check out Selenium's WebDriver documentation to understand more about how to use it. Personally, I mainly use Firefox (geckodriver).

sudo pip install beautifulsoup4

Beautiful Soup is great at parsing HTML; its Python methods are very intuitive for navigating an HTML tree, and it is very good at pinpointing specific tags when you want their attributes. It also makes parent tags iterable, so you can think about them in loops. A soup object essentially turns each HTML tag into a Python object, which makes targeting a specific piece of text or an attribute a lot easier.
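A small self-contained illustration of that idea. The HTML snippet here is made up to mirror an API docs table:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <caption>Endpoints</caption>
  <tr><td><a href="/random">/random</a></td><td>Get a random quote</td></tr>
  <tr><td><a href="/quotes">/quotes</a></td><td>List all quotes</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Every tag becomes a navigable object: rows are iterable,
# and attributes are dict-style lookups.
for row in soup.find_all("tr"):
    link = row.find("a")
    print(link["href"], "->", link.find_next("td").text)
```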

Let’s create an example: we’ll use a practice site, and say we want to scrape all of the non-link text under “Endpoints” in the “Quotes” section. I’ll write these steps as if it were live, since this is a new website and data target for me to scrape.

Step 1, Set up your workspace: have a browser on the side with inspect mode on, your scraping code in one panel, and a place to run that code. I find this to be the most efficient way to test a target quickly:

Step 2, Get the web page data: try Requests first; if I can’t get a 200 response with the data I want, I switch to Selenium. The main thing is getting that response data stored in a variable so that we can manipulate it to output the specific text that we want.

This is using selenium
This is using requests
This is running code using requests
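The screenshots above roughly correspond to a sketch like this. The `get_page` helper name is hypothetical:

```python
import requests

def get_page(url):
    """Try Requests first; fall back to Selenium only if needed."""
    resp = requests.get(
        url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10
    )
    if resp.ok:
        return resp.text
    # A bad status (or missing data) is the cue to switch to Selenium.
    raise RuntimeError(f"requests got status {resp.status_code}; try Selenium")
```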

Also make sure it is the actual HTML you want:

Actual HTML is bigger

You can also write it to a file in case you don’t want to send a new request every time you test your output:
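For example, a pair of hypothetical helpers (the filename page.html is arbitrary):

```python
def save_page(html, path="page.html"):
    """Cache the fetched HTML locally so tests don't re-hit the site."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

def load_page(path="page.html"):
    """Read the cached HTML back for parsing."""
    with open(path, encoding="utf-8") as f:
        return f.read()
```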

Step 3, Decide what we are trying to output: we said earlier that we just want the list under “Endpoints”, and its text only (no links). So let’s inspect for a way to get that. We can see that there are two tables, but we only want one, so we can grab all the tables with alltables = soup.find_all("table") . Next, the title of each table sits in a fairly uniform spot in the markup, so we can reach its text by hopping two elements forward with .next_element.next_element.text . We can add this to the loop so we check every table on the page, with a conditional that prints something when that next-next element’s text equals "Endpoints" :
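A self-contained sketch of that loop, using stand-in HTML with two tables shaped like the page described above:

```python
from bs4 import BeautifulSoup

# Mock markup: two tables, each titled by its first header cell.
html = """
<table><tr><th>Parameters</th></tr><tr><td>page</td><td>ignore me</td></tr></table>
<table><tr><th>Endpoints</th></tr>
  <tr><td><a href="/quotes">/quotes</a></td><td>List quotes</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

target = None
for table in soup.find_all("table"):
    # Hop two elements in: <table> -> <tr> -> <th>, then read its text.
    if table.next_element.next_element.text == "Endpoints":
        print("found it")
        target = table
```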

Let’s run it:

Great, it found it. Now we can loop through each tag in this table to see how we can extract only the text. Inspecting in the browser shows that each tr of the table has two td elements, so we can use the find_next() method to jump to the 2nd td of each tr . Let’s try that:
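With stand-in markup again, the row loop might look like this. On the real page, rows without two td cells raise an AttributeError, which the next step deals with:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td><a href="/random">/random</a></td><td>Random quote</td></tr>
<tr><td><a href="/quotes">/quotes</a></td><td>List quotes</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

for tr in soup.find("table").find_all("tr"):
    # find("td") lands on the 1st cell; find_next("td") hops to the 2nd.
    print(tr.find("td").find_next("td").text)
```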

Perfect, we now have the text for the most part. Let’s just take care of the AttributeError quickly and store all the text in a list, and also remove the first item from the list, since it comes from the title of the table .
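Putting it together as a sketch: the try/except swallows rows that don’t have two td cells, and the first captured item (which comes from the title row in this mock-up) is dropped:

```python
from bs4 import BeautifulSoup

# Mock table: a title row, a header row with <th> only, then data rows.
html = """
<table>
<tr><td>Endpoints</td><td>API paths</td></tr>
<tr><th>Route</th><th>What it returns</th></tr>
<tr><td><a href="/random">/random</a></td><td>Random quote</td></tr>
<tr><td><a href="/quotes">/quotes</a></td><td>List quotes</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

texts = []
for tr in soup.find("table").find_all("tr"):
    try:
        texts.append(tr.find("td").find_next("td").text)
    except AttributeError:
        # Rows without a <td> (e.g. the <th> header row) land here.
        pass
texts = texts[1:]  # drop the first item, which comes from the table title
print(texts)
```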

Now we have a list of the data we want, and it will be easy to format it into anything you need: for example, writing the items between certain lines of a README generator, or inserting each item after every number in a string you have. This is generally how I go about scraping data: start by trying to get the entire page, then slowly cut away the stuff you don’t want.

I will discuss Selenium in a different article later, since it is a big topic of its own. Requests and beautifulsoup4 are generally good enough for basic scraping.

