BeautifulSoup Example as a Python Scraper

This post will give a BeautifulSoup example to demonstrate its usefulness as a Python scraper.

A problem you will encounter with HTML is that while the code might be technically correct, it may be formatted in a very ugly way.

Even if you understand HTML, it can be hard to read if the code is ugly.

For example, there could be uneven indentations, inconsistent line spacing, or a host of other bad elements. A BeautifulSoup example will show how it can easily be used as a Python HTML parser.
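To make the point about ugly HTML concrete, here is a minimal sketch of BeautifulSoup tidying up a messy snippet. It assumes Python 3 and the `bs4` package (BeautifulSoup 4) rather than the Python 2 setup used in the rest of this post, and the sample HTML and example.com addresses are made up for illustration.

```python
from bs4 import BeautifulSoup

# Deliberately messy but valid HTML: uneven indentation and odd line breaks.
# (Made-up sample markup, not taken from a real page.)
ugly_html = """<html><body>
        <h1>Links</h1>
<p>See
   <a href="http://example.com/one">one</a>
            and <a
 href="http://example.com/two">two</a>.</p>
</body></html>"""

# "html.parser" is Python's built-in parser, so no extra parser is required.
soup = BeautifulSoup(ugly_html, "html.parser")

# prettify() returns the parsed tree as a string with consistent indentation.
pretty = soup.prettify()
print(pretty)
```

The printed version nests each tag on its own indented line, which is usually far easier to scan than the original.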

[Image: BeautifulSoup. Use BeautifulSoup as a Python scraper for HTML pages.]

After you download BeautifulSoup, place the BeautifulSoup.py file in the same folder as your Python programs. You can download it here.

The demonstrations in this post use BeautifulSoup with Python 2, rather than Python 3. The concepts are very similar for both versions of Python, but installation is a bit different.

BeautifulSoup Example for Retrieving Web Pages

Thanks to BeautifulSoup, it is very easy to retrieve a web page and print the “href” attribute of every anchor tag on it. These are essentially the links that go to other web pages. The whole program to do this is shown in the picture below.

[Image: BeautifulSoup example as a Python scraper. This program takes the address of an HTML page as user input and prints the link from every anchor tag on that page.]

The second line in the code pictured above is crucial because it imports all routines in the BeautifulSoup.py file.

The variable “html” (which could be called anything, but calling it html makes sense) holds a string consisting of the entire HTML page.

The variable “soup” becomes an object of parsed HTML data. You can then ask to retrieve certain things from this variable.
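The “html” and “soup” steps described above can be sketched roughly as follows. This is not the program from the picture: it assumes Python 3 and the `bs4` package (BeautifulSoup 4) instead of Python 2 and BeautifulSoup.py. The helper `fetch_html` and the `demo_url` value are hypothetical names introduced for illustration, and a `data:` URL stands in for a real web address so the sketch runs without a network connection.

```python
import urllib.request
from urllib.parse import quote

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the page at `url` and return its raw HTML (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # In Python 3, read() returns bytes; BeautifulSoup accepts them as-is.
        return response.read()


# A data: URL stands in for a real web address so the sketch runs offline.
# Swap in input("Enter - ") to prompt for a real page instead.
demo_url = "data:text/html," + quote(
    "<html><head><title>Demo page</title></head><body></body></html>"
)

html = fetch_html(demo_url)                 # the entire page
soup = BeautifulSoup(html, "html.parser")   # an object of parsed HTML data

# Once parsed, the soup object can be asked for specific pieces of the page.
print(soup.title.string)
```

Keeping the download step in its own function makes it easy to swap the demo address for a real one without touching the parsing code.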

How to Print All Anchor Tags in an HTML Document

An anchor tag in HTML looks like <a> </a>. Passing ‘a’ into the soup object returns a list of all the anchor tags on the page, and reading the “href” attribute of each tag gives the web address of the page that the tag links to.
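Here is a small sketch of that anchor tag lookup, run on a hard-coded snippet instead of a downloaded page. As an assumption, it uses Python 3 and the `bs4` package (BeautifulSoup 4) rather than the Python 2 setup from this post, and the sample HTML is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample page: two links and one anchor with no href attribute.
sample_html = """
<html><body>
  <a href="http://example.com/first">First</a>
  <a href="/second">Second</a>
  <a name="no-link">An anchor without an href</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Calling the soup object with a tag name is shorthand for find_all().
tags = soup("a")

# Each result is a Tag object; get() reads an attribute and returns None
# when the attribute is missing.
links = [tag.get("href") for tag in tags]

for link in links:
    print(link)
```

The third anchor has no “href”, so it prints as None. In a real scraper you would probably want to skip those entries.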

This BeautifulSoup example demonstrates its power as a Python scraper, using the “urllib” and “BeautifulSoup” libraries to retrieve and parse HTML.

 
