This post walks through a BeautifulSoup example to show how useful the library is for scraping web pages with Python.
A problem you will run into with HTML is that the markup can be technically correct and still be formatted in a very ugly fashion.
Even if you understand HTML, badly formatted code can be hard to read.
For example, there could be uneven indentation, inconsistent line spacing, or a host of other problems. The BeautifulSoup example in this post shows how easily the library can be used as a Python HTML parser.
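To make the formatting problem concrete, here is a minimal sketch of tidying up badly indented HTML. It assumes Python 3 and the newer bs4 package rather than the Python 2 BeautifulSoup.py file used in this post, and the sample markup and the name tidy_html are made up for illustration.

```python
# Assumption: Python 3 with the bs4 package installed (not the Python 2
# BeautifulSoup.py file discussed in this post).
from bs4 import BeautifulSoup

# Hypothetical sample: valid HTML with uneven indentation and line breaks.
messy_html = """<html><body>
        <h1>Page title</h1>
<p>First paragraph.
     </p><p>Second paragraph.</p>
  </body></html>"""


def tidy_html(html):
    """Parse the HTML and return it re-indented, one tag per line."""
    soup = BeautifulSoup(html, "html.parser")  # parser from the standard library
    return soup.prettify()


pretty = tidy_html(messy_html)
print(pretty)
```

The output is the same document with consistent indentation, which is usually much easier to read than the original.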
After you download BeautifulSoup, place the BeautifulSoup.py file in the same folder as your Python programs. You can download it here →
The demonstrations in this post use BeautifulSoup with Python 2, rather than Python 3. The concepts are very similar for both versions of Python, but installation is a bit different.
BeautifulSoup Example for Retrieving Web Pages
Thanks to BeautifulSoup, it is very easy to retrieve web pages, and print all the “href” attributes of the anchor tags. These are essentially the links that go to other web pages. The whole program to do this is shown in the picture below.
The second line in the code pictured above is crucial because it imports all routines in the BeautifulSoup.py file.
The variable “html” (which could be called anything, but calling it html makes sense) holds a string consisting of the entire HTML page.
The variable “soup” becomes an object of parsed HTML data. You can then ask to retrieve certain things from this variable.
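The roles of “html” and “soup” can be sketched roughly as follows. This is not the original program: it assumes Python 3 and the bs4 package instead of Python 2, and the helper names fetch_html and make_soup, as well as the sample page, are hypothetical. To avoid depending on a network connection, the demonstration at the bottom parses an inline string.

```python
# Assumption: Python 3 with the bs4 package installed.
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return the entire HTML document as one string."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def make_soup(html):
    """Turn an HTML string into an object of parsed HTML data."""
    return BeautifulSoup(html, "html.parser")


# Hypothetical sample page, standing in for the result of fetch_html(url).
html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"
soup = make_soup(html)

# The soup object can now be asked for specific parts of the page.
print(soup.title.string)
print(soup.p.string)
```

With a real address you would write html = fetch_html(url) and pass the result to make_soup in the same way.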
How to Print All Anchor Tags in an HTML Document
An anchor tag in HTML looks like <a> </a>, so passing ‘a’ into the soup object returns every anchor tag in the document. Reading the “href” attribute of each tag then gives you the web address of the page that the tag links to.
This BeautifulSoup example shows its power as a Python scraper, using the “urllib” and “BeautifulSoup” libraries to parse HTML.
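As a rough illustration of the anchor tag lookup, the sketch below collects the “href” value of every anchor tag in a small document. It assumes Python 3 and the bs4 package rather than the Python 2 setup described in this post, and the function name extract_links and the sample links are invented for the example.

```python
# Assumption: Python 3 with the bs4 package installed.
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href value of every anchor tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup("a"):        # calling the soup with "a" finds all <a> tags
        href = tag.get("href")   # None when the tag has no href attribute
        if href is not None:
            links.append(href)
    return links


# Hypothetical sample document with three anchor tags, one without an href.
sample_html = """
<html><body>
  <a href="http://example.com/first">First</a>
  <a name="bookmark">No link here</a>
  <a href="/second.html">Second</a>
</body></html>
"""

for link in extract_links(sample_html):
    print(link)
```

Anchor tags without an “href” attribute are skipped here; printing None for them instead would be an equally reasonable choice.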