BeautifulSoup Example as a Python Scraper

This post will give a BeautifulSoup example to demonstrate its usefulness as a Python scraper.

A problem you will encounter with HTML is that while the code might be technically correct, it could be formatted in a very ugly fashion.

Even if you understand HTML, it can be hard to read if the code is ugly.

For example, there could be uneven indentations, inconsistent line spacing, or a host of other bad elements. A BeautifulSoup example will show how it can easily be used as a Python HTML parser.

BeautifulSoup
Use BeautifulSoup as a Python scraper for HTML pages.

After you download BeautifulSoup, place the BeautifulSoup.py file in the same folder as your Python programs.

The demonstrations in this post will show you how to use a BeautifulSoup example with Python 2, rather than Python 3. The concepts are very similar for both versions of Python, but installation is a bit different.

BeautifulSoup Example for Retrieving Web Pages

Thanks to BeautifulSoup, it is very easy to retrieve web pages, and print all the “href” attributes of the anchor tags. These are essentially the links that go to other web pages. The whole program to do this is shown in the picture below.

BeautifulSoup example as a Python scraper
This program takes user input of an HTML page, and prints all the anchor tags from that page.

The second line in the code pictured above is crucial because it imports all routines in the BeautifulSoup.py file.

The variable “html” (which could be named anything, but calling it html makes sense) holds a string consisting of the entire HTML page.

The variable “soup” becomes an object of parsed HTML data. You can then ask to retrieve certain things from this variable.

How to Print All Anchor Tags in an HTML Document

An anchor tag in HTML looks like <a> </a>, so by passing ‘a’ into the soup object you will get a list of every anchor tag on the page. Reading the “href” attribute of each tag then gives the web address of the page that the anchor tag links to.

This BeautifulSoup example demonstrates its power as a Python scraper, using the “urllib” and “BeautifulSoup” libraries to parse HTML.
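
To make the idea of "collect every anchor tag, then read its href" concrete without installing anything, here is a minimal sketch that swaps in the standard-library html.parser module in place of BeautifulSoup. It is not the BeautifulSoup API. The class name AnchorCollector, the function extract_links, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def extract_links(html):
    """Return the href of each anchor tag found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.links


# Hypothetical sample page, used only for illustration.
sample = ('<p><a href="http://dr-chuck.com/page1.htm">First</a> '
          '<a href="page2.htm">Second</a> <a name="top">No link</a></p>')
for link in extract_links(sample):
    print(link)
```

The third anchor has no href attribute, so it contributes nothing to the output.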

 

Making Sense of HTML Documents – Using Python

This post will focus on making sense of HTML documents that you retrieve from a web server – using Python.

Look at the example pictured below. It displays useful code to retrieve a web page, and print out the content.

You can see HTML tags in the document. These tags are rendered on a web page to give it structure. Learning HTML is a whole other topic.

However, what you will focus on here is parsing through the content using Python, and looking for certain elements within the content.

Retrieve HTML
The purple text is an HTML link to another web page.

In the above picture, look at the string highlighted in purple. This string represents a link to another web page.

You can create a loop that parses out these types of strings, opens each linked page through a handle such as “fhand”, and prints its content. This type of loop could continue until it opens and prints all the content on the internet.

Realistically, your computer would get drained of its memory long before your loop completed parsing through all the web links on the internet, but this concept outlines the beginning of a web crawler. A web crawler employs what is referred to as web scraping.
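
The loop described above can be sketched as follows. This is only an illustration under stated assumptions: the function name crawl, the max_pages limit, and the in-memory dictionary standing in for the web are all hypothetical, and the regular expression only catches absolute links written with double quotes.

```python
import re


def crawl(start_url, fetch, max_pages=10):
    """Follow links page by page, visiting at most max_pages pages."""
    to_visit = [start_url]
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue  # Skip pages that were already opened.
        visited.append(url)
        html = fetch(url)
        # Pull out every absolute link that appears in an href attribute.
        to_visit.extend(re.findall(r'href="(http[^"]+)"', html))
    return visited


# Hypothetical in-memory "web" so the sketch runs without a network.
pages = {
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": ('<a href="http://example.com/a">A</a> '
                             '<a href="http://example.com/c">C</a>'),
    "http://example.com/c": "no links here",
}
print(crawl("http://example.com/a", lambda url: pages.get(url, "")))
```

The max_pages limit is what keeps the loop from running until memory is exhausted; a real crawler would pass a fetch function that actually retrieves each page.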

Web Scraping

The Power of Web Scraping

Web scraping gives you great power. You are literally able to make a copy of the web, or part of it, given enough memory.

Some web servers employ shields, like a captcha for example, to ward off programs, such as Python scripts, from scraping their site. However, a determined scraper can sometimes work around these types of shields. On the other hand, some servers do not care if you scrape their pages.

Why Scrape HTML Documents

You can see that there are many reasons why you may want to scrape the web. You could write Python code that checks for new apartments on Craigslist, for example. You could write Python code to pull social data.

Web scraping provides a way to pull data when there is no application programming interface (API).

Some websites have rules regarding web scraping. Facebook, for example, does not allow it. Facebook does not display public data. You have to be logged in to see anything. So if you did try to scrape their site, your code would have to log you in first, and then Facebook could easily know it’s you scraping.

What next? See how this BeautifulSoup example makes it easy to scrape HTML.

Use Python for Web Scraping

This post will demonstrate how you write Python for web scraping.

Learning the HTTP protocol is fairly complex, but it is simple to apply in Python. The picture demonstrates how to make an HTTP request in Python.

HTTP Request

The line starting with “mysock.connect” is what pushes the socket out across the internet, and connects it to an endpoint.

It is crucial that there is a server there to connect to, or else your code will crash right there at the third line. A crucial difference between connecting with a socket and reading a file is that you can both send and receive data with a socket.

Because you are using the HTTP protocol, and you established the socket connection, it is your responsibility to make the first communication.

The line starting with “mysock.send” makes first communication with a GET request. Once you make the GET request, you can scrape the data you want.

The while loop will receive data 512 characters at a time. If fewer than 512 characters are available, you will still receive them. The loop ends only when a read returns less than one character, which means the server has closed the connection. Running this program should return the following data:

web scraping
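
The send step and the receive loop described above might look roughly like this. It is a sketch, not the program from the picture: the function names send_get and read_all are made up, and the Host header is an addition that the original program may not have sent.

```python
import socket


def send_get(sock, host, path):
    """Send a minimal HTTP/1.0 GET request over a connected socket."""
    request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
    sock.sendall(request.encode())


def read_all(sock, chunk_size=512):
    """Receive up to chunk_size bytes at a time until the peer closes."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if len(data) < 1:
            break  # An empty read means the connection was closed.
        chunks.append(data)
    return b"".join(chunks)
```

Given a socket mysock that is already connected to a web server on port 80, calling send_get(mysock, 'dr-chuck.com', '/page1.htm') and then read_all(mysock) would return the response headers followed by the page, assuming the host is reachable.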

Make the HTTP Request Easier

You might agree that the previous example showed you that it is fairly simple to make an HTTP request with Python. Well, there is a library called “urllib” that makes it even easier.

urllib in Python

The urllib library works like an extra application layer that makes a URL seem like it is just a file.

You can see that using urllib is similar to using a handle to open and read a file.
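
As a minimal sketch of that file-like behavior, the function below loops over a URL handle exactly as it would loop over a file handle. The name read_lines is made up, and the data: URL is only a stand-in so the example runs without a network connection.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2


def read_lines(url):
    """Open a URL like a file and return its lines without line endings."""
    handle = urlopen(url)
    lines = []
    for line in handle:
        # Each line arrives as bytes, so decode it before stripping.
        lines.append(line.decode().rstrip())
    handle.close()
    return lines


# A data: URL carries its own content, so no server is contacted here.
print(read_lines("data:text/plain,first%0Asecond"))
```

Passing a real address such as 'http://dr-chuck.com/page1.htm' would read that page line by line, assuming the host is reachable.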

Make Your Python Socket Talk to the Internet

This post will show you how to make your Python socket talk to an application on another web server.

Once you establish a connection with your socket, you can use Python to browse web data. The most common protocol is HTTP (HyperText Transfer Protocol). HTTP is a set of rules to allow browsers to retrieve web documents from servers over the internet.

python socket
Use this code to establish a socket in Python.

Examining the URL

Look at the URL in your location field or address bar of your web browser. It can be broken down into three parts.

For example, consider the URL http://dr-chuck.com/page1.htm.

  1. The first part is the “http”. This tells you what protocol is being used.
  2. The second part, “dr-chuck.com”, refers to the host you want to talk to.
  3. The last part, “page1.htm”, refers to the file you want to retrieve.
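
The three parts listed above can also be pulled apart in code. This short sketch uses the standard library's urlparse function; note that it reports the file part with its leading slash.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

parts = urlparse("http://dr-chuck.com/page1.htm")
print(parts.scheme)  # The protocol.
print(parts.netloc)  # The host you want to talk to.
print(parts.path)    # The file you want to retrieve.
```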

Every time you click on a link to get a new page on the internet, your browser initiates a request / response cycle to GET the new page. This, in a nutshell, is the act of surfing the web.

Web Surfing
The act of surfing…the web.

Use Python to Access Web Data

This post will discuss how to use Python to access web data.

  • Become familiar with the request and response cycle that your browser does to communicate with servers.
  • Become familiar with protocols that are happening when your browser is working to access data.
  • Know how to write Python programs that can access web data.

A Brief Discussion Regarding The Internet and Networking

The picture below describes the Transmission Control Protocol (TCP). It illustrates the basic method of how information goes back and forth between your computer and destination web servers.

TCP Protocol
The TCP layer of the network architecture serves to handle peer-to-peer connections between your computer and a web server.

Focus mainly on the transport layer of this architecture. This is the peer-to-peer connection between your computer and a web server. Think of it as a telephone call over the internet.

How The TCP Layer Relates to Python

When you talk to someone on your cell phone, you do not worry about how the connection is made. You simply become aware of the connection and start talking.

Use this cell phone analogy when making a socket inside your computer. A socket will allow Python to access web data.

Sockets

When you talk to other applications on the internet, you have to know the specific port number of the application you wish to access. TCP port numbers allow multiple applications to exist on the same server.

You can think of port numbers as extensions within a phone number. There is an IP address, and within that are numbers for various applications that may exist on the same server.

Below is a picture of common TCP port numbers. The one you will use mostly with Python is port 80.

TCP Port Numbers
The most common TCP Port Number you will use with Python is 80.

The Python Socket Library

Python has a socket library that already contains all the code you need to access web data.

There are three lines of code to use when making a socket. These three lines accomplish the following:

  1. Import the library
  2. Establish a socket.
  3. Define the end server.
Sockets in Python
Use these three lines of code when you need to make a socket.
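
The three steps listed above could be written along these lines. This is a sketch: wrapping the steps in a function named open_connection is an illustrative choice, and the host you pass in is up to you.

```python
import socket  # 1. Import the library.


def open_connection(host, port=80):
    """Create a TCP socket and connect it to the given host and port."""
    # 2. Establish a socket.
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # 3. Define the end server (the host name and its port number).
    mysock.connect((host, port))
    return mysock
```

Calling open_connection('dr-chuck.com') would attempt a connection to port 80 on that host, and it raises an error if no server answers there.
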
For more information get the book Introduction to Networking. You can also take an Internet History course.

Practice Regular Expressions with Python Programs

A good way to practice regular expressions is to take some of the Python programs you used before and add Python regular expressions to give them sophistication.

Consider the example line below from the mbox-short.txt file.

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

 

Look at and analyze the following Python regular expression, which will extract the email address.

re.findall('\S+@\S+', x)

 

Assume ‘x’ has been assigned your example line. The expression finds the ‘@’ character, then extends to the left and to the right for as long as it sees non-whitespace characters, stopping at the first space on each side. The ‘\S’ matches any non-whitespace character, and the ‘+’ means one or more of them.

You can fine-tune the expression further. The following will only extract email addresses out of lines that start with ‘From ’:

re.findall('^From (\S+@\S+)', x)

 

The part inside the parentheses, \S+@\S+, is the only part that is returned in the list.
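
Here is a small runnable version of both expressions. The second string, y, is a made-up line that does not start with ‘From ’, added to show the difference between the two patterns.

```python
import re

x = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"
y = "Reply to stephen.marquard@uct.ac.za today"  # Hypothetical line.

# Matches an address anywhere in the line.
print(re.findall(r"\S+@\S+", x))
print(re.findall(r"\S+@\S+", y))

# Matches only when the line starts with 'From ', and returns only
# the part inside the parentheses.
print(re.findall(r"^From (\S+@\S+)", x))
print(re.findall(r"^From (\S+@\S+)", y))
```

The last call prints an empty list because y does not start with ‘From ’.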

What if you only want to extract the domain from the example line?

Pictured below is the Python fundamental way of coding this program.

Extract Domain

You could also code this in a fundamental way using a double split pattern.

Double Split Pattern

Coding this same program with a Python regular expression would result in the following:

re.findall('@([^ ]*)', x)
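
The three approaches can be compared side by side. The find-and-slice and double split versions below are a best guess at what the missing pictures showed, so treat the variable names as assumptions.

```python
import re

x = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"

# 1. Find the '@', find the next space after it, and slice between them.
at_pos = x.find("@")
space_pos = x.find(" ", at_pos)
host_by_slice = x[at_pos + 1:space_pos]

# 2. Double split: split the line into words, then split the address.
email = x.split()[1]
host_by_split = email.split("@")[1]

# 3. Regular expression: capture everything after '@' up to a space.
host_by_regex = re.findall("@([^ ]*)", x)[0]

print(host_by_slice)
print(host_by_split)
print(host_by_regex)
```

All three print the same domain, uct.ac.za.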

 

Always refer to the Python regular expression guide for help with the meaning of the special characters. If you do not want a special character to function with its special meaning, then prefix it with a backslash. For example, ‘\$’ would match a real dollar sign, rather than the end of a line.

Python Regular Expressions

 This post should serve as a basic guide for Python regular expressions.

It is recommended to learn Python basics before you learn Python regular expressions.

Regular expressions, in general, are a language unto themselves. They can be used in many different languages. They involve cryptic, yet very succinct ways of presenting solutions to programming problems. For some, regular expressions will come very naturally, but others will prefer the step-by-step method of writing code. You do not have to know regular expressions, but if you do, you may find them quite fun to work with.

You might want to bookmark a character guide for Python regular expressions.

You can see from the guide that regular expressions are a language of characters. Certain characters have a special meaning, similar to how Python reserved words have special meaning. To make use of Python regular expressions in your program, you first import the “re” module.

Consider these two example lines of text from the mbox-short.txt file.

X-DSPAM-Result:
X-Plane is behind schedule:

Now consider the following code:

import re
lines = open('mbox-short.txt')
for line in lines:
    line = line.rstrip()
    if re.search('^X.*:', line):
        print line

The “if” statement will catch lines that start with (‘^’) ‘X’, followed by any character (‘.’), zero or more times (‘*’), followed by a ‘:’. The ‘X’ and ‘:’ are not special characters; the other characters do have special meaning. This if statement should catch the two example lines of text written above.

Suppose you do not want to catch a line if it has blank spaces, or whitespace (the second example line). You would modify the regular expression as follows:

for line in lines:
    line = line.rstrip()
    if re.search('^X-\S+:', line):
        print line

Now your if statement will only match lines that start with ‘X-’, followed by non-whitespace (‘\S’), one or more times (‘+’), followed by a ‘:’.
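
To see the two patterns side by side without needing the mbox-short.txt file, this sketch runs them over a short list. The text after each colon and the third line are made up for illustration.

```python
import re

# Hypothetical sample lines; only the two 'X' prefixes come from the post.
sample_lines = [
    "X-DSPAM-Result: Innocent",
    "X-Plane is behind schedule: two weeks",
    "From: someone",
]

# Any line that starts with 'X' and contains a colon somewhere after it.
loose = [line for line in sample_lines if re.search(r"^X.*:", line)]

# Only lines where 'X-' is followed by non-whitespace up to the colon.
strict = [line for line in sample_lines if re.search(r"^X-\S+:", line)]

print(loose)
print(strict)
```

The first list contains both ‘X’ lines, while the second contains only the X-DSPAM-Result line.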

Matching and Extracting Data with Python Regular Expressions

The method “re.search()” returns a match object, which counts as True, if the regular expression finds a match. It returns None, which counts as False, if it does not.

Use “re.findall()” if you want matching strings to be extracted.

Consider these four lines of code below, in the Python interpreter.

The ‘[0-9]+’ matches a run of one or more digits. Therefore, the variable ‘y’ holds a Python list of every match found in the parameter, which is ‘x’. So, “re.findall()” extracts matching data and returns a list.
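
A runnable version of those four lines might look like this. The sample sentence assigned to x is an assumption, since the original picture is not available.

```python
import re

x = "My 2 favorite numbers are 19 and 42"  # Hypothetical sample text.
y = re.findall("[0-9]+", x)
print(y)

# When nothing matches, findall returns an empty list.
print(re.findall("[AEIOU]+", x))
```

Notice that the matches come back as strings, not integers.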

It is important to know that matching regular expressions will return the largest possible string, by default. For example:

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print y
['From: Using the :']

Did you notice? The “re.findall()” did not stop at ‘From:’, because that is not the largest possible matching string. This concept is referred to as greedy matching, because it returns the largest possible match. If you wanted to stop at the first colon, then you would need to use non-greedy matching:

>>> y = re.findall('^F.+?:', x)

This regular expression returns the shortest possible match, which here is ‘From:’.
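
The two behaviors are easy to compare in one short script. The variable names greedy and non_greedy are chosen here only for illustration.

```python
import re

x = "From: Using the : character"

# Greedy: '.+' grabs as much as it can, so the match ends at the last colon.
greedy = re.findall("^F.+:", x)

# Non-greedy: '.+?' grabs as little as it can, so the match ends at the
# first colon.
non_greedy = re.findall("^F.+?:", x)

print(greedy)
print(non_greedy)
```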

Python Tuple – Another Python Collection

There are lists, dictionaries, and tuples. These are all Python collections.

A Python tuple is like a non-changeable list. Instead of the square brackets that are used for lists, for tuples you use regular parentheses.

You are not able to use many of the list methods on a Python tuple, because tuples are immutable. For example, you cannot use sort, reverse, or append.

You can use the dir function to check what you can do with tuples, compared to lists.

Python tuple versus list
Compare what you can do with a Python tuple versus a list.
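
The comparison can be done in code as well. This sketch filters out the names that start with an underscore so only the everyday methods remain; the helper name public_methods is made up.

```python
def public_methods(kind):
    """Return the names from dir() that do not start with an underscore."""
    return [name for name in dir(kind) if not name.startswith("_")]


list_methods = public_methods(list)
tuple_methods = public_methods(tuple)

print(tuple_methods)
# Methods that lists have but tuples do not.
print([name for name in list_methods if name not in tuple_methods])
```

Tuples keep only count and index, while methods that change the collection, such as append, sort, and reverse, appear only on lists.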

Why would you use tuples if they are not as capable as lists? You would use a Python tuple because tuples are more efficient and require less memory. You should use Python tuples when creating a collection that is temporary.

A nice thing about Python is you can do two things in one by placing a tuple on both the left and right side of an assignment statement.

Please note, the left-hand side must contain variables. Also, you can omit the parentheses on the left-hand side.
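
A short sketch of tuple assignment, using made-up variable names and values:

```python
# A tuple on both sides assigns two variables in one statement.
(x, y) = (4, "fred")
print(y)

# The parentheses on the left-hand side can be omitted.
a, b = 99, 98
print(a)

# The same idea swaps two variables without a temporary one.
a, b = b, a
print(a)
```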

If you remember, the ‘items’ method for a Python dictionary returns a list of (key, value) pairs. Each pair is a Python tuple, so you can use a tuple as an iteration variable to loop through a dictionary.

Another nice thing about tuples is they are comparable. Comparison operators work with tuples. The first element will be compared first. If they are equal, then Python will move to the next element. It stops when it finds elements that differ.

Comparable Tuples
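
A few comparisons illustrate the rule. The values are arbitrary examples.

```python
# The first elements differ, so the rest is never examined.
print((0, 1, 2) < (5, 1, 2))

# The first elements are equal, so the second elements decide.
print((0, 1, 2000000) < (0, 3, 4))

# Strings compare the same way, element by element.
print(("Jones", "Sally") < ("Jones", "Sam"))
print(("Jones", "Sally") > ("Adams", "Sam"))
```

All four comparisons print True.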

Compare and Sort a Python Tuple

This ability to compare Python tuples is a nice feature, because things that can be compared can also be sorted. You can use the built-in sorted function to do this, like in the following example.

# Create a dictionary. A dictionary cannot be sorted.
>>> d = {'alpha':5, 'charlie':3, 'beta':4}
# Assign to variable x a sorted list of tuples.
>>> x = sorted(d.items())
>>> x
[('alpha', 5), ('beta', 4), ('charlie', 3)]

 

Notice it sorts by the key. You can loop through this to print in sorted key order.

>>> for k, v in sorted(d.items()):
...     print k, v
...
alpha 5
beta 4
charlie 3

 

Do you remember the program for finding the most common word? What if you want to find the five most common words? Rather than sorting by key, you will want to sort by value in descending order.

# Create a dictionary. A dictionary cannot be sorted.
>>> d = {'alpha':5, 'charlie':3, 'beta':4}
# Create a temporary list.
>>> temp_list = list()
# Loop through the dictionary, but append the value first!
>>> for k, v in d.items():
...     temp_list.append( (v,k) )
...
# Sort the list by value.
>>> temp_list.sort()
>>> print temp_list
[(3, 'charlie'), (4, 'beta'), (5, 'alpha')]
# Reverse the sorted order of values.
>>> temp_list.sort(reverse=True)
>>> print temp_list
[(5, 'alpha'), (4, 'beta'), (3, 'charlie')]

 

The following is a program that will find the ten most common words in a text file.

10 most common words program
A Python program for finding the 10 most common words.
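
Since the picture is not reproduced here, the following is a sketch of how such a program could be written with the techniques from this post. The function name most_common_words and the sample sentence are assumptions, and the function takes a string rather than a file so it is easy to try out.

```python
def most_common_words(text, how_many=10):
    """Return the how_many most frequent words as (count, word) tuples."""
    counts = dict()
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

    # Build a list of (count, word) tuples so sorting is by count first.
    pairs = list()
    for word, count in counts.items():
        pairs.append((count, word))

    pairs.sort(reverse=True)
    return pairs[:how_many]


sample = "the cat and the hat and the bat"  # Hypothetical text.
for count, word in most_common_words(sample, 3):
    print("{} {}".format(word, count))
```

To run it on a file, pass in the file's contents, for example most_common_words(open(name).read()). Words with the same count come out in reverse alphabetical order, because the word is the second element of each tuple.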

Once you become comfortable with this program, you can begin to understand ways to condense your code. The concept of list comprehension can make a dynamic list in one line.

# Start with the dictionary.
>>> d = {'alpha':5, 'charlie':3, 'beta':4}
# Use list comprehension to make a dynamic list.
>>> print sorted( [(v, k) for k, v in d.items()] )
[(3, 'charlie'), (4, 'beta'), (5, 'alpha')]

 

The syntax inside the square brackets serves as the list comprehension. It dynamically creates a list of (v, k) pairs as it iterates through the key, value pairs inside the dictionary. This syntax is rather dense, but you can use it as you become more comfortable programming in Python.

Python Dictionary – Powerful Data Collection

A collection, in Python, is like a piece of luggage that we can put things in. A variable is not a collection, because it stores only one value. Once a new value is assigned, the old value goes away. A Python dictionary, however, is considered a collection.

A Python dictionary allows us to store many things. It works like a variable that serves as an aggregate of many values.

The difference between a list and a dictionary is how the values are stored. A list is a linear collection, indexed by a value starting at zero. A Python dictionary is more like a bag of things. The things are not stored in any particular order, but each thing has its own label. We call the label a ‘key’, and the thing is its ‘value’.

The Python dictionary is considered the most powerful data collection in Python. In other programming languages dictionaries go by different names, like associative arrays, hash maps, or property bags.

You can create a Python dictionary as follows:

>>> suitcase = dict()
>>> suitcase['socks'] = 5
>>> suitcase['shirts'] = 3
>>> suitcase['pants'] = 2
>>> print suitcase
{'socks': 5, 'shirts': 3, 'pants': 2}

The socks, shirts, and pants are the ‘keys’ and the quantities are their ‘values’.

>>> suitcase['shirts'] = suitcase['shirts'] + 1
>>> print suitcase['shirts']
4

That’s right! You just added to the value of shirts. However, unlike a Python list, a dictionary in Python 2 does not preserve order. Lists preserve order; Python 2 dictionaries do not (dictionaries only began preserving insertion order in Python 3.7). Therefore, when you print the contents of a dictionary, do not expect it to come out in the same order you added the ‘key’: ‘value’ pairs.

You will get a traceback error if you reference a ‘key’ that is not in your dictionary. You can check to see if the ‘key’ exists with the ‘in’ operator.

>>> print 'underwear' in suitcase
False

You can make an empty dictionary using curly brackets.

>>>empty_dic = {}

A common use for Python dictionaries is counting how often we see something.

counts = dict()
names = ['bob', 'ted', 'bill', 'ted', 'bob']
for name in names:
    if name not in counts:
        counts[name] = 1
    else:
        counts[name] = counts[name] + 1
print counts

The above Python script should print {'bob': 2, 'ted': 2, 'bill': 1}

This pattern is so common that Python has a built-in dictionary method called ‘get()’ that does it for us. For example, counts.get(name, 0) returns the value stored for the name, but if the name does not exist then it returns the default value of zero. It’s a very valuable method.

Using this ‘get()’ method, the above Python script can be condensed as follows:

counts = dict()
names = ['bob', 'ted', 'bill', 'ted', 'bob']
for name in names:
    counts[name] = counts.get(name, 0) + 1
print counts

The following script will count the occurrence of each word in a line of text.

counts = dict()
print 'Enter a line of text:'
line = raw_input('')

words = line.split()
print words

for word in words:
    counts[word] = counts.get(word, 0) + 1

print counts

Another common task is to use a definite loop on Python dictionaries.

for key in counts:
    print key, counts[key]

The key is the actual word, and counts[key] is how many times the word was counted.

You can retrieve lists of keys and values with other built-in methods, for example counts.keys() or counts.values(). There is also counts.items(), which returns both keys and values. Each pair is referred to as a tuple. You can then loop through each key-value pair using two iteration variables.

for x, y in counts.items():
    print x, y

Note, x is the ‘key’ and y is the ‘value’.

Now you should be able to fully understand the following script. It prints the most used word from a text file, along with its count.

name = raw_input('Enter file:')
handle = open(name, 'r')
text = handle.read()
words = text.split()
counts = dict()

for word in words:
    counts[word] = counts.get(word, 0) + 1

bigcount = None
bigword = None
for word,count in counts.items():
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print bigword, bigcount

The Python List – Delve Into Data Science

Knowing how to manipulate a Python List is where you can really delve into data science. A Python list has square brackets. It is a collection wherein we assign multiple values to one variable. It is important to know how to find certain values within your lists.

Lists do not have to be of a single value type. However, converting a list to a numpy array will coerce the list to a single data type. A Python List should be converted to a numpy array if, for example, you want to make a scatter plot.

Lists can exist inside of a list.

You can look up values in a list, similar to how you look up values in a string.

Remember the index operator from the lesson about Python strings?

>>> colors = ['blue', 'green', 'red']
>>> print colors[1]
green

 

Do not forget index values start at zero. That is why the above example returns ‘green’.

However, while strings are immutable, lists are mutable. This is a great feature of lists.

>>> lucky_numbers = [3, 21, 7, 68, 93]
>>> lucky_numbers[4] = 36
>>> print lucky_numbers
[3, 21, 7, 68, 36]

 

See! The value at index 4, which is the fifth item in the list, was changed.

You can use ‘len’ to know the length of a list.

>>> print len(lucky_numbers)
5

 

You can use a range function.

>>> print range(len(lucky_numbers))
[0, 1, 2, 3, 4]

 

Now you know that lucky_numbers has five index positions, 0 through 4. You might want to loop through the list while keeping track of the index.

for i in range(len(lucky_numbers)):
    number = lucky_numbers[i]
    print number

 

You can concatenate lists with the ‘+’ operator.
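
A quick sketch of concatenation, using made-up lists:

```python
a = [1, 2, 3]
b = [4, 5, 6]

# The '+' operator builds a new list; a and b are left unchanged.
c = a + b
print(c)
print(a)
```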

You can slice a Python list. Again, this is similar to strings.

>>> print lucky_numbers[0:3]
[3, 21, 7]

 

There are many built-in Python list methods that do useful things to your list.

For example, you can append to a list.

>>> things = list()
>>> things.append('food')
>>> things.append(5)
>>> print things
['food', 5]

 

Find if something is, or is not, in a list.

>>> 'book' in things
False
>>> 7 not in things
True

 

The ‘sort’ method will force a list to sort itself, alphabetically for example.

There are lots of great built-in functions for lists of numbers, such as max, min, and sum.
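
A brief sketch of these in action, with made-up lists:

```python
friends = ["Joseph", "Glenn", "Sally"]
friends.sort()  # Sorts the list in place, alphabetically for strings.
print(friends)

nums = [3, 41, 12, 9, 74, 15]
print(max(nums))
print(min(nums))
print(sum(nums))
# Convert to float so the average is not truncated in Python 2.
print(float(sum(nums)) / len(nums))
```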

You can loop through an input of numbers, and build a list.

A list building example.
How to loop through user input, and build a list with the input.
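
The picture is not reproduced here, so the following is only a guess at the general shape of such a program. The function name build_number_list and the convention of typing ‘done’ to stop are assumptions. The function takes any sequence of strings, so it can be tried without typing at the keyboard.

```python
def build_number_list(entries):
    """Collect entries as numbers until the word 'done' appears."""
    numbers = list()
    for entry in entries:
        if entry == "done":
            break
        numbers.append(float(entry))
    return numbers


numbers = build_number_list(["3", "9", "5", "done"])
print(numbers)
print(sum(numbers) / len(numbers))
```

In an interactive program, each entry would come from raw_input (or input in Python 3) inside a while loop instead of from a prepared list.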

A very powerful method is ‘split’. This allows us to split a string into a list of words.

>>> lyric = 'three little birds'
>>> words = lyric.split()
>>> print words
['three', 'little', 'birds']

 

Split sees many spaces as just one space. So, if a line has lots of space at the end, then split will discard all that extra space. This is very convenient.
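
A small sketch with a made-up line shows the effect, and contrasts it with passing a single space as the delimiter:

```python
line = "A lot               of spaces   "

# With no argument, runs of whitespace count as one separator and
# leading or trailing whitespace is discarded.
print(line.split())

# With an explicit single-space delimiter, every space is a separator,
# so empty strings appear between consecutive spaces.
print("a  b".split(" "))
```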

Data could consist of no spaces, where every string is delimited by a comma, for example. You would pass the comma in as an argument to split.

>>> jibberish = 'heh,reh,vtv'
>>> jiblist = jibberish.split(',')
>>> print jiblist
['heh', 'reh', 'vtv']