Make Your Python Socket Talk to the Internet

This post will show you how to make your Python socket talk to an application on another web server.

Once you establish a connection with your socket, you can use Python to browse web data. The most common protocol for this is HTTP (Hypertext Transfer Protocol), a set of rules that lets browsers retrieve web documents from servers over the internet.

python socket
Use this code to establish a socket in Python.

Examining the URL

Look at the URL in your location field or address bar of your web browser. It can be broken down into three parts.

For example, consider the URL http://dr-chuck.com/page1.htm.

  1. The first part is the “http”. This tells you what protocol is being used.
  2. The second part, “dr-chuck.com”, refers to the host you want to talk to.
  3. The last part, “page1.htm”, refers to the file you want to retrieve.
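
As a rough illustration of this breakdown, the sketch below uses the standard library's urllib.parse.urlsplit to pull the same three parts out of the example URL. The variable names are illustrative assumptions, and note that the path it reports keeps its leading slash.

```python
from urllib.parse import urlsplit

# Split the example URL into its components
parts = urlsplit("http://dr-chuck.com/page1.htm")

protocol = parts.scheme    # the protocol being used
host = parts.netloc        # the host you want to talk to
document = parts.path      # the file you want to retrieve (with leading slash)

print(protocol, host, document)
```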

Every time you click on a link to get a new page on the internet, your browser initiates a request / response cycle to GET the new page. This, in a nutshell, is the act of surfing the web.
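
A minimal sketch of one request / response cycle, done by hand over a socket, might look like the following. The helper names build_get_request and fetch are hypothetical, HTTP/1.0 is assumed so that the server closes the connection once the response is sent, and the example call is left commented out because it needs network access.

```python
import socket


def build_get_request(host, path):
    """Return the bytes of a minimal HTTP/1.0 GET request (hypothetical helper)."""
    request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
    return request.encode("ascii")


def fetch(host, path, port=80):
    """Send one GET request and return the raw response bytes (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((host, port))
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # an empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


# Example call (needs network access):
# print(fetch("dr-chuck.com", "/page1.htm").decode(errors="replace"))
```

The response that comes back contains the HTTP headers first, then a blank line, then the document itself.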

Web Surfing
The act of surfing…the web.

Use Python to Access Web Data


This post will discuss how to use Python to access web data. By the end, you should:

  • Be familiar with the request / response cycle that your browser uses to communicate with servers.
  • Be familiar with the protocols at work when your browser accesses data.
  • Know how to write Python programs that can access web data.

A Brief Discussion Regarding The Internet and Networking

The picture below describes the Transmission Control Protocol (TCP). It illustrates the basic method of how information goes back and forth between your computer and destination web servers.

TCP Protocol
The TCP layer of the network architecture serves to handle peer-to-peer connections between your computer and a web server.

Focus mainly on the transport layer of this architecture. This is the peer-to-peer connection between your computer and a web server. Think of it as a telephone call over the internet.

How The TCP Layer Relates to Python

When you talk to someone on your cell phone, you do not worry about how the connection is made. You simply become aware of the connection and start talking.

Use this cell phone analogy when making a socket inside your computer. A socket will allow Python to access web data.

Sockets

When you talk to other applications on the internet, you have to know the specific port number of the application you wish to access. TCP port numbers allow multiple applications to exist on the same server.

You can think of port numbers as extensions within a phone number. There is an IP address, and within that are numbers for various applications that may exist on the same server.

Below is a picture of common TCP port numbers. The one you will use mostly with Python is port 80.

TCP Port Numbers
The most common TCP Port Number you will use with Python is 80.

The Python Socket Library

Python has a socket library that already contains all the code you need to access web data.

There are three lines of code to use when making a socket. These three lines accomplish the following:

  1. Import the library.
  2. Establish a socket.
  3. Define the end server.
Sockets in Python
Use these three lines of code when you need to make a socket.
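
The image with the three lines is not reproduced here, so the following is only a sketch of what they typically look like. The wrapper function open_socket and the variable name mysock are assumptions for illustration; the three steps themselves are the import, the socket.socket() call, and the connect() call.

```python
import socket                                   # 1. import the library


def open_socket(host, port=80):
    """Return a TCP socket connected to host:port (hypothetical wrapper)."""
    # 2. establish a socket (IPv4 addressing, TCP stream)
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # 3. define the end server: a (host, port) pair
    mysock.connect((host, port))
    return mysock


# Example call (needs network access):
# mysock = open_socket('dr-chuck.com')
# mysock.close()
```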

For more information, get the book Introduction to Networking. You can also take an Internet History course.

Practice Regular Expressions with Python Programs

A good way to practice regular expressions is to take some of the Python programs you used before and add regular expressions to give them sophistication.

Consider the example line below from the mbox-short.txt file.

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

 

Look at and analyze the following Python regular expression, which extracts the email address.

re.findall('\S+@\S+', x)

Assume ‘x’ has been assigned your example line. The expression matches the ‘@’ character together with the non-whitespace characters on either side of it, extending to the left and to the right until it encounters whitespace. The ‘\S’ matches any non-whitespace character, and the ‘+’ means one or more of them.

You can refine the regular expression further. The following will only extract email addresses out of lines that start with ‘From ’:

re.findall('^From (\S+@\S+)', x)

Only the part inside the parentheses, (\S+@\S+), is returned in the list.
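
To see both expressions at work, here is a short sketch run against the example line. The variable names and the second sample line (one that does not start with ‘From ’) are assumptions added for illustration; raw strings (the r prefix) are used so that Python passes the backslashes to the regular expression untouched.

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# Any run of non-whitespace characters around an '@'
emails = re.findall(r'\S+@\S+', x)

# Only on lines that start with 'From '; the parentheses select what is returned
from_emails = re.findall(r'^From (\S+@\S+)', x)

# Hypothetical line that does not start with 'From ': nothing is extracted
other = re.findall(r'^From (\S+@\S+)', 'Author: stephen.marquard@uct.ac.za')

print(emails)
print(from_emails)
print(other)
```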

What if you only want to extract the domain from the example line?

Pictured below is the fundamental Python way of coding this program.

Extract Domain

You could also code this in a fundamental way using a double split pattern.

Double Split Pattern

Coding this same program with a Python regular expression would result in the following:

re.findall('@([^ ]*)', x)
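
Since the pictures for the two fundamental approaches are not reproduced here, the sketch below is a reconstruction of what they might look like, placed next to the regular expression for comparison. The variable names are assumptions, and the find-and-slice version is only one plausible reading of the "fundamental way".

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# 1. Find the '@', find the next space after it, and slice between them
atpos = x.find('@')
sppos = x.find(' ', atpos)
domain_slice = x[atpos + 1:sppos]

# 2. Double split: split on whitespace, then split the address on '@'
email = x.split()[1]
domain_split = email.split('@')[1]

# 3. Regular expression: capture the non-space characters after the '@'
domain_regex = re.findall('@([^ ]*)', x)

print(domain_slice, domain_split, domain_regex)
```

Note that the first two versions produce a string, while re.findall() produces a list containing the string.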

 

Always refer to the Python regular expression guide for help with the meaning of the special characters. If you do not want a special character to function with its special meaning, prepend it with a backslash. For example, ‘\$’ matches a real dollar sign rather than the end of a line.
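
A small sketch of the escaped dollar sign in use follows. The sample sentence is a made-up assumption, not taken from mbox-short.txt.

```python
import re

# Hypothetical sample text containing a price
x = 'We just received $10.00 for cookies.'

# '\$' is a literal dollar sign; '[0-9.]+' is one or more digits or periods
amounts = re.findall(r'\$[0-9.]+', x)

print(amounts)
```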

Python Regular Expressions

This post should serve as a basic guide for Python regular expressions.

It is recommended to learn Python basics before you learn Python regular expressions.

Regular expressions, in general, are a language unto themselves. They can be used in many different programming languages. They involve cryptic, yet very succinct ways of presenting solutions to programming problems. For some, regular expressions will come very naturally, but others will prefer the step-by-step method of writing code. You do not have to know regular expressions, but if you do, you may find them quite fun to work with.


You might want to bookmark a character guide for Python regular expressions.

You can see from the guide that regular expressions are a language of characters. Certain characters have a special meaning, similar to how Python reserved words have special meaning. To use Python regular expressions in your program, import the “re” module, as the code in this post does.

Consider these two example lines of text from the mbox-short.txt file.

X-DSPAM-Result:
X-Plane is behind schedule:

Now consider the following code:

import re
lines = open('mbox-short.txt')
for line in lines:
    line = line.rstrip()
    # Keep lines that start with 'X' and have a ':' somewhere after it
    if re.search('^X.*:', line):
        print(line)

The “if” statement will catch lines that start with (‘^’) ‘X’, followed by any character (‘.’), zero or more times (‘*’), followed by a ‘:’. The ‘X’ and ‘:’ are not special characters; the other characters do have special meaning. This “if” statement should catch both example lines of text written above.

Suppose you do not want to catch a line if it has blank spaces, or whitespace (the second example line). You would modify the regular expression as follows:

for line in lines:
    line = line.rstrip()
    if re.search('^X-\S+:', line):
        print(line)

Now your “if” statement will only match lines that start with ‘X-’, followed by non-whitespace (‘\S’), one or more times (‘+’), followed by a ‘:’.
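
The difference between the two patterns can be checked without the file. In the sketch below, the sample lines are assumptions: the text after each colon is made up for illustration, and the third line is added only to show a line that neither pattern catches.

```python
import re

# Hypothetical sample lines standing in for mbox-short.txt
sample_lines = [
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
    'Subject: unrelated line',
]

# Loose pattern: starts with 'X', then anything, then a ':'
loose = [line for line in sample_lines if re.search(r'^X.*:', line)]

# Stricter pattern: 'X-', then non-whitespace only, then a ':'
strict = [line for line in sample_lines if re.search(r'^X-\S+:', line)]

print(loose)
print(strict)
```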

Matching and Extracting Data with Python Regular Expressions

The method “re.search()” returns a match object if the regular expression finds a match and None if it does not, so it works as a true/false test in an “if” statement.

Use “re.findall()” if you want matching strings to be extracted.

Consider four lines of code, entered in the Python interpreter, that assign a string to ‘x’ and then call “re.findall()” with the pattern ‘[0-9]+’, storing the result in ‘y’.

The ‘[0-9]+’ matches a run of one or more digits. Therefore, the variable ‘y’ holds a Python list of every match found in the parameter ‘x’. So, “re.findall()” extracts matching data and returns a list.
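
As a sketch of the difference between matching and extracting, the code below runs both functions with the same pattern. The sample string is an assumption chosen for illustration, as are the variable names found and missing.

```python
import re

# Hypothetical sample string containing three numbers
x = 'My 2 favorite numbers are 19 and 42'

# findall() extracts every run of digits and returns them as a list of strings
y = re.findall('[0-9]+', x)
print(y)

# search() only reports whether (and where) the first match occurs
found = re.search('[0-9]+', x)                   # a match object
missing = re.search('[0-9]+', 'no digits here')  # None

print(found.group(), missing)
```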

It is important to know that matching regular expressions will return the largest possible string, by default. For example:

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

Did you notice? The “re.findall()” did not stop at ‘From:’, because that is not the largest possible matching string. This concept is referred to as greedy matching, because it returns the largest possible match. If you wanted to stop at the first colon, then you would need to use non-greedy matching:

>>> y = re.findall('^F.+?:', x)

This regular expression returns the shortest matching string, which here is ‘From:’.
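
The two behaviors can be compared side by side. This sketch reuses the example string from above; the variable names greedy and non_greedy are assumptions.

```python
import re

x = 'From: Using the : character'

# Greedy: '.+' grabs as much as it can, so the match runs to the last colon
greedy = re.findall('^F.+:', x)

# Non-greedy: '.+?' grabs as little as it can, so the match stops at the first colon
non_greedy = re.findall('^F.+?:', x)

print(greedy)
print(non_greedy)
```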