An Introduction to Python Objects

This post will introduce a discussion about Python objects.

As the complexity of your programs increases, it’s a good idea to gain an understanding of object-oriented programming. This post will not examine a new skill, but rather introduce terminology that you will need to know.

Python Objects

As your programs get more complex, you will need more complex data structures. Consider the example pictured below, where you construct a list whose items are dictionaries. Each dictionary describes one movie, and together the dictionaries make up the list of movies.

Construct List

Coming up with shapes of data is part of solving programming problems. In the example above, it has been decided that each dictionary in the list of movies will be shaped a certain way. If each dictionary has the same shape, then you can write code that takes advantage of the consistency in shape.

Shape

As you can see in the above program, you will loop through the keys that you expect to be there.
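The idea can be sketched as follows. This is a minimal example that assumes Python 3 and a made-up list of movies; the titles, years, and key names are illustrative and are not taken from the pictured program.

```python
# Hypothetical data: every dictionary has the same keys (the same "shape").
movies = [
    {"title": "Movie A", "year": 1999, "rating": 8.1},
    {"title": "Movie B", "year": 2005, "rating": 7.4},
]

# Because the shape is consistent, one loop can handle every item.
expected_keys = ["title", "year", "rating"]
lines = []
for movie in movies:
    for key in expected_keys:
        lines.append(f"{key}: {movie[key]}")

print("\n".join(lines))
```

If one dictionary were missing a key, the lookup would raise a KeyError, which is exactly why a consistent shape matters.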

In summary, the idea is to find ways to make data structures with consistency.

How to Talk to an Application Program Interface (“API”)

This post will focus on how to talk to an application program interface.

As you talk to APIs or web services, you have to understand how they think. You will need to read the set of rules for the API. The rules will tell you how to interface with the application.

API

There are a couple of choices for web service technologies. SOAP is considered difficult to work with. It is much easier to work with REST.

SOAP

A nice API to learn is the Google Maps Geocoding API. It is always a good idea to read the API documentation.

Run the program geojson.py, and enter “Ann Arbor, MI” for the user input. The program will return the following JSON object:

JSON
This JSON data results from entering the URL for “Ann Arbor, MI” in the Google Geocoding API.

The nice thing about a REST based service is you can take the URL and paste it in a browser. You derive how to put the URL together from the API documentation.

The URL will retrieve JSON that gives you lots of data about the location. You can parse it with the JSON library in Python.

The following picture shows the entire geojson.py program:

longer code

Running the program prompts the user for a location. This example showed the JSON results data for entering “Ann Arbor, MI”.

Notice the program imports the “urllib” library, which gives you power to retrieve data on the internet. The “json” library gives you power to parse data that comes back.

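The “serviceurl” is the base URL you get from reading the API documentation. The location typed by the user cannot be appended to it as-is, because characters such as spaces and commas have to be encoded first, but Python is able to do this automatically. Look at the line that calls the function “urllib.urlencode”. This line of code is what encodes the user input as URL parameters.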
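As a rough illustration of the encoding step, here is a sketch for Python 3, where the function lives at urllib.parse.urlencode rather than urllib.urlencode. The base URL below is an assumption made for the example; take the real one, and any required API key parameter, from the API documentation.

```python
from urllib.parse import urlencode

# Assumed base URL, for illustration only; check the API documentation.
serviceurl = "https://maps.googleapis.com/maps/api/geocode/json?"

address = "Ann Arbor, MI"

# urlencode turns a dictionary into "key=value" pairs that are safe in a URL.
# Spaces become "+" and the comma becomes "%2C".
url = serviceurl + urlencode({"address": address})
print(url)
```

The resulting string is the kind of URL you could also paste into a browser.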

The “try” and “except” statements check whether the data is bad. If the data is bad, the rest of the loop body is skipped and the user is prompted to enter a new location.

The line “print json.dumps(js, indent=4)” will dump the JSON object into a string and print it out nicely with indentation.

The lines of code for “lat” and “lng” are a bit tricky. They index through dictionaries nested inside dictionaries (and a list) in the JSON object.
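The parsing steps described above might look like this in Python 3. The response text here is a hypothetical, heavily shortened stand-in with rounded coordinates; a real response carries many more fields.

```python
import json

# Hypothetical, shortened response text used instead of a live request.
data = '''{
  "status": "OK",
  "results": [
    {"geometry": {"location": {"lat": 42.28, "lng": -83.74}}}
  ]
}'''

# Guard against text that is not valid JSON.
try:
    js = json.loads(data)
except json.JSONDecodeError:
    js = None

if js is None or js.get("status") != "OK":
    print("==== Failure To Retrieve ====")
else:
    # Pretty-print the whole object with four-space indentation.
    print(json.dumps(js, indent=4))

    # Walk down: dictionary -> list -> dictionary -> dictionary.
    location = js["results"][0]["geometry"]["location"]
    lat = location["lat"]
    lng = location["lng"]
    print("lat", lat, "lng", lng)
```

Reading the chain of keys from left to right mirrors the nesting you see in the indented printout.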

The data this API provides can be very valuable, so do not assume the API is always free.

 

SOA Web Services Approach

SOA stands for service-oriented architecture. The SOA web services approach is a way of developing a web application that makes use of services. These services are often RESTful web APIs that return data back to the client. A service could also validate the client, or provide some type of analysis.

A site that can book an airline ticket may also be able to book a hotel room. However, the credit card information is not being shared among different companies. Rather, the site is assembled with different meta-applications that provide various services.

These meta-applications act as services for the large application. The SOA web services approach is when you write an application that makes use of these services.

SOA Web Services
You must know the Application Program Interface (API) to use a web service.

An API exposes functionality to the outside world through a set of industry-accepted standards and protocols.

multiple systems

 

JSON Serialization Format for Python

This post will examine the JSON serialization format (“JavaScript Object Notation”).

XML is good at representing things that may have elements nested within elements, like documents.

JSON is not so great at representing documents, but it is very good at representing many other types of data.

JSON

JSON is a cleaned up version of the constant syntax of JavaScript. In Python, the constant syntax for a Python list looks like this:

my_list = ['item1', 'item2', 'item3']

 

JavaScript uses arrays, instead of lists, but these are just different means to the same end. Also JavaScript has objects, but Python has dictionaries. Because JSON is a cleaned up version of JavaScript, it actually looks very similar to Python. Thus, if you already know Python, it should be very natural to look at JSON.

JSON was defined by Douglas Crockford. Once he published it, people quickly started using it. JSON is now an entire industry within itself. Its pure organic growth is a testament to its usefulness.

JSON has two basic structures: an array and an object. Its biggest advantage is that in Python you tend to make lists and dictionaries, and JSON is a great way to represent those.

Look at the picture of some JSON below. It may seem familiar to you.

JSON Code

  • The data represents an object inside the triple quote syntax (which technically makes it a string).
  • After the first curly bracket, you have a key / value pair followed by a comma.
  • The first key / value pair is “name” : “Chuck”.
  • In the second key / value pair, the value is a whole other object.
  • The key is “phone”, and its value is another object with two key / value pairs.

If you look at the whole outer thing, there are three keys: “name”, “phone”, and “email”.

This is the basic information about how you structure data, but the main thing you need to think about is how to de-serialize the data.

Like many other things, JSON support is built into Python. This is why you start your code with:

import json

 

The next step is to de-serialize from string to internal Python data structure.

info = json.loads(data)

 

The function “loads” means load from string, and “data” is the string that you pass in as the argument.

The really nice part is that “info” is returned as an actual Python dictionary. You pull information out of this dictionary the same as you would any other native Python dictionary. Thus, running this code will result in the following:

Run Code
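Put together, the steps above could be sketched like this in Python 3. The three keys match the description in this post; the phone and email values are placeholders invented for the example.

```python
import json

# A JSON object held in a triple-quoted Python string.
data = '''{
  "name": "Chuck",
  "phone": {"type": "intl", "number": "+1 555 0100"},
  "email": {"hide": "yes"}
}'''

# De-serialize: JSON text in, native Python dictionary out.
info = json.loads(data)

print("Name:", info["name"])
print("Hide:", info["email"]["hide"])
```

Nested objects come back as nested dictionaries, so you reach the inner value by chaining keys.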

JSON Representation of an Array

JSON Data

The array “input” starts with square brackets. This is the same as a list in Python. In this case, “input” is an array of two objects. The objects are inside curly brackets, and separated by a comma.

Examine the following declaration:

info = json.loads(input)

As you may have guessed, this will return a native Python list. As with any list, you can use a “for” loop to iterate through the list items.

Running this program should result in what you would expect.

Program Output
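A small sketch of the array case, assuming Python 3. The two records are hypothetical, and the variable is called input_text here (rather than “input”) only to avoid shadowing Python's built-in input function.

```python
import json

# A JSON array of two objects (hypothetical records).
input_text = '''[
  {"id": "001", "name": "Chuck"},
  {"id": "009", "name": "Brent"}
]'''

# A JSON array de-serializes to a Python list of dictionaries.
info = json.loads(input_text)
print("User count:", len(info))

for item in info:
    print("Name:", item["name"])
    print("Id:", item["id"])
```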

How to Parse XML with Python

This post will focus on how to parse XML with Python.

Fortunately, XML support is built into Python, which makes parsing XML fairly straightforward.

Open the file xml1.py. In this program, the XML data is held in a string. Note that the string is delimited by triple quotes. Single quotes are used because double quotes are part of the XML. The new lines are part of the string.

xml1.py

At the beginning of your code, you should put the following import statement to pull in the XML parsing mechanism.

import xml.etree.ElementTree as ET

Below the data string, you see a line of code as follows:

tree = ET.fromstring(data)

The function “fromstring”, in the ElementTree library, takes the data string and parses it into an object. The object is given the name tree. Now, you can look at the underlying data inside the object.

Below is a screenshot that shows the result if you run this program.

result
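For readers without the file at hand, a program along these lines might look like the sketch below (Python 3). The XML content is a made-up sample, not necessarily what xml1.py contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML held in a triple-single-quoted string,
# so the double quotes around attribute values need no escaping.
data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">+1 555 0100</phone>
  <email hide="yes"/>
</person>'''

# Parse the string into an element object.
tree = ET.fromstring(data)

# find() returns the first matching child; .text is its text content,
# and get() reads an attribute.
print("Name:", tree.find("name").text)
print("Attr:", tree.find("email").get("hide"))
```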

Next, look at the xml2.py program. This code will parse out the list of users.

xml2.py

In this program, the input gets converted to an object called stuff. A list is then created that holds each user in users. Notice a path is specified to find all the users. Next, the length of the list is printed, which tells you the number of users.

After you print the number of users, you can loop through your list of users and print the data you want.
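Here is a sketch of that pattern in Python 3. The tag names, attribute, and user records are assumptions chosen to match the description, not a copy of xml2.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a <users> element holding several <user> elements.
input_text = '''
<stuff>
  <users>
    <user x="2"><id>001</id><name>Chuck</name></user>
    <user x="7"><id>009</id><name>Brent</name></user>
  </users>
</stuff>'''

stuff = ET.fromstring(input_text)

# The path "users/user" is relative to the root element <stuff>.
lst = stuff.findall("users/user")
print("User count:", len(lst))

for item in lst:
    print("Name:", item.find("name").text)
    print("Id:", item.find("id").text)
    print("Attribute:", item.get("x"))
```

Note that findall returns a plain Python list, so len and a “for” loop work on it directly.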

How XML Schema Validates XML

XML Schema is a way to describe what is valid or not valid XML.

XML Schema

XML Schema is used for validation between applications. For example, suppose communication between an airline company and a hotel company suddenly breaks. The XML schema is used to check on which side the mistake was made.

Pictured below is a sample document and schema contract. You can see that the tags in the two match up. However, if the document had a tag name different from the one specified in the contract, it would not get validated.

XML Validation
If XML tags in the document agree with the schema, then the XML will get validated.

In essence, a schema formalizes the relationship between applications. There are many types of XML schema languages, but XSD from W3C tends to be the most common.

Look at the picture below for an example of XSD constraints. Constraints serve to lock-in the contract between applications.

XSD Constraints

You should also be familiar with the various XSD data types.

XSD Data Types

You need to understand the date/time format, so that you will know how to sort it.

Date Format
It is best practice to not change the date format.

It is best to stick with this format when working with dates and time inside a computer.
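The sorting point can be illustrated with a short Python 3 sketch. It assumes the pictured format is the XSD dateTime style (year-month-day, a “T”, then hour:minute:second) and that all timestamps share the same time zone; the sample values are invented.

```python
from datetime import datetime

# Hypothetical timestamps, all in UTC (the trailing "Z").
stamps = [
    "2024-03-01T09:30:00Z",
    "2023-12-25T18:00:00Z",
    "2024-03-01T08:15:00Z",
]

# The largest unit comes first and every field has a fixed width,
# so a plain string sort is also a chronological sort.
by_text = sorted(stamps)

# Cross-check by parsing each string into a datetime before sorting.
by_time = sorted(stamps, key=lambda s: datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ"))

print(by_text)
print(by_text == by_time)
```

A format such as month/day/year would not have this property, which is one reason to leave the format alone.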

Use eXtensible Markup Language – XML

This post will examine when to use eXtensible Markup Language (XML).

XML stands for eXtensible Markup Language. Most programmers would probably prefer JSON, which is the other common wire format, but XML does have advantages in certain circumstances.

XML is good for representing documents. For example, the newer Microsoft Word and PowerPoint file extensions (.docx and .pptx) end in “x”, which stands for XML.

XML

XML is a textual representation of a tree structure with nodes. There are both simple and complex elements. Complex elements have tags within tags. Look at the picture below for an example.

XML Elements
This picture represents the difference between simple and complex elements.

Further, look at another picture for an illustration of more XML basics.

XML Basics
This picture color codes the basics of XML.

Indentation is used just for readability. In other words, white space is generally discarded.

In XML, unlike HTML, you make up the tag and attribute names to be useful in what you are describing.

XML Terminology

Indentation is often used to capture the nesting of elements.

For example:

  • In the picture below, the <a> tag has two child tags, <b> and <c>.
  • These tags are one level down from the root <a> tag. 
  • You could say <a> is the parent of <b> and <c>.
  • Also, <c> is the parent of <d> and <e>.
  • Text nodes and attribute nodes are considered children of the node itself.

XML as a tree

As a Python programmer, you could write code that traverses down tags, and pulls out information.
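A traversal of the tree described above could be sketched like this in Python 3. The tag names follow the bullet list; the text values X, Y, and Z are placeholders.

```python
import xml.etree.ElementTree as ET

# <a> is the root; <b> and <c> are its children; <d> and <e> are children of <c>.
doc = "<a><b>X</b><c><d>Y</d><e>Z</e></c></a>"
root = ET.fromstring(doc)

def walk(elem, depth=0, out=None):
    """Collect (depth, tag, text) for elem and all of its descendants."""
    if out is None:
        out = []
    out.append((depth, elem.tag, (elem.text or "").strip()))
    # Iterating over an element yields its direct children.
    for child in elem:
        walk(child, depth + 1, out)
    return out

nodes = walk(root)
for depth, tag, text in nodes:
    # Indent each tag by its depth to make the nesting visible.
    print("  " * depth + tag, text)
```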

Web Services for Data on the Web

This post will discuss common web services.

Rather than making you retrieve and parse HTML documents, web services offer URLs designed specifically to hand data back to your application.

Web Services

XML and JSON are the two formats commonly used by web services for the data going back and forth across the internet.

The problem is finding a way to send data that different programming languages can agree on. A Python dictionary, for example, is internally different from a Java hashmap, even though these data structures serve the same purpose. A “wire protocol” is how you send data structures from Python in a form that Java can agree on.

Wire Protocol
You send data across the net using a wire protocol.

The need for this wire protocol spawned two new terms.

Serialize is the act of taking an internal data structure, and creating a wire format.

De-Serialize is the act of taking the wire format and creating an internal data structure in a different language.

The wire protocol allows us to create sets of applications that work in different languages. Below is an example of the XML wire format.

XML Wire Format
This is an example of the XML wire format.

The next picture below is an example of the JSON wire format.

JSON Wire Format
This is an example of the JSON wire format.

XML and JSON are the two most common wire formats used for applications to exchange data.
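To make serialize and de-serialize concrete, here is a round-trip sketch in Python 3 using both wire formats. The record and the tag name “person” are hypothetical, and both ends are Python here only to keep the example short; in practice the receiving side could be written in another language.

```python
import json
import xml.etree.ElementTree as ET

# Internal data structure (hypothetical record).
record = {"name": "Chuck", "phone": "555-0100"}

# Serialize to the JSON wire format.
wire_json = json.dumps(record)

# Serialize to an XML wire format: one child element per key.
person = ET.Element("person")
for key, value in record.items():
    ET.SubElement(person, key).text = value
wire_xml = ET.tostring(person, encoding="unicode")

# De-serialize each wire format back into a dictionary.
from_json = json.loads(wire_json)
from_xml = {child.tag: child.text for child in ET.fromstring(wire_xml)}

print(wire_json)
print(wire_xml)
```

Both wire formats are plain text, which is what lets programs in different languages exchange them.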

BeautifulSoup Example as a Python Scraper

This post will give a BeautifulSoup example to demonstrate its usefulness as a Python scraper.

A problem you will encounter with HTML is that while the code might be technically correct, it could be formatted in a very ugly fashion.

Even if you understand HTML, it can be hard to read if the code is ugly.

For example, there could be uneven indentations, inconsistent line spacing, or a host of other bad elements. A BeautifulSoup example will show how it can easily be used as a Python HTML parser.

BeautifulSoup
Use BeautifulSoup as a Python scraper for HTML pages.

After you download BeautifulSoup, place the BeautifulSoup.py file in the same folder as your Python programs. You can download it here.

The demonstrations in this post will show you how to use a BeautifulSoup example with Python 2, rather than Python 3. The concepts are very similar for both versions of Python, but installation is a bit different.

BeautifulSoup Example for Retrieving Web Pages

Thanks to BeautifulSoup, it is very easy to retrieve web pages, and print all the “href” attributes of the anchor tags. These are essentially the links that go to other web pages. The whole program to do this is shown in the picture below.

BeautifulSoup example as a Python scraper
This program takes the address of an HTML page as user input, and prints the links from all the anchor tags on that page.

The second line in the code pictured above is crucial because it imports all routines in the BeautifulSoup.py file.

The variable “html” (which could be called anything, but calling it html makes sense) holds a string consisting of the entire HTML page.

The variable “soup” becomes an object of parsed HTML data. You can then ask to retrieve certain things from this variable.

How to Print All Anchor Tags in an HTML Document

An anchor tag in HTML looks like <a> </a>, so by passing ‘a’ into the soup object you will get a list of the anchor tags in the page. The “href” attribute of each tag is the web address of the actual page that the anchor tag links to.

This BeautifulSoup example demonstrates its power as a Python scraper, using the “urllib” and “BeautifulSoup” libraries to parse HTML.
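If you want to try the same idea without installing anything, the sketch below pulls the “href” attributes out of anchor tags with Python 3's standard html.parser module. This is a stand-in technique, not BeautifulSoup itself, and the HTML string and class name are invented for the example.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href attribute of every anchor tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

# Hypothetical page with uneven formatting and mixed-case tags.
html = '''<html><body>
<p>See <a href="http://example.com/page1">one</a>
     and <A HREF="http://example.com/page2">two</A></p>
</body></html>'''

collector = AnchorCollector()
collector.feed(html)
print(collector.hrefs)
```

The messy spacing and capitalization in the sample do not matter to the parser, which is the same benefit a library like BeautifulSoup gives you on real pages.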

 

Making Sense of HTML Documents – Using Python

This post will focus on making sense of HTML documents that you retrieve from a web server – using Python.

Look at the example pictured below. It displays useful code to retrieve a web page, and print out the content.

You can see HTML tags in the document. These tags are rendered on a web page to give it structure. Learning HTML is a whole other topic.

However, what you will focus on here is parsing through the content using Python, and looking for certain elements within the content.

Retrieve HTML
The purple text is an HTML link to another web page.

In the above picture, look at the string highlighted in purple. This string represents a link to another web page.

You can create a loop that parses out strings of this type, opens each linked page through a “fhand” variable, and prints its content. This type of loop could continue until it opens and prints all the content on the internet.

Realistically, your computer would get drained of its memory long before your loop completed parsing through all the web links on the internet, but this concept outlines the beginning of a web crawler. A web crawler employs what is referred to as web scraping.
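The beginning of such a crawler might be sketched as follows in Python 3. The function names, the regular expression, and the page limit are assumptions made for illustration; only the link extraction is run here, on a made-up HTML string, so no network access is needed.

```python
import re
from urllib.request import urlopen

def extract_links(html):
    """Return every absolute http(s) link found in an href="..." attribute."""
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(start_url, max_pages=5):
    """Follow links starting at start_url, stopping after max_pages pages."""
    to_visit = [start_url]
    seen = set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        fhand = urlopen(url)  # file-like handle for the page
        html = fhand.read().decode("utf-8", errors="replace")
        to_visit.extend(extract_links(html))
    return seen

# Hypothetical page content used instead of a live request.
sample = '<p><a href="http://example.com/a">A</a> <a href="https://example.com/b">B</a></p>'
links = extract_links(sample)
print(links)
```

The max_pages limit is there because, as noted above, an unbounded loop would never finish.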

Web Scraping

The Power of Web Scraping

Web scraping gives you great power. You are literally able to make a copy of the web, or part of it, given enough memory.

Some web servers employ shields, like a captcha for example, to ward off programs such as Python scripts from scraping their site. However, Python can usually outsmart these types of shields. On the other hand, some servers do not care if you scrape their pages.

Why Scrape HTML Documents

Why Web Scrape HTML Documents

You can see that there are many reasons why you may want to scrape the web. You could write Python code that checks for new apartments on Craigslist, for example. You could write Python code to pull social data.

Web scraping provides a way to pull data when there is no application program interface.

Some websites have rules regarding web scraping. Facebook, for example, does not allow it. Facebook does not display public data. You have to be logged in to see anything. So if you did try to scrape their site, your code would have to log you in first, and then Facebook could easily know it’s you scraping.

What next? See how this BeautifulSoup example makes it easy to scrape HTML.