How to Parse Email Data with Python

This post will examine how to parse email data with Python.

You will go step by step through the program emaildb.py.

This program will read through mbox-short.txt file, count the occurrences of each email, and put that data in a database.

In the first line of code you will import a library you need to talk to the database.

The second line of code establishes a connection to the database file emaildb.sqlite.

The third line of code creates a cursor object that allows you to make commands to the database.

email db

Next, you will call the execute method on the cursor. This program will create a new table called “Counts” every time it runs. So, first it will drop the table if it exists, and then it will create a new table.

drop table

Notice how you are using SQL commands in Python to talk to the database.

The next several lines of code are similar to what you learned in the Python dictionary post. Here, you are parsing through the file and pulling out the email address.parse data

 

The next line of code is different. This line uses a technique called parameter substitution.

parameter sub

In this line, you are selecting a row from the table that match the email. The question mark after email serves as a placeholder for a value that will be substituted in. You are substituting in the current value of the email variable.

Next, you will use a “try” statement. You want to use “try” because your program will blow up if no rows were found that matched the email from the previous line. If the email was found, then you advance the count of that email.

In the “except” statement, you are saving the program from blowing up from the email not being found in the table. Rather, you insert this new email into the table, and start the count at 1.

Finally, the conn.commit() statement is very important, because it writes your new changes to the database.

commit

The next thing you will do is run select statement to list the top 10 emails in descending order.

top 10 emails

Before closing out the program, you will loop through the rows and print the columns for each row. It is a good idea to convert your column fields to str type before printing.

print entries

Make an SQL Database

In this post you will make an SQL database.

If you have not done so already, you need to install an SQL database tool. See picture below.

SQLite

SQLite Browser reads and writes a special type of file that is optimized for rapid retrieval, and storage of data.

Run SQLite Browser and create a new database. Once complete, your screen should display the following:

create database

Now, go ahead and create a table. Use the following command on the right side of the picture shown below (assume case sensitivity). This specifies a new table called “Users”, with two columns, and up to 128 length variable characters. This code also forms a contract with the table. A contract with what is allowed, and not allowed. Put this command in the execute SQL window.

create table

It is up to you regarding the terms of the contract, but you need to plan it ahead of time.

Next, add data (“records”) in your table as shown below.

add records

You can even view the SQL Log to view the code every time you add a new record.

Now, you can learn for SQL commands to insert, delete, and retrieve data.

 

You put the SQL commands in the Execute SQL window.

At any time you can hit the “Browse Data” button to see all your records.

Following is how you delete a row.

delete row

 

This will delete all rows that have the matching email address.

The update statement allows you to find certain records and change their values.

sql update

Finally, you can do a retrieval with the statement shown below.

retrieving records

The “SELECT *” command translate to “Select All”.

You can also sort your records.

sorting records

Below is a summary of all the SQL commands you have learned in this post.

SQL summary

This covers the basics of SQL databases. The real power will come later when you learn to exploit connections between tables.

Database Application SQLite and Python

This post will introduce the database application SQLite in relation to Python.

For large project, there are two main roles. One of those roles in the database administrator, who often consults with the developer (the other role). These are rather specialized jobs.

two roles

The administrator talks directly to the database, while the developer talks to indirectly, by way of the application. The following picture illustrates this split between roles, in large projects.

large project structure

However, for smaller projects, one person can handle both roles.

By handling both these roles you will:

  1. Use the database application SQLite to create tables.
  2. Write Python programs to retrieve, clean, and put cleaned data in the tables.
  3. Write another program to pull the cleaned data out and output a nice file.

For now, you will focus on the first step. You will learn how to create a database model, or contract.

database model

There are several common database systems. Oracle dominated this market, mainly because they were the first to embrace the relational model concept. However, there are good alternatives.

common database systems

 

In the context of Python, you will use SQLite. As it turns out, SQLite is quite popular.

SQLite is Popular

SQLite is fast, and is good for smaller amounts of data. Most importantly, the database application SQLite is embedded in Python.

 

Using Databases with Python

This post will discuss using databases with Python.

It might seem strange to discuss databases in the context of learning Python. However, if you are analyzing data, pulling data from over the network, then it makes sense to store that data in a database. You can then set up a process of pulling data from the database as you need it. Overall, it can speed up your workflow.

Before continuing, you need to download DB Browser for SQLite. http://sqlitebrowser.org/

DB Browser SQLite

Relational databases comprise a whole sub-field of computer science. They are relevant because they allow you to pull out an entry from huge amounts of data in a split-second, whereas it would take you much longer if you had to read through the data.

relational databases

You can look no further than Oracle to understand the relevance of relational databases. The majority of their revenue comes from their database products.

The underlying foundation of databases is rooted in mathematics. This is present in some of the terminology that some people use to describe databases.

database terminology

The idea behind databases is that you model data at a connection point.

database model

However, programmers tend to think of it in terms of rows and columns.

Typically, when you make a table, the first row becomes metadata for the table. You often use the first row to title what each column is for. Therefore, you can refer to this first row as the schema for the table. It sets the rules for each column with regard to what goes there, for example, a string, an integer, etc.

In the early 1960s, the database pioneers figured out ways to quickly retrieve data from random access memory, without having to go through the data sequentially. However, databases were very complex. As a result, a new layer of architecture evolved called the database application. This allows you, the programmer, to talk with the database, by way of the database application. At this point, an industry standard was desired for the language for the API between a database and its appliction. The name of the language the industry agreed on was SQL (Structured Query Language).

metadata

SQL is a great language, but it depends on the data being clean (structured). The nice thing about Python, is it can deal really well with unstructured data. So together, Python + SQL, you have a powerful combination.

SQL

Inheritance – Extending a Class

This post will introduce the concept of inheritance.

Inheritance is a very important concept for object orientation. It is the idea that you can write a class, and borrow stuff from another class. It can also be referred to as extending a class.

Inheritance

Look at the code below where you have the “PartyAnimal” class, and a new class “FootballFan”. “FootballFan” extends “PartyAnimal”. Thus, “FootballFan” has the member variable “x”, “name”, and “points”. Everything from “PartyAnimal” gets pulled into the new class, plus another method is added.

football fan class

You can see that “FootballFan” is a class which extends “PartyAnimal”. It has all the capabilities of “PartyAnimal” and more.