Data Wrangling Cheat Sheet for Python

A data wrangling cheat sheet for python data-scientists starts with an initial exploratory analysis. This is the crucial first step in data cleaning. Just like an experienced chess player’s first moves can be scripted, a data-scientist might have several scripted steps to get familiar with the data.

Data Wrangling Cheat Sheet First Steps

Some good commands to start the data analysis are to import pandas, load the data into a data frame, and get an overview of the data. These commands are as follows:

# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv(‘file_name.csv’)

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

Note that .shape and .columns are not methods, but attributes. Thus, they don’t require the parenthesis.

Next, consider these steps:

# Print the info of df
print(df.info())

# Print the value counts for ‘String_Column’
print(df[‘String_Column’].value_counts(dropna=False))

# Print the description for ‘Numeric_Column’
print(df[‘Numeric_Column’].describe())

The .info() method will indicate if there is missing data. Moreover, it tells the number of rows, number of columns, and the data type for each column.

The .describe() method will calculate summary statistics on numeric data columns.

The .value_counts() method returns the frequency counts for each unique value in a column. This method also has an optional parameter called dropna which is True by default. Set dropna=False so it will provide the frequency counts of missing data as well.

Once a through exploration of the data is complete, then use visualization techniques such as scatter plots, histograms, etc.

Python Function Arguments | Args, Kwargs, Lambda

Python function arguments are variables that get passed during the function call.  Sometimes, the argument can have a default value. This is useful if no argument value is passed during the call.

This python function has a default value.

def shout_echo(word1, echo=1):
echo_word = word1 * echo
shout_word = echo_word + ‘!!!’
return shout_word

The function concatenates copies of a word and three
exclamation marks. It can be called with or without a value for echo. If no value for echo is passed, it assumes the default value of 1.

no_echo = shout_echo(‘Hey’)
with_echo = shout_echo(‘Hey’,5)

Functions can also have an unknown number of arguments.

def gibberish(*args):
hodgepodge = ”
for word in args:
hodgepodge += word
return hodgepodge

And it works, whether there are one or many arguments.

one_word = gibberish(‘luke’)
many_words = gibberish(“luke”, “leia”, “han”, “obi”, “darth”)

This same concept can apply to iterate through key-value pairs.

def report_status(**kwargs):
for key, value in kwargs.items():
print(key + “: ” + value)

Call the function with as many key-value pairs desired.

report_status(name=”luke”, affiliation=”jedi”, status=”missing”)

Finally, there are lambda functions. These are useful for writing anonymous functions on the fly. They are commonly used in the context of the map function.

spells = [“protego”, “accio”, “expecto patronum”, “legilimens”] shout_spells = map(lambda item: item + ‘!!!’, spells)
shout_spells_list = list(shout_spells)

How to Parse Email Data with Python

This post will examine how to parse email data with Python.

You will go step by step through the program emaildb.py.

This program will read through mbox-short.txt file, count the occurrences of each email, and put that data in a database.

In the first line of code you will import a library you need to talk to the database.

The second line of code establishes a connection to the database file emaildb.sqlite.

The third line of code creates a cursor object that allows you to make commands to the database.

email db

Next, you will call the execute method on the cursor. This program will create a new table called “Counts” every time it runs. So, first it will drop the table if it exists, and then it will create a new table.

drop table

Notice how you are using SQL commands in Python to talk to the database.

The next several lines of code are similar to what you learned in the Python dictionary post. Here, you are parsing through the file and pulling out the email address.parse data

 

The next line of code is different. This line uses a technique called parameter substitution.

parameter sub

In this line, you are selecting a row from the table that match the email. The question mark after email serves as a placeholder for a value that will be substituted in. You are substituting in the current value of the email variable.

Next, you will use a “try” statement. You want to use “try” because your program will blow up if no rows were found that matched the email from the previous line. If the email was found, then you advance the count of that email.

In the “except” statement, you are saving the program from blowing up from the email not being found in the table. Rather, you insert this new email into the table, and start the count at 1.

Finally, the conn.commit() statement is very important, because it writes your new changes to the database.

commit

The next thing you will do is run select statement to list the top 10 emails in descending order.

top 10 emails

Before closing out the program, you will loop through the rows and print the columns for each row. It is a good idea to convert your column fields to str type before printing.

print entries

Make an SQL Database

In this post you will make an SQL database.

If you have not done so already, you need to install an SQL database tool. See picture below.

SQLite

SQLite Browser reads and writes a special type of file that is optimized for rapid retrieval, and storage of data.

Run SQLite Browser and create a new database. Once complete, your screen should display the following:

create database

Now, go ahead and create a table. Use the following command on the right side of the picture shown below (assume case sensitivity). This specifies a new table called “Users”, with two columns, and up to 128 length variable characters. This code also forms a contract with the table. A contract with what is allowed, and not allowed. Put this command in the execute SQL window.

create table

It is up to you regarding the terms of the contract, but you need to plan it ahead of time.

Next, add data (“records”) in your table as shown below.

add records

You can even view the SQL Log to view the code every time you add a new record.

Now, you can learn for SQL commands to insert, delete, and retrieve data.

 

You put the SQL commands in the Execute SQL window.

At any time you can hit the “Browse Data” button to see all your records.

Following is how you delete a row.

delete row

 

This will delete all rows that have the matching email address.

The update statement allows you to find certain records and change their values.

sql update

Finally, you can do a retrieval with the statement shown below.

retrieving records

The “SELECT *” command translate to “Select All”.

You can also sort your records.

sorting records

Below is a summary of all the SQL commands you have learned in this post.

SQL summary

This covers the basics of SQL databases. The real power will come later when you learn to exploit connections between tables.

Database Application SQLite and Python

This post will introduce the database application SQLite in relation to Python.

For large project, there are two main roles. One of those roles in the database administrator, who often consults with the developer (the other role). These are rather specialized jobs.

two roles

The administrator talks directly to the database, while the developer talks to indirectly, by way of the application. The following picture illustrates this split between roles, in large projects.

large project structure

However, for smaller projects, one person can handle both roles.

By handling both these roles you will:

  1. Use the database application SQLite to create tables.
  2. Write Python programs to retrieve, clean, and put cleaned data in the tables.
  3. Write another program to pull the cleaned data out and output a nice file.

For now, you will focus on the first step. You will learn how to create a database model, or contract.

database model

There are several common database systems. Oracle dominated this market, mainly because they were the first to embrace the relational model concept. However, there are good alternatives.

common database systems

 

In the context of Python, you will use SQLite. As it turns out, SQLite is quite popular.

SQLite is Popular

SQLite is fast, and is good for smaller amounts of data. Most importantly, the database application SQLite is embedded in Python.