Splitting a column in your data frame is necessary when a single column contains multiple variable values. Data usually does not arrive as tidy as we want it, so when one column encodes several features, splitting that column is a must.
Suppose you have columns in your data with headings like these:
The column ‘m014’, for example, represents the number of males in the 0-14 age group.
The first step could be to melt the data. The objective is to have only two distinct columns for gender and age group. If you recall from the post on melting data, the ‘country’ and ‘year’ columns are kept by making them id_vars.
# Melt df and keep country and year fixed
df_melt = pd.melt(df, id_vars=['country', 'year'])
The melt converts the ‘m014’-style columns to rows. A new column named ‘variable’ holds the gender / age-group code, and a new column named ‘value’ holds the number of people in that age group.
Use the pandas .str accessor to slice the gender / age-group strings like so:
# Create the 'gender' column from the first character
df_melt['gender'] = df_melt.variable.str[0]
# Create the 'age_group' column from the remaining characters
df_melt['age_group'] = df_melt.variable.str[1:]
Thus, two more columns are created with the headings ‘gender’ and ‘age_group’. Print the head to see the results.
Data science requires a solid understanding of tidy data principles. A good data scientist can recognize the difference between tidy and untidy data. It takes a bit of practice, but following two principles can help keep it simple.
The Two Tidy Data Principles
Each variable has a separate column.
Each row represents a separate observation.
Shown below are two images, each representing a sample of weather observation data. One is tidy, the other is not. Apply the two tidy data principles above to tell which is which.
In the first image of untidy data, the single column ‘variable’ contains all the variables. Consequently, several rows hold the same observation, one row per variable. The second image corrects the untidy data because there is a separate column for each variable.
The tidy data principles were adopted from a paper by Hadley Wickham as a standard way to organize data for analysis. But sometimes reshaping data away from this standard can make it easier to present in reports. It all depends on the data.
Think of a tidy data set as the standard starting point for analysis and visualization. It is fine to reshape data if needed for a certain purpose.
Melt Away The Tidy Data
Melting data is the process of turning columns into rows. In the above images, the tidy data can be melted with the Pandas pd.melt() function. This is how the tidy image reshapes to the untidy image. Assume the tidy dataframe is called airquality, and the untidy one will be called airquality_melt.
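The code for this step appears only in the post’s images, so here is a minimal sketch of the melt. The column names (Month, Day, Ozone, Wind, Temp) are an assumption based on the classic airquality dataset; the melted output uses pd.melt’s default ‘variable’ and ‘value’ headings, matching the untidy image described above.

```python
import pandas as pd

# A small sample of the tidy airquality data; column names are an
# assumption based on the classic airquality dataset
airquality = pd.DataFrame({
    'Month': [5, 5], 'Day': [1, 2],
    'Ozone': [41.0, 36.0], 'Wind': [7.4, 8.0], 'Temp': [67, 72],
})

# Melt: keep Month and Day fixed; the remaining columns become rows
# under the default 'variable' and 'value' headings
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])
print(airquality_melt)
```

Each of the two original observations now occupies three rows, one per measured variable.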
Pivoting is the opposite of melting: it reshapes the melted rows back into columns, one per variable. After a pivot, the result has a hierarchical index, so reset the index to get a flat data frame back.
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()
Pivot will not work if there are duplicate rows of observations. Duplicates can be dealt with by providing an aggregate function. Use np.mean as the aggregate function to reduce duplicates to their mean value.
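As a sketch of that idea, the snippet below pivots a melted frame that contains a deliberate duplicate observation; the data values here are made up for illustration. pivot_table collapses the duplicates with np.mean, and reset_index flattens the result.

```python
import numpy as np
import pandas as pd

# Melted data with a deliberate duplicate observation (Month 5, Day 1, Ozone)
airquality_melt = pd.DataFrame({
    'Month': [5, 5, 5], 'Day': [1, 1, 2],
    'variable': ['Ozone', 'Ozone', 'Ozone'],
    'value': [40.0, 42.0, 36.0],
})

# Pivot back toward tidy form; np.mean collapses the duplicate rows
airquality_pivot = airquality_melt.pivot_table(
    index=['Month', 'Day'], columns='variable',
    values='value', aggfunc=np.mean)

# Reset the index so Month and Day become ordinary columns again
airquality_pivot = airquality_pivot.reset_index()
print(airquality_pivot)
```

The two duplicate readings of 40.0 and 42.0 are averaged into a single value of 41.0.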
A data wrangling cheat sheet for Python data scientists starts with an initial exploratory analysis. This is the crucial first step in data cleaning. Just as an experienced chess player’s opening moves can be scripted, a data scientist might have several scripted steps to get familiar with the data.
Data Wrangling Cheat Sheet First Steps
Some good commands to start the data analysis are to import pandas, load the data into a data frame, and get an overview of the data. These commands are as follows:
# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv('file_name.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)
The methods above will reveal if the data already meets the standard of tidy data. Note that .shape and .columns are not methods, but attributes. Thus, they don’t require the parentheses.
Next, consider these steps:
# Print the info of df
print(df.info())

# Print the value counts for 'String_Column'
print(df['String_Column'].value_counts(dropna=False))

# Print the description for 'Numeric_Column'
print(df['Numeric_Column'].describe())
The .info() method will indicate if there is missing data. Moreover, it tells the number of rows, number of columns, and the data type for each column.
The .describe() method will calculate summary statistics on numeric data columns.
The .value_counts() method returns the frequency counts for each unique value in a column. This method also has an optional parameter called dropna which is True by default. Set dropna=False so it will provide the frequency counts of missing data as well.
Once a thorough exploration of the data is complete, then use visualization techniques such as scatter plots, histograms, etc.
This post will examine how to parse email data with Python.
You will go step by step through the program emaildb.py.
This program will read through the mbox-short.txt file, count the occurrences of each email address, and store that data in a database.
In the first line of code you will import a library you need to talk to the database.
The second line of code establishes a connection to the database file emaildb.sqlite.
The third line of code creates a cursor object that allows you to make commands to the database.
Next, you will call the execute method on the cursor. This program will create a new table called “Counts” every time it runs. So, first it will drop the table if it exists, and then it will create a new table.
Notice how you are using SQL commands in Python to talk to the database.
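Those opening lines appear only in the post’s images. Here is a sketch consistent with the description; the file and table names come from the text, while the column names (email, count) are an assumption based on how the program is described below.

```python
import sqlite3  # the library you need to talk to the database

# Establish a connection to the database file
conn = sqlite3.connect('emaildb.sqlite')

# Create a cursor object that lets you issue commands to the database
cur = conn.cursor()

# Start fresh each run: drop the Counts table if it exists, then recreate it
cur.execute('DROP TABLE IF EXISTS Counts')
cur.execute('CREATE TABLE Counts (email TEXT, count INTEGER)')
```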
The next several lines of code are similar to what you learned in the Python dictionary post. Here, you are parsing through the file and pulling out the email address.
The next line of code is different. This line uses a technique called parameter substitution.
In this line, you are selecting a row from the table that matches the email. The question mark after email serves as a placeholder for a value that will be substituted in. You are substituting in the current value of the email variable.
Next, you will use a “try” statement. You want to use “try” because your program will blow up if no rows were found that matched the email from the previous line. If the email was found, then you advance the count of that email.
In the “except” statement, you are saving the program from blowing up from the email not being found in the table. Rather, you insert this new email into the table, and start the count at 1.
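A sketch of the lookup and try/except logic described above. It uses an in-memory database and a stand-in address; the real program pulls each address out of mbox-short.txt instead.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory stand-in for emaildb.sqlite
cur = conn.cursor()
cur.execute('CREATE TABLE Counts (email TEXT, count INTEGER)')

email = 'cwen@iupui.edu'  # stand-in for an address parsed from the file

# Parameter substitution: the ? is a placeholder filled from the tuple
cur.execute('SELECT count FROM Counts WHERE email = ?', (email,))
try:
    # fetchone() returns None when no row matched, so None[0] blows up
    count = cur.fetchone()[0]
    cur.execute('UPDATE Counts SET count = count + 1 WHERE email = ?',
                (email,))
except TypeError:
    # The email was not in the table: insert it and start the count at 1
    cur.execute('INSERT INTO Counts (email, count) VALUES (?, 1)', (email,))

conn.commit()  # write the new changes out to the database
```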
Finally, the conn.commit() statement is very important, because it writes your new changes to the database.
The next thing you will do is run a SELECT statement to list the top 10 emails in descending order.
Before closing out the program, you will loop through the rows and print the columns for each row. It is a good idea to convert your column fields to str type before printing.
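A sketch of that closing report, again with an in-memory database and made-up counts standing in for the real data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Counts (email TEXT, count INTEGER)')
cur.executemany('INSERT INTO Counts (email, count) VALUES (?, ?)',
                [('a@x.com', 5), ('b@x.com', 9), ('c@x.com', 2)])

# List the top 10 emails in descending order of count
sqlstr = 'SELECT email, count FROM Counts ORDER BY count DESC LIMIT 10'

rows = []
for row in cur.execute(sqlstr):
    # Convert the column fields to str before printing
    print(str(row[0]), str(row[1]))
    rows.append(row)

cur.close()
```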
If you have not done so already, you need to install an SQL database tool. See picture below.
SQLite Browser reads and writes a special type of file that is optimized for rapid retrieval and storage of data.
Run SQLite Browser and create a new database. Once complete, your screen should display the following:
Now, go ahead and create a table. Use the command on the right side of the picture shown below (assume case sensitivity). It specifies a new table called “Users”, with two columns, each holding variable-length character fields of up to 128 characters. This code also forms a contract with the table: a contract about what is allowed and what is not. Put this command in the Execute SQL window.
It is up to you regarding the terms of the contract, but you need to plan it ahead of time.
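The command from the screenshot is not reproduced in this text. Here is a likely sketch; the column names ‘name’ and ‘email’ are assumptions, since the post only says there are two 128-character columns. The same statement you would paste into the Execute SQL window can also be run from Python with the sqlite3 library.

```python
import sqlite3

# The SQL to paste into the Execute SQL window; the column names
# 'name' and 'email' are assumed, not taken from the screenshot
create_sql = 'CREATE TABLE Users (name VARCHAR(128), email VARCHAR(128))'

# Run the same statement from Python (SQLite Browser uses a .sqlite file
# on disk; an in-memory database is used here for the sketch)
conn = sqlite3.connect(':memory:')
conn.execute(create_sql)
```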
Next, add data (“records”) in your table as shown below.
You can even view the SQL Log to view the code every time you add a new record.
Now, you can learn four SQL commands to insert, update, delete, and retrieve data.
You put the SQL commands in the Execute SQL window.
At any time you can hit the “Browse Data” button to see all your records.
Following is how you delete a row.
This will delete all rows that have the matching email address.
The update statement allows you to find certain records and change their values.
Finally, you can do a retrieval with the statement shown below.
The “SELECT *” command translates to “Select All”.
You can also sort your records.
Below is a summary of all the SQL commands you have learned in this post.
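The summary image is not included here, so as a stand-in, here is a sketch that runs the commands covered in this post through Python’s sqlite3 module. The table matches the “Users” example above, and the names and email addresses are hypothetical values.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Users (name VARCHAR(128), email VARCHAR(128))')

# INSERT: add a record to the table
cur.execute("INSERT INTO Users (name, email) "
            "VALUES ('Chuck', 'csev@umich.edu')")

# UPDATE: find certain records and change their values
cur.execute("UPDATE Users SET name = 'Charles' "
            "WHERE email = 'csev@umich.edu'")

# SELECT * ("select all"), with ORDER BY to sort the records
for row in cur.execute('SELECT * FROM Users ORDER BY name'):
    print(row)

# DELETE: remove all rows with the matching email address
cur.execute("DELETE FROM Users WHERE email = 'csev@umich.edu'")
```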
This covers the basics of SQL databases. The real power will come later when you learn to exploit connections between tables.
It is a good idea to understand the importance of using databases with Python while learning Python. If you are analyzing data or pulling data over the network, it makes sense to store that data in a database. You can then set up a process of pulling data from the database as you need it. Overall, it can speed up your workflow.
Relational databases comprise a whole sub-field of computer science. They are relevant because you can pull out an entry from huge amounts of data in a split-second. It would take you much longer if you had to read through the data.
You need look no further than Oracle to understand the relevance of relational databases. The majority of their revenue comes from database products.
The underlying foundation of databases is rooted in mathematics. This is present in the terminology that experts use to describe databases.
The idea behind relational databases is that you model data around the connections between pieces of data.
However, programmers tend to think of it in terms of rows and columns.
Typically, when you make a table, the first row becomes metadata for the table. You often use the first row to title what each column is for. Therefore, you can refer to this first row as the schema for the table. It sets the rules for each column with regard to what goes there, for example, a string, an integer, etc.
In the early 1960s, database pioneers figured out ways to retrieve data quickly using random access, without having to go through the data sequentially. However, databases were very complex. As a result, a new layer of software evolved called the database application. It allows you, the programmer, to talk with the database by way of the database application. At this point, the industry wanted a standard language for the API between a database and its application. The language the industry agreed on was SQL (Structured Query Language).
SQL is a great language, but it depends on the data being clean. The nice thing about Python is that it deals really well with unstructured data. So together, Python and SQL make a powerful combination.
This post will introduce the concept of inheritance.
Inheritance is a very important concept for object orientation. It is the idea that you can write a class, and borrow stuff from another class. It can also be referred to as extending a class.
Look at the code below, where you have the “PartyAnimal” class and a new class “FootballFan”. “FootballFan” extends “PartyAnimal”. Thus, “FootballFan” has the member variables “x”, “name”, and “points”. Everything from “PartyAnimal” gets pulled into the new class, plus another method is added.
You can see that “FootballFan” is a class which extends “PartyAnimal”. It has all the capabilities of “PartyAnimal” and more.
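The code the post refers to is shown only as an image. Here is a minimal sketch consistent with the description; the constructor details and the method names party() and touchdown() are assumptions beyond the member variables named above.

```python
class PartyAnimal:
    def __init__(self, nam):
        self.x = 0        # member variable inherited by subclasses
        self.name = nam

    def party(self):
        self.x = self.x + 1
        print(self.name, 'party count', self.x)


class FootballFan(PartyAnimal):
    # Extends PartyAnimal: inherits x and name, adds points and a method
    points = 0

    def touchdown(self):
        self.points = self.points + 7
        self.party()  # calls the inherited method
        print(self.name, 'points', self.points)


fan = FootballFan('Jim')
fan.touchdown()
```

Calling touchdown() on a FootballFan exercises both the inherited party() method and the new points variable.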