Tidy Data Principles for your DataFrame

Data science requires a solid understanding of tidy data principles. A good data scientist can recognize the difference between tidy and untidy data. It takes a bit of practice, but following two principles can help keep it simple.

The Two Tidy Data Principles

  1. Each variable has a separate column.
  2. Each row represents a separate observation.

Shown below are two images, each a sample of weather observation data. One is tidy, the other is not. Apply the two principles above to tell which is which.

[Image: untidy data] This data sample is untidy.
[Image: tidy data] This data sample is tidy.

In the first image of untidy data, the single column ‘variable’ contains all the variables. Consequently, several rows hold the same observation, one row for each variable. The second image corrects the untidy data because there is a separate column for each variable.
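Since the images may not be available, here is a minimal sketch of the same idea. The values are made up for illustration, and the column names Temp and Wind are assumptions; only Month and Day appear later in this article.

```python
import pandas as pd

# Hypothetical tidy sample: one row per observation (a day), one column per variable.
tidy = pd.DataFrame({
    "Month": [5, 5],
    "Day": [1, 2],
    "Temp": [67.0, 72.0],
    "Wind": [7.4, 8.0],
})

# The same two observations in untidy form: every variable name sits in a
# single 'variable' column, so each day is spread over several rows.
untidy = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 2, 1, 2],
    "variable": ["Temp", "Temp", "Wind", "Wind"],
    "value": [67.0, 72.0, 7.4, 8.0],
})

print(tidy.shape)    # (2, 4)
print(untidy.shape)  # (4, 4)
```

Each day appears once in the tidy frame and twice in the untidy one, once per variable.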

When presented with a new data set, start with the first steps of data wrangling: look at the head, tail, shape, and columns. This is a quick way to detect whether the data is already tidy or not.

Untidy Data Isn’t Always Bad

The tidy data principles were adopted from a paper by Hadley Wickham as a standard way to organize data for analysis. But sometimes reshaping data away from this standard makes it easier to present in reports. It all depends on the data.

Think of a tidy data set as the standard starting point for analysis and visualization. It is fine to reshape data if needed for a certain purpose.

Melt Away The Tidy Data

Melting data is the process of turning columns into rows. In the above images, the tidy data can be melted with the pandas pd.melt() function, which reshapes the tidy sample into the untidy one. Assume the tidy DataFrame is called airquality, and the melted (untidy) result will be called airquality_melt.

airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])

Notice the parameter id_vars. This is a list of columns to not melt. The parameter value_vars specifies the columns to melt. If value_vars is not used, every column not in id_vars will melt by default.
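The difference between the two parameters is easier to see side by side. The sketch below uses a small made-up airquality frame; the Ozone, Wind, and Temp columns and their values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the airquality DataFrame.
airquality = pd.DataFrame({
    "Month": [5, 5],
    "Day": [1, 2],
    "Ozone": [41.0, 36.0],
    "Wind": [7.4, 8.0],
    "Temp": [67, 72],
})

# Melt only Ozone and Wind. Temp is in neither id_vars nor value_vars,
# so it does not appear in the result.
partial_melt = pd.melt(airquality, id_vars=["Month", "Day"],
                       value_vars=["Ozone", "Wind"])

# Without value_vars, every column outside id_vars is melted.
full_melt = pd.melt(airquality, id_vars=["Month", "Day"])

print(partial_melt)
print(full_melt)
```

With two observations, the partial melt has four rows (two variables) and the full melt has six (three variables).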

Give Descriptive Names for ‘Variable’ and ‘Value’

By default, melting produces two generic columns named ‘variable’ and ‘value’, as in the untidy data image above. Give them more descriptive names with the var_name and value_name parameters.

airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')
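A quick way to confirm the renaming is to inspect the resulting column labels. This sketch again assumes a small made-up airquality frame.

```python
import pandas as pd

# Hypothetical stand-in for the airquality DataFrame.
airquality = pd.DataFrame({
    "Month": [5, 5],
    "Day": [1, 2],
    "Ozone": [41.0, 36.0],
    "Temp": [67.0, 72.0],
})

airquality_melt = pd.melt(airquality, id_vars=["Month", "Day"],
                          var_name="measurement", value_name="reading")

print(list(airquality_melt.columns))  # ['Month', 'Day', 'measurement', 'reading']
```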

Pivot Is the Opposite of Melt

Use the pivot_table method to get the melted version of the dataframe back to its original state.

airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()
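To see that the round trip really restores the original values, the following sketch melts a small made-up frame, pivots it back, and compares the result with the starting point. The data and the round_trip_ok name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the airquality DataFrame.
airquality = pd.DataFrame({
    "Month": [5, 5],
    "Day": [1, 2],
    "Ozone": [41.0, 36.0],
    "Wind": [7.4, 8.0],
})

airquality_melt = pd.melt(airquality, id_vars=["Month", "Day"],
                          var_name="measurement", value_name="reading")

airquality_pivot = airquality_melt.pivot_table(
    index=["Month", "Day"], columns="measurement", values="reading"
).reset_index()

# pivot_table leaves the label 'measurement' on the column axis; clear it
# and restore the original column order before comparing.
airquality_pivot.columns.name = None
restored = airquality_pivot[list(airquality.columns)]

# Compare as floats, since the pivoted readings come back as floats.
round_trip_ok = restored.astype(float).equals(airquality.astype(float))
print(round_trip_ok)
```

Note that pivot_table sorts the new columns alphabetically, which is why the original column order is reapplied before the comparison.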

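The plain .pivot() method will not work if there are duplicate rows of observations; it raises an error. The pivot_table method deals with duplicates by applying an aggregate function. Use np.mean as the aggregate function to reduce duplicates to their mean value (the mean is also what pivot_table uses when aggfunc is omitted). Assume the melted DataFrame that contains duplicates is called airquality_dup.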

import numpy as np

airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc=np.mean)

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()
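The sketch below shows the duplicate case on a tiny made-up airquality_dup frame, where the same day has two Temp readings. It passes the string "mean" instead of np.mean, which requests the same aggregation without needing NumPy; the pivot_failed flag is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Hypothetical melted data: May 1 has two 'Temp' readings (a duplicate).
airquality_dup = pd.DataFrame({
    "Month": [5, 5, 5],
    "Day": [1, 1, 1],
    "measurement": ["Temp", "Temp", "Wind"],
    "reading": [66.0, 68.0, 7.4],
})

# .pivot() cannot decide which duplicate to keep, so it raises ValueError.
pivot_failed = False
try:
    airquality_dup.pivot(index=["Month", "Day"], columns="measurement",
                         values="reading")
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table aggregates the duplicates instead: (66 + 68) / 2 = 67.
airquality_pivot = airquality_dup.pivot_table(
    index=["Month", "Day"], columns="measurement", values="reading",
    aggfunc="mean",
).reset_index()

print(airquality_pivot)
```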

Data Wrangling Cheat Sheet for Python

A data wrangling cheat sheet for Python data scientists starts with an initial exploratory analysis. This is the crucial first step in data cleaning. Just like an experienced chess player’s first moves can be scripted, a data scientist might have several scripted steps to get familiar with the data.

Data Wrangling Cheat Sheet First Steps

Some good commands to start the data analysis are to import pandas, load the data into a data frame, and get an overview of the data. These commands are as follows:

# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv('file_name.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

The commands above will reveal whether the data already meets the standard of tidy data. Note that .shape and .columns are not methods, but attributes. Thus, they don’t require parentheses.
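To try these commands without a file on disk, the sketch below reads a small CSV from an in-memory string. The csv_text contents are made up for illustration; with a real file, pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for 'file_name.csv'.
csv_text = "Month,Day,Temp\n5,1,67\n5,2,72\n5,3,74\n"
df = pd.read_csv(io.StringIO(csv_text))

# Methods are called with parentheses.
print(df.head(2))
print(df.tail(2))

# Attributes are read without parentheses.
print(df.shape)          # (3, 3)
print(list(df.columns))  # ['Month', 'Day', 'Temp']
```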

Next, consider these steps:

# Print the info of df (.info() prints its report itself and returns None)
df.info()

# Print the value counts for 'String_Column'
print(df['String_Column'].value_counts(dropna=False))

# Print the description for 'Numeric_Column'
print(df['Numeric_Column'].describe())

The .info() method will indicate if there is missing data. Moreover, it tells the number of rows, number of columns, and the data type for each column.

The .describe() method will calculate summary statistics on numeric data columns.

The .value_counts() method returns the frequency counts for each unique value in a column. This method also has an optional parameter called dropna which is True by default. Set dropna=False so it will provide the frequency counts of missing data as well.
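The effect of dropna is easiest to see on a column that actually contains a missing value. The following sketch uses a small made-up frame that reuses the placeholder column names from above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "String_Column": ["a", "b", "a", None],
    "Numeric_Column": [1.0, 2.0, np.nan, 4.0],
})

# Default behaviour: the missing value is left out of the counts.
counts_default = df["String_Column"].value_counts()

# dropna=False adds a row that counts the missing values.
counts_with_na = df["String_Column"].value_counts(dropna=False)

# describe() skips missing values, so 'count' here is 3, not 4.
summary = df["Numeric_Column"].describe()

print(counts_default)
print(counts_with_na)
print(summary)
```

Comparing the 'count' reported by .describe() with the number of rows is another quick check for missing data.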

Once a thorough exploration of the data is complete, use visualization techniques such as scatter plots and histograms.

Python Scatter Plot Example Using Matplotlib

A Python scatter plot example can be used as a reference to build another plot, or to remind us about the proper syntax.

Python scatter plot examples often use the Matplotlib library because it is arguably the most powerful Python library for data visualization. It is usually used in combination with the NumPy library.

Suppose you have two Python lists. One is a list of home prices, and the other represents the size of the living area. You want to use these lists to see if there is a correlation between the two. A full answer calls for a simple linear regression analysis, but a scatter plot can quickly suggest whether the correlation is strong or weak.

The Python Scatter Plot Example

The list for home prices is:

homeprice = [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000, 129500, 345000, 144000, 279500, 157000, 132000, 149000, 90000, 159000, 139000, 325300]

The list for living area size is:

livearea = [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077, 1040, 2324, 912, 1494, 1253, 854, 1004, 1296, 1114, 1339, 2376]

The next step is to import the libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Inline comments can explain the remaining steps.

# Convert lists to numpy arrays
np_homeprice = np.array(homeprice)
np_livearea = np.array(livearea)

# Set arguments for the x and y axis
plt.scatter(np_livearea, np_homeprice)

# Label x and y axis
plt.xlabel('Living Area Square Footage')
plt.ylabel('Sale Price of Home')

# Give title to plot
plt.title('Home Sale Price vs. Size of Living Area')

# Display the plot
plt.show()

Executing this code will result in a scatter plot.

[Image: Python scatter plot example]
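The scatter plot gives a visual impression only. As a rough numeric companion, the sketch below computes the Pearson correlation coefficient and a least-squares line for the same two lists. Using np.corrcoef and np.polyfit here is a choice made for this illustration, not part of the original example.

```python
import numpy as np

homeprice = [208500, 181500, 223500, 140000, 250000, 143000, 307000,
             200000, 129900, 118000, 129500, 345000, 144000, 279500,
             157000, 132000, 149000, 90000, 159000, 139000, 325300]
livearea = [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077,
            1040, 2324, 912, 1494, 1253, 854, 1004, 1296, 1114, 1339, 2376]

# Pearson correlation coefficient: the off-diagonal entry of the 2x2 matrix.
r = np.corrcoef(livearea, homeprice)[0, 1]

# Degree-1 polynomial fit, i.e. a simple linear regression line
# price = slope * area + intercept.
slope, intercept = np.polyfit(livearea, homeprice, 1)

print(f"correlation coefficient: {r:.2f}")
print(f"fitted line: price = {slope:.1f} * area + {intercept:.1f}")
```

A coefficient close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests a weak one.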

Play with this example in the interactive Google Colab.