Split a Column in the DataFrame with Pandas and Python

Splitting a column in your DataFrame is necessary when multiple variable values are contained in a single column. Data rarely comes as tidy as we want it. In cases where a single column holds multiple features, splitting that column is a must.

Suppose you have a DataFrame with these column headings:

['country', 'year', 'm014', 'f014', 'm1524', 'f1524']

The column ‘m014’, for example, represents the number of males in the 0-14 age group.

The first step could be to melt the data. The objective is to end up with only two distinct columns for gender and age group. If you recall from the post on melting data, the 'country' and 'year' columns are kept fixed by making them id_vars.

# Melt df and keep country and year fixed
df_melt = pd.melt(df, id_vars=['country', 'year'])

The melt method converts columns like 'm014' into rows. A new column heading of 'variable' holds the gender / age-group code, and a new column heading of 'value' holds the number of people in that group.
Use the pandas .str string accessor to slice the gender / age-group values like so:

# Create the 'gender' column
df_melt['gender'] = df_melt.variable.str[0]

# Create the 'age_group' column
df_melt['age_group'] = df_melt.variable.str[1:]

Thus, two more columns are created with the headings 'gender' and 'age_group'. Print the head to see the results.

# Print the head of df_melt
print(df_melt.head())
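
For reference, here is a minimal, self-contained sketch of the whole workflow. The small DataFrame below is hypothetical sample data invented for illustration; the column headings follow the ones listed above.

import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    'country': ['AD', 'AE'],
    'year': [2000, 2000],
    'm014': [2, 4],
    'f014': [3, 3],
    'm1524': [4, 8],
    'f1524': [14, 12],
})

# Melt df and keep country and year fixed
df_melt = pd.melt(df, id_vars=['country', 'year'])

# Slice the 'variable' strings into gender and age group
df_melt['gender'] = df_melt.variable.str[0]
df_melt['age_group'] = df_melt.variable.str[1:]

print(df_melt.head())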


Tidy Data Principles for your DataFrame

Data science requires a solid understanding of tidy data principles. A good data scientist can recognize the difference between tidy and untidy data. It takes a bit of practice, but following two principles can help keep it simple.

The Two Tidy Data Principles

  1. Each variable has a separate column.
  2. Each row represents a separate observation.

Shown below are two images. Each image represents a sample of weather observation data. One is tidy, the other is not. Apply the aforementioned tidy data principles to tell which is which.

The first data sample is untidy.
The second data sample is tidy.

In the first image of untidy data, the single column 'variable' contains all the variables. Consequently, several rows hold parts of the same observation, one row per variable. The second image corrects the untidy data because there is a separate column for each variable.

Use the first steps of data wrangling when presented with a new data set. This is the best way to detect if the data is already tidy or not.

Untidy Data Isn’t Always Bad

The tidy data principles were adopted from a paper by Hadley Wickham as a standard way to organize data for analysis. But sometimes reshaping data away from this standard can make it present better in reports. It all depends on the data.

Think of a tidy data set as the standard starting point for analysis and visualization. It is fine to reshape data if needed for a certain purpose.

Melt Away The Tidy Data

Melting data is the process of turning columns into rows. In the above images, the tidy data can be melted with the Pandas pd.melt() method. This is how the tidy data reshapes into the untidy form. Assume the tidy DataFrame is called airquality, and call the melted, untidy one airquality_melt.

airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])

Notice the id_vars parameter. This is a list of the columns that should not be melted. The value_vars parameter specifies which columns to melt. If value_vars is not used, every column not listed in id_vars is melted by default.
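
For instance, here is a minimal sketch of value_vars in action. It assumes the airquality DataFrame has measurement columns named 'Ozone' and 'Wind', which is an assumption made only for illustration:

# Melt only the Ozone and Wind columns (hypothetical column names);
# columns not listed in id_vars or value_vars are dropped from the result
ozone_wind_melt = pd.melt(airquality, id_vars=['Month', 'Day'], value_vars=['Ozone', 'Wind'])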

Give Descriptive Names for ‘Variable’ and ‘Value’

By default, the melted columns are named 'variable' and 'value', as in the untidy data image above. Give them more descriptive names with the var_name and value_name parameters.

airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

Pivot is the Opposite of Melt

Use the pivot_table method to get the melted version of the dataframe back to its original state.

airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

Pivot will not work if there are duplicate rows of observations. Duplicates can be dealt with by providing an aggregate function. Use np.mean for the aggregate function to reduce duplicates with the mean value.

airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc=np.mean)

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

Data Wrangling Cheat Sheet for Python

A data wrangling cheat sheet for Python data scientists starts with an initial exploratory analysis. This is the crucial first step in data cleaning. Just like an experienced chess player's first moves can be scripted, a data scientist might have several scripted steps to get familiar with the data.

Data Wrangling Cheat Sheet First Steps

Some good commands to start the data analysis are to import pandas, load the data into a data frame, and get an overview of the data. These commands are as follows:

# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv('file_name.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

The methods above will reveal if the data already meets the standard of tidy data. Note that .shape and .columns are not methods, but attributes. Thus, they don't require the parentheses.

Next, consider these steps:

# Print the info of df
print(df.info())

# Print the value counts for 'String_Column'
print(df['String_Column'].value_counts(dropna=False))

# Print the description for 'Numeric_Column'
print(df['Numeric_Column'].describe())

The .info() method will indicate if there is missing data. Moreover, it tells the number of rows, number of columns, and the data type for each column.

The .describe() method will calculate summary statistics on numeric data columns.

The .value_counts() method returns the frequency counts for each unique value in a column. This method also has an optional parameter called dropna which is True by default. Set dropna=False so it will provide the frequency counts of missing data as well.

Once a thorough exploration of the data is complete, use visualization techniques such as scatter plots, histograms, etc.
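
As a minimal sketch of that visual step, reusing the hypothetical 'Numeric_Column' from above, a histogram is often a quick first look:

# Plot a histogram of the hypothetical numeric column
import matplotlib.pyplot as plt

df['Numeric_Column'].plot(kind='hist')
plt.show()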

Python Function Arguments | Args, Kwargs, Lambda

Python function arguments are variables that get passed during the function call.  Sometimes, the argument can have a default value. This is useful if no argument value is passed during the call.

This python function has a default value.

def shout_echo(word1, echo=1):
    echo_word = word1 * echo
    shout_word = echo_word + '!!!'
    return shout_word

The function concatenates echo copies of a word and adds three exclamation marks. It can be called with or without a value for echo. If no value for echo is passed, it assumes the default value of 1.

no_echo = shout_echo('Hey')
with_echo = shout_echo('Hey', 5)

Functions can also have an unknown number of arguments.

def gibberish(*args):
    hodgepodge = ''
    for word in args:
        hodgepodge += word
    return hodgepodge

And it works whether there is one argument or many.

one_word = gibberish('luke')
many_words = gibberish("luke", "leia", "han", "obi", "darth")

The same concept applies to keyword arguments, which can be used to iterate through key-value pairs.

def report_status(**kwargs):
    for key, value in kwargs.items():
        print(key + ": " + value)

Call the function with as many key-value pairs as desired.

report_status(name="luke", affiliation="jedi", status="missing")

Finally, there are lambda functions. These are useful for writing anonymous functions on the fly. They are commonly used in the context of the map function.

spells = ["protego", "accio", "expecto patronum", "legilimens"]
shout_spells = map(lambda item: item + '!!!', spells)
shout_spells_list = list(shout_spells)
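
Printing the resulting list shows each spell with the exclamation marks appended:

print(shout_spells_list)
# ['protego!!!', 'accio!!!', 'expecto patronum!!!', 'legilimens!!!']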

Python Scatter Plot Example Using Matplotlib

A Python scatter plot example can be used as a reference to build another plot, or to remind us about the proper syntax.

Python scatter plot examples often use the Matplotlib library because it is arguably the most powerful Python library for data visualization. It is usually used in combination with the Python Numpy library.

Suppose you have two Python lists. One is a list of  home prices, and the other list represents the size of the living area. You want to use these lists to see if there is a correlation between the two. This problem calls for a simple linear regression analysis. However, a scatter plot can help infer if there is a strong or weak correlation.

The Python Scatter Plot Example

The list for home prices is:

homeprice = [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000, 129500, 345000, 144000, 279500, 157000, 132000, 149000, 90000, 159000, 139000, 325300]

The list for living area size is:

livearea = [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077, 1040, 2324, 912, 1494, 1253, 854, 1004, 1296, 1114, 1339, 2376]

The next step is to import the libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Inline comments can explain the remaining steps.

# Convert lists to numpy arrays
np_homeprice = np.array(homeprice)
np_livearea = np.array(livearea)

# Set arguments for the x and y axis
plt.scatter(np_livearea, np_homeprice)

# Label x and y axis
plt.xlabel('Living Area Square Footage')
plt.ylabel('Sale Price of Home')

# Give title to plot
plt.title('Home Sale Price vs. Size of Living Area')

# Display the plot
plt.show()

Executing this code will result in a scatter plot.


Play with this example in the interactive Google Colab.

Simple Linear Regression Example in R

A Simple Linear Regression Example can help reinforce the intuition of simple linear regression models. Consider an example of salary vs. years of experience. This is a good example to start with because the results intuitively make sense.

  • The null hypothesis will be that years of experience has no impact on salary.
  • The significance level is set at 5%. This level is arbitrary, but is most often used.
  • If the p-value for the years of experience variable is less than the significance level, then the null hypothesis is rejected.

The first two steps in R for this simple linear regression example are to import the dataset, and split the dataset into a training and test set. The syntax for this is as follows:

dataset = read.csv('Salary_Data.csv')

install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

It is assumed that the csv data file is in the working directory. It contains thirty rows of observations. The caTools package is used to split the dataset, wherein Salary is the dependent variable. The split ratio was set at two-thirds. So, the training set will have twenty observations and the test set will have ten.

Once the previous code is executed, the next step is to fit the simple linear regression to the training set. In essence, this creates the line of best fit.

regressor = lm(formula = Salary ~ YearsExperience,
data = training_set)

In the above syntax, the lm function was used to build the regression model. The two essential parameters, the formula and the data, were passed.

Remember, this model was built with the twenty observations in the training set. The model has no knowledge of the observations in the test set, so the test set can be used to make predictions based on the model. Then, the predictions for the test set can be compared with the actual values in the test set. The predict function is used to make predictions.

y_pred = predict(regressor, newdata = test_set)

It may help to write the predictions to a new csv file:

write.csv(y_pred, file = "salary_predictions.csv")

With a bit of finesse, the ggplot2 library can be used to plot the test set data against the regression model that was built with the training set.

The plotted red dots represent actual data from the test set.

It is easy to see that the regression model does a great job of predicting results from the test set. In some cases, the red data points sit very near the line itself, so some predictions are very accurate.

Using the summary(regressor) function outputs the data we need to determine definitively whether the null hypothesis should be rejected.

In this simple linear regression example, the summary tells us the p-value for the years of experience variable is 1.52e-14. This is far below the significance level of 5%, so the null hypothesis is rejected. In other words, there is a correlation between years of experience and salary.

Furthermore, the summary indicates an R squared value of 0.9649. This is close to 1. It is safe to say the correlation is very strong.

Confusion Matrix in Data Mining Explained

The Confusion Matrix in data mining is used to explain the Type I and Type II errors in your results. These errors are also referred to as false positives and false negatives.


A false positive is when something is predicted to occur but does not occur. A false negative is when something is predicted to not occur, but it does occur.

The common notation is:

  • y for the actual values
  • ŷ (y-hat) for the predicted values

A confusion matrix in data mining can give a quick overview of how the prediction model has performed. It is used to see accuracy in Logistic Regression and K-Nearest Neighbor classification models, for example.

A confusion matrix makes it easy to calculate the accuracy and error rates.

In the example above, the prediction model accurately predicted 35 events that did not occur, and it accurately predicted 50 events that did occur. The test set in this example has 100 events. From this, finding the accuracy rate or the error rate is quite simple.
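
As a quick sketch of that arithmetic (assuming the 35 and 50 counts above are the correct predictions out of the 100 observations):

correct_predictions = 35 + 50   # correctly predicted non-events plus events
total_observations = 100

accuracy_rate = correct_predictions / total_observations   # 0.85
error_rate = 1 - accuracy_rate                              # 0.15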

So, don’t let the name confuse you!

KNN Classifier Algorithm

The KNN Classifier Algorithm (K-Nearest Neighbor) is straightforward and not difficult to understand. Assume you have a dataset and have already identified two categories of data from the set. For the sake of simplicity, let's say the dataset is represented by two columns of data, X1 and X2. The question is, how do you determine which category a new data point belongs to? Does it best fit the first category, or the second?


This is where the K-Nearest Neighbor algorithm comes in. It will help us classify the new data point.


KNN Classifier Algorithm Steps


Typically, the number of neighbors chosen is 5, and the Euclidean distance formula is most commonly used. Other numbers of neighbors and other distance formulas can be used; it's up to the person building the model to decide.

The Euclidean distance formula is most commonly used in the KNN Classifier Algorithm. For two points (x1, y1) and (x2, y2), the distance is d = sqrt((x2 - x1)^2 + (y2 - y1)^2).

As you can see in our example, two of the new data point's nearest neighbors are in the green category and three are in the red category. That exhausts the five neighbors we set for the algorithm, so we classify the new data point in the red category.

In this example, the K-Nearest Neighbor process dictates that the new data point belongs in the red category.

While the K-Nearest Neighbor Algorithm is based on a simple concept, it can make surprisingly accurate predictions.
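
Here is a minimal sketch of the idea in Python using scikit-learn's KNeighborsClassifier. The small X1/X2 data set and the new point are hypothetical, invented only for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: columns X1 and X2, with two known categories
X = np.array([[1.0, 1.2], [1.5, 1.8], [2.0, 1.0],    # category 0 (green)
              [5.0, 5.5], [6.0, 5.0], [5.5, 6.5]])   # category 1 (red)
y = np.array([0, 0, 0, 1, 1, 1])

# Five neighbors and Euclidean distance, as described above
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X, y)

# Classify a hypothetical new data point
new_point = np.array([[4.0, 4.5]])
print(knn.predict(new_point))   # predicts category 1 (red) here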

Logistic Regression Model Intuition

A Logistic Regression Model comes from a statistical analysis in which one or more independent variables determine a binary outcome.

For example, a company sends out mailers asking customers to buy a product. The company has data showing the age of each customer and whether or not they bought the product.

This Logistic Regression Model represents the age of the consumer and whether or not they bought the product.

You can see the data implies that older people are more likely to buy the product.

Can this be modeled? A simple linear regression model will not work well. Moreover, a Linear Regression extends beyond the value of 1, and it would be silly to say there is more than a 100% chance of anything happening.

This is a Linear Regression Model, but it does not fit the data well.

The key to remember for this example is you want to predict probability, and probability ranges from 0 to 1.

Logistic Regression Model Formula

To get the formula for a Logistic Regression Model, you apply the Sigmoid Function to the Simple Linear Regression equation. Start with the linear equation y = b0 + b1*x and the Sigmoid Function p = 1 / (1 + e^(-y)). Solve the Sigmoid Function for y and substitute it into the linear equation, which gives ln(p / (1 - p)) = b0 + b1*x.


Use of the Logistic Regression formula transforms the look of a Linear Regression Model. With this formula you can predict probability.

Take four random values of the independent variable x. Project those values onto the curve. These projections are the fitted values.


This information gives a probability for each value of x. It works slightly differently if you want to make a binary prediction. In this case, you predict whether the customer will buy the product.

To make a prediction, you choose an arbitrary horizontal line; the 50% line is a fair choice. Any projected value on the Logistic Regression Model that falls below this line gets a 'no' prediction, and any value above the line gets a 'yes' prediction.

After predictions are made, a confusion matrix is used to give the accuracy of the predictions.


The second column of the top row gives the number of false positives. This is an outcome that was predicted to happen but in reality did not happen. The second row of the first column shows the number of false negatives.
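
A minimal sketch of this workflow in Python with scikit-learn follows. The ages and purchase outcomes are hypothetical sample data, and the 0.5 threshold matches the 50% line described above:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical data: customer age and whether they bought the product (1 = yes)
ages = np.array([[18], [22], [25], [30], [35], [40], [45], [50], [55], [60]])
bought = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Fit the logistic regression model
model = LogisticRegression()
model.fit(ages, bought)

# Predicted probability of buying, then a yes/no prediction at the 50% line
probabilities = model.predict_proba(ages)[:, 1]
predictions = (probabilities >= 0.5).astype(int)

# Confusion matrix comparing actual outcomes to predictions
print(confusion_matrix(bought, predictions))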

R Squared Value Explained – For Regression Models

The R Squared value is a useful parameter for interpreting statistical results. However, it is often used without a clear understanding of its underlying principles.

The ordinary least squares method is used to find the best-fit line for a simple linear regression model. With this method, you find the sum of all the squared differences between the actual values and the predicted values on the regression line. This sum is computed for every candidate line, and the line with the least sum becomes the regression model because it is the best-fitting line. The sum itself is referred to as the sum of squares of residuals (SSres).

The sum of squares of residuals, SSres = Σ(y − ŷ)², is the sum of all the squared differences between the actual values and the predicted values on the regression line.

Now, consider the average line. For example, in the salary vs. experience example, the average line represents the average salary. If you take the sum of the squared differences between the actual observation points and the corresponding points on the average line, you have what is called the total sum of squares (SStot). Once you have this, you can find the R Squared value.

R² = 1 − SSres / SStot. Understanding how the R Squared value is calculated gives insight into the meaning of its value.

The R Squared Value Close to 1 is Good

Since the ordinary least squares method finds the minimum SSres value, a smaller SSres results in an R² closer to 1. An R² closer to 1 indicates a better-fitting regression line, and it can indicate that the regression model will make better predictions for test data.

To say the R Squared value in words: it is one minus the sum of squares of residuals, divided by the total sum of squares.
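
A minimal sketch of this calculation in Python with NumPy follows; the actual and predicted values are hypothetical, invented only to illustrate the formula:

import numpy as np

# Hypothetical actual observations and predictions from a regression line
y_actual = np.array([39000, 46000, 57000, 63000, 75000])
y_predicted = np.array([41000, 47000, 55000, 64000, 73000])

# Sum of squares of residuals and total sum of squares
ss_res = np.sum((y_actual - y_predicted) ** 2)
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)

# R squared: one minus SSres divided by SStot
r_squared = 1 - ss_res / ss_tot
print(r_squared)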