## Python Scatter Plot Example Using Matplotlib

A Python scatter plot example can serve as a reference for building other plots, or as a quick reminder of the proper syntax.

Python scatter plot examples often use the Matplotlib library because it is arguably the most powerful Python library for data visualization. It is usually used in combination with the NumPy library.

Suppose you have two Python lists. One is a list of home prices, and the other represents the size of each home's living area. You want to use these lists to see whether the two are correlated. Answering that question definitively calls for a simple linear regression analysis, but a scatter plot can quickly suggest whether the correlation is strong or weak.

## The Python Scatter Plot Example

The list for home prices is:

```python
homeprice = [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000, 129500, 345000, 144000, 279500, 157000, 132000, 149000, 90000, 159000, 139000, 325300]
```

The list for living area size is:

```python
livearea = [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077, 1040, 2324, 912, 1494, 1253, 854, 1004, 1296, 1114, 1339, 2376]
```

The next step is to import the libraries.

```python
import numpy as np
import matplotlib.pyplot as plt
```

Inline comments can explain the remaining steps.

```python
# Convert lists to numpy arrays
np_homeprice = np.array(homeprice)
np_livearea = np.array(livearea)

# Set arguments for the x and y axis
plt.scatter(np_livearea, np_homeprice)

# Label x and y axis
plt.xlabel('Living Area Square Footage')
plt.ylabel('Sale Price of Home')

# Give title to plot
plt.title('Home Sale Price vs. Size of Living Area')

# Display the plot
plt.show()
```

Executing this code will display the scatter plot.
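The scatter plot only suggests a relationship; NumPy can also quantify it. Below is a minimal sketch using np.corrcoef on the same two lists. The Pearson coefficient it returns ranges from -1 to 1, with values near 1 indicating a strong positive correlation.

```python
import numpy as np

homeprice = [208500, 181500, 223500, 140000, 250000, 143000, 307000,
             200000, 129900, 118000, 129500, 345000, 144000, 279500,
             157000, 132000, 149000, 90000, 159000, 139000, 325300]
livearea = [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774,
            1077, 1040, 2324, 912, 1494, 1253, 854, 1004, 1296,
            1114, 1339, 2376]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables
r = np.corrcoef(livearea, homeprice)[0, 1]
print(r)
```

A positive value here confirms what the scatter plot shows visually: larger living areas tend to come with higher sale prices.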

## Simple Linear Regression Example in R

A Simple Linear Regression Example can help reinforce the intuition of simple linear regression models. Consider an example of salary vs. years of experience. This is a good example to start with because the results intuitively make sense.

• The null hypothesis will be that years of experience has no impact on salary.
• The significance level is set at 5%. This level is arbitrary, but it is the most commonly used.
• If the p-value for the years of experience variable is less than the significance level, then the null hypothesis is rejected.

The first steps in R for this simple linear regression example are to import the dataset and split it into a training set and a test set. The syntax for this is as follows:

```r
dataset = read.csv('Salary_Data.csv')

install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
```

It is assumed that the CSV data file is in the working directory. It contains thirty rows of observations. The caTools package is used to split the dataset, with Salary as the dependent variable. The split ratio was set at two-thirds, so the training set will have twenty observations and the test set will have ten.

Once the previous code is executed, the next step is to fit the simple linear regression to the training set. In essence, this creates the line of best fit.

```r
regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)
```

In the above syntax, the lm function was used to build the regression model. Two essential arguments were passed: the formula and the data.

Remember, this model was built with the twenty observations in the training set; it has never seen the observations in the test set. The test set can therefore be used to make predictions based on the model, and those predictions can then be compared with the actual values in the test set. The predict function is used to make predictions.

```r
y_pred = predict(regressor, newdata = test_set)
```

It may help to write the predictions to a new csv file:

```r
write.csv(y_pred, file = "salary_predictions.csv")
```

With a bit of finesse, the ggplot2 library can be used to plot the test set data against the regression model that was built with the training set. The red dots represent actual data from the test set.

It is easy to see that the regression model does a great job of predicting results from the test set. In some cases, the red data points sit very near the line itself, so some predictions are very accurate.
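A minimal sketch of that plot is shown below, assuming the ggplot2 package is installed and using the column names from the dataset above (YearsExperience and Salary).

```r
library(ggplot2)

ggplot() +
  # Red dots: actual observations from the test set
  geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
             colour = 'red') +
  # Blue line: regression model fitted on the training set
  geom_line(aes(x = training_set$YearsExperience,
                y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs. Years of Experience (Test Set)') +
  xlab('Years of Experience') +
  ylab('Salary')
```

The line is drawn from predictions on the training set because that is the data the model was fitted to; the red test-set points then show how well unseen data falls along it.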

Calling the summary(regressor) function outputs the data needed to determine definitively whether the null hypothesis should be rejected.

In this simple linear regression example, the summary tells us the p-value for the years of experience variable is 1.52e-14. This falls far below the significance level of 5%, so the null hypothesis is rejected. In other words, there is a correlation between years of experience and salary.

Furthermore, the summary indicates an R-squared value of 0.9649. Since this is close to 1, it is safe to say the correlation is very strong.