Confusion Matrix in Data Mining Explained

The Confusion Matrix in data mining is used to expose the Type I and Type II errors in your results. These errors are also referred to as false positives and false negatives, respectively.

confusion matrix in data mining

A false positive is when something is predicted to occur but does not occur. A false negative is when something is predicted to not occur, but it does occur.

The common notation is:

  • y for the actual values
  • ŷ (y-hat) for the predicted values

A confusion matrix in data mining can give a quick overview of how the prediction model has performed. It is used to see accuracy in Logistic Regression and K-Nearest Neighbor classification models, for example.
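To make the four possible outcomes concrete, here is a minimal sketch that counts them from y and ŷ. The function name `confusion_counts`, the 0/1 encoding (1 means the event occurred), and the sample values are assumptions made for illustration only.

```python
def confusion_counts(y, y_hat):
    """Count the four outcomes for binary labels (1 = occurred, 0 = did not occur)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y, y_hat):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1  # predicted to occur, and it occurred
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1  # false positive (Type I error)
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1  # predicted not to occur, and it did not
        else:
            counts["FN"] += 1  # false negative (Type II error)
    return counts


# Hypothetical actual values (y) and predicted values (y_hat).
y = [0, 0, 1, 1, 0, 1, 0, 1]
y_hat = [0, 1, 1, 0, 0, 1, 0, 1]

counts = confusion_counts(y, y_hat)
print(counts)
```

The two "true" counts are the correct predictions, and the two "false" counts are the errors the matrix is meant to expose.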

Confusion Matrix Accuracy
A confusion matrix makes it easy to calculate the accuracy and error rates.

accuracy rate

In the example above, the prediction model accurately predicted 35 events that did not occur, and it accurately predicted 50 events that did occur. The test set in this example has 100 events. From this, finding the accuracy or error rate is quite simple: the model was right 85 times out of 100, so the accuracy rate is 85% and the error rate is 15%.
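The arithmetic can be sketched in a few lines, using the 35 and 50 correct predictions and the 100 test events from the example. The variable names are made up for illustration.

```python
# Counts taken from the example: two groups of correct predictions
# out of a test set of 100 events.
correct_counts = [35, 50]
total_events = 100

correct = sum(correct_counts)
accuracy_rate = correct / total_events
error_rate = (total_events - correct) / total_events

print(f"accuracy rate = {accuracy_rate:.0%}, error rate = {error_rate:.0%}")
```

The accuracy rate is the share of correct predictions, and the error rate is whatever remains.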

So, don’t let the name confuse you!

KNN Classifier Algorithm

The KNN Classifier Algorithm (K-Nearest Neighbor) is straightforward and not difficult to understand. Assume you have a dataset and have already identified two categories of data from the set. For the sake of simplicity, let’s say the dataset is represented by two columns of data, X1 and X2. The question is, how do you determine which category a new data point would belong to? Does it best fit the first category, or the second?

knn classifier algorithm

This is where the K-Nearest Neighbor algorithm comes in. It will help us classify the new data point.




KNN Classifier Algorithm Steps

knn steps

Typically, the number of neighbors chosen is 5, and the Euclidean distance formula is the one most often used. Other numbers of neighbors and other distance formulas can be used as well. It’s up to the person building the model to decide.

euclidean distance formula
The Euclidean distance formula is most commonly used in the KNN Classifier Algorithm

As you can see in our example, of the five points nearest to the new data point, two are in the green category and three are in the red category. That accounts for all five neighbors we set for the algorithm, and red holds the majority, so we classify the new data point in the red category.

KNN Model
In this example, the K-Nearest Neighbor process assigns the new data point to the red category

While the K-Nearest Neighbor Algorithm is based on a simple concept, it can make some surprisingly accurate predictions.
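The whole process fits in a short sketch: measure the Euclidean distance from the new point to every known point, keep the five nearest, and take the majority category. The function names and data points are hypothetical; they are chosen so that three of the five nearest neighbors are red and two are green, as in the example.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(points, labels, new_point, k=5):
    """Return the majority category among the k nearest neighbors."""
    distances = sorted(
        (euclidean_distance(point, new_point), label)
        for point, label in zip(points, labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical dataset with two columns (X1, X2) and two categories.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5),
          (5.5, 5.5), (6.0, 6.0), (6.5, 5.0), (8.0, 8.0)]
labels = ["green", "green", "green", "green",
          "red", "red", "red", "red"]

category = knn_classify(points, labels, new_point=(4.5, 4.5), k=5)
print(category)
```

An odd number of neighbors, such as 5, avoids ties when there are only two categories.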

Logistic Regression Model Intuition

A Logistic Regression Model is made from statistical analysis in which there are one or more independent variables that determine a binary outcome.

For example, a company sends out mailers advertising a product. The company has data that shows the age of each customer and whether or not they bought the product.

Logistic Regression Model
This Logistic Regression Model represents the age of the consumer and whether they bought the product or not.

You can see the data implies that older people are more likely to buy the product.

Can this be modeled? A simple linear regression model will not work well. The outcome is either 0 or 1, but a straight line extends above 1 and below 0. It would be silly to say there is more than a 100% chance of anything happening.

Linear Regression Model
This is a Linear Regression Model, but it does not fit the data well.

The key point to remember for this example is that you want to predict a probability, and a probability ranges from 0 to 1.

Logistic Regression Model Formula

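Logistic Regression Formula

To get the formula for a Logistic Regression Model, you apply the Sigmoid Function to the Simple Linear Regression equation. Start with the linear equation y = b0 + b1*x and the Sigmoid Function p = 1 / (1 + e^(-y)). Solve the Sigmoid Function for y, which gives y = ln(p / (1 - p)), and substitute that value of y into the linear equation: ln(p / (1 - p)) = b0 + b1*x.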


Using the Logistic Regression formula bends the straight line of a Linear Regression Model into an S-shaped curve that stays between 0 and 1. With this formula you can predict probability.

Take four arbitrary values of the independent variable x. Project the values onto the curve. These projections are the fitted values.
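Here is a minimal sketch of that projection: the linear equation is passed through the Sigmoid Function, and four ages are projected onto the resulting curve. The coefficients b0 and b1 and the four ages are invented for illustration; a real model would estimate the coefficients from the data.

```python
import math


def sigmoid(y):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-y))


def predict_probability(x, b0, b1):
    """Apply the Sigmoid Function to the linear equation b0 + b1*x."""
    return sigmoid(b0 + b1 * x)


# Hypothetical coefficients, not fitted to any real data.
b0, b1 = -6.0, 0.15

ages = [20, 35, 50, 65]  # four arbitrary values of x
fitted_values = [predict_probability(age, b0, b1) for age in ages]

for age, p in zip(ages, fitted_values):
    print(f"age {age}: probability of buying = {p:.3f}")
```

Whatever values of x you pick, the fitted values never leave the range 0 to 1, which is exactly what a probability requires.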

predict probability

This information lets you state a probability for each customer. It works slightly differently if you want to make a binary prediction. In this case, you predict whether or not the customer will buy the product.

To make a prediction, you choose a horizontal cutoff line. The choice is arbitrary, but the 50% line is a fair one. For any projected value on the Logistic Regression Model that falls below this line, you predict no. For any value above the line, you predict yes.
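The cutoff rule can be sketched as follows. The fitted probabilities are hypothetical, and treating a value that lands exactly on the line as a "no" is an assumption, since the rule only covers values above and below it.

```python
def classify(probabilities, threshold=0.5):
    """Turn fitted probabilities into yes/no predictions using a cutoff line."""
    # Assumption: a value exactly on the line is treated as "no".
    return ["yes" if p > threshold else "no" for p in probabilities]


# Hypothetical fitted values read off the logistic curve.
fitted = [0.05, 0.30, 0.65, 0.90]

predictions = classify(fitted)
print(predictions)
```

Raising or lowering the threshold trades false positives against false negatives, which is why the choice of line matters.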

After predictions are made, a confusion matrix is used to give the accuracy of the predictions.

confusion matrix

The second column of the top row gives the number of false positives. This is an outcome that was predicted to happen but in reality did not happen. The first column of the second row shows the number of false negatives: outcomes that were predicted not to happen but did happen.

R Squared Value Explained – For Regression Models

The R Squared value is a useful parameter for interpreting statistical results. However, it is often used without a clear understanding of its underlying principles.

The ordinary least squares method is used for finding the best fit line for a simple linear regression model. With this method, you find the sum of all the squared differences between the actual values and the predicted values on the regression line. This sum is found for each candidate regression line. The line with the smallest sum becomes the regression model, because it’s the best-fitting line. The sum itself can be referred to as the sum of squares of residuals (SSres).

sum of squares of residuals
The sum of squares of residuals is the sum of all the squared differences between the actual values and the predicted values on the regression line.
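As a rough sketch of the idea, the snippet below computes SSres for a few candidate lines and keeps the one with the smallest sum. The data (years of experience and salary in thousands) and the three candidate lines are hypothetical. In practice the best line is found with a closed-form solution rather than by trying lines one by one.

```python
def ss_res(xs, ys, b0, b1):
    """Sum of squared differences between actual values and the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: years of experience and salary (in thousands).
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 45, 50, 60]

# Hypothetical candidate lines, written as (b0, b1).
candidates = {"A": (20.0, 8.0), "B": (21.5, 7.5), "C": (25.0, 6.0)}

sums = {name: ss_res(xs, ys, b0, b1) for name, (b0, b1) in candidates.items()}
best = min(sums, key=sums.get)

print(sums)
print("best fitting candidate:", best)
```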

Now, consider the average line. For example, in the salary vs. experience example, the average line represents the average salary. If you take the sum of the squared differences between the actual observation points and the corresponding points on the average line, then you have what is called the total sum of squares (SStot). Once you have this, you can find the R Squared value.

R Squared Value
Understanding how the R Squared value is calculated will give insight into the meaning of its value.

The R Squared Value Close to 1 is Good

Because the ordinary least squares method finds the line with the minimum SSres value, a smaller SSres relative to SStot brings R² closer to 1. The closer your R² value is to 1, the better the regression line fits the data. It could also indicate that your regression model will make better predictions for test data.

In words, the R Squared value is one minus the ratio of the sum of squares of residuals to the total sum of squares: R² = 1 - SSres / SStot.
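The calculation can be sketched directly from that definition. The actual values and the predictions below are hypothetical, and the function name `r_squared` is just a label for this example.

```python
def r_squared(ys, predictions):
    """R Squared = 1 - SSres / SStot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical actual salaries and the values predicted by a regression line.
ys = [30, 35, 45, 50, 60]
predictions = [29.0, 36.5, 44.0, 51.5, 59.0]

r2 = r_squared(ys, predictions)
print(f"R Squared = {r2:.4f}")

# Predicting the average for every point gives SSres equal to SStot, so R Squared is 0.
average_line = [sum(ys) / len(ys)] * len(ys)
print(r_squared(ys, average_line))
```

A line that does no better than the average line scores 0, and a line that passes through every point scores 1.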

Multiple Linear Regression Intuition

Multiple linear regression works the same as simple linear regression, except for the introduction of more independent variables and their corresponding coefficients.

multiple linear regression

When there are more independent variables, the assumption is there are multiple factors affecting the outcome of the dependent variable. A predictive regressor model can be more accurate if multiple independent variables are known.

Following are some examples:

If the dependent variable is profit, then independent variables could be R&D spending and marketing expenditures.

If the dependent variable is grade, then independent variables could be hours of study and hours of sleep.
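To see how the extra variables and coefficients enter the equation, here is a minimal sketch of a prediction for the profit example. The coefficient values and spending figures are invented for illustration and do not come from any fitted model.

```python
def predict(b0, coefficients, features):
    """Multiple linear regression: y = b0 + b1*x1 + b2*x2 + ..."""
    return b0 + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical coefficients: intercept, then one per independent variable.
b0 = 50000.0
coefficients = [0.8, 0.03]        # R&D spending, marketing expenditures
features = [100000.0, 200000.0]   # one company's spending

profit = predict(b0, coefficients, features)
print(f"predicted profit = {profit:.2f}")
```

With a single coefficient and a single feature, the same function reduces to simple linear regression.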

When to Make Dummy Variables for Multiple Linear Regression

categorical variables

Consider the example where profit is the dependent variable. The challenge is to find how the independent variables affect profit. In the image above, the independent variables have a blue background in the header. The first three independent variables are expenditures. Thus, it’s easy to associate each expenditure with a variable in the equation. But for the state, it’s not that simple, because the state is a category rather than a number. This is called a categorical variable. The approach to use in this instance is to create dummy variables for the categorical variable.

dummy variables

As shown in the above image, you need to create a separate column for each category. You populate each row of a dummy variable column with a 1 if that row in the state column matches the heading of the dummy variable column. Otherwise, you populate that row with a 0. You only need to include one of the dummy variables in your equation. In the image above, we know that if D1 is 1, then the company is in New York. If it’s 0, then the company is in California. We do not lose any information by including only one dummy variable in the equation. This approach may seem biased, because you lose the b4 coefficient when the state is California. But that is not the case. The regression model accounts for California through the b0 coefficient: b0 becomes the baseline for California, and b4 measures the difference between New York and California.
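A small sketch shows both steps: building the D1 column from the state column, and seeing how the constant part of the equation shifts between the two states. The helper names and the values of b0 and b4 are hypothetical.

```python
def make_dummy(values, category):
    """Return a 0/1 column: 1 where the row matches the category, otherwise 0."""
    return [1 if value == category else 0 for value in values]


# Hypothetical state column for four companies.
states = ["New York", "California", "California", "New York"]

d1 = make_dummy(states, "New York")
print(d1)

# Hypothetical coefficients: b0 is the constant, b4 belongs to D1.
b0, b4 = 40000.0, 5000.0


def constant_part(d1_value):
    """The part of the equation that does not depend on the expenditures."""
    return b0 + b4 * d1_value


print("New York:", constant_part(1))    # b0 + b4
print("California:", constant_part(0))  # b0 only
```

California is not ignored. It is the case where D1 is 0, so its effect is carried by b0.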

Avoid The Dummy Variable Trap

dummy variable trap

The problem with including both dummy variables is that it amounts to a duplication of variables. This phenomenon, in which one or more variables predict another, is called multicollinearity. The result is that the regression model will not work as it should, because it is not able to distinguish the effects of one dummy variable from the other. This problem is referred to as the dummy variable trap. The key takeaway is that when building a multiple linear regression model with dummy variables, you should always omit one dummy variable for each categorical variable from the equation. This rule applies irrespective of the number of dummy variables.
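The duplication is easy to see in a sketch: when every category gets its own column, the columns in each row always add up to 1, so any one of them can be computed from the others. The helper `make_dummies` and its `drop_first` flag are made up for this illustration, and the choice to drop the alphabetically first category is arbitrary.

```python
def make_dummies(values, drop_first=True):
    """Build 0/1 columns per category, optionally omitting one to avoid the trap."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical state column for four companies.
states = ["New York", "California", "California", "New York"]

# Keeping every dummy variable: each row sums to 1, so D2 = 1 - D1.
all_columns = make_dummies(states, drop_first=False)
row_sums = [sum(column[i] for column in all_columns.values())
            for i in range(len(states))]
print(row_sums)

# Omitting one dummy variable removes the duplication.
safe_columns = make_dummies(states)
print(safe_columns)
```

The same approach works with more categories: three states would produce two dummy columns, four states would produce three, and so on.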

Finally, there are different ways to build a multiple linear regression model. The common methods are backward elimination, forward selection, and stepwise regression.