Simple Linear Regression Intuition

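You may recognize the equation for simple linear regression, y = b0 + b1*x, as the equation for a sloped line on an x and y axis. Here y is the dependent variable, x is the independent variable, b0 is the constant, and b1 is the coefficient.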

Simple linear regression involves a dependent variable and a single independent variable. The dependent variable is the outcome you want to explain. For example, the dependent variable could represent salary. You could assume that level of experience will impact salary. So, you would label the independent variable as experience.

The coefficient can be thought of as a multiplier that connects the independent and dependent variables. It translates how much y will be affected by a unit change in x. In other words, a change in x does not usually mean an equal change in y.

In this example, the constant represents the salary the model predicts when experience is zero. This would be the starting salary for someone with no experience.
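To make the roles of the constant and the coefficient concrete, here is a minimal Python sketch. The numbers are hypothetical and chosen only for illustration (a starting salary of 30,000 and 5,000 more per year of experience); they do not come from any real data set.

```python
def predict_salary(years_of_experience, constant=30000.0, coefficient=5000.0):
    """Return the salary predicted by y = constant + coefficient * x.

    The default constant and coefficient are made-up values for illustration.
    """
    return constant + coefficient * years_of_experience


# Print the predicted salary for a few experience levels.
for years in (0, 1, 5):
    print(years, predict_salary(years))
```

With zero experience the prediction equals the constant, and each additional year of experience raises the prediction by exactly the coefficient.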

Suppose you have data from a company’s employees. You could plot the data as red marks, with experience on the x axis and salary on the y axis. Then you would draw a line that “best fits” the data. It’s impossible to connect all the marks with a straight line, so you use a best fitting line. The equation for this line is the result of your simple linear regression. The regression finds the best fitting line. This line is your regression model.

How does Simple Linear Regression find the best fitting line?

The regression model is found by using the ordinary least squares method. Please refer to the following illustration.

First, look at the notation. Notice that the red mark is an actual data point, and the green mark is the value the model predicts for the same level of experience. For the red and green marks in the boxed frame, the actual salary is higher than what the model predicted. So, this employee makes a higher salary than the model predicts. Of course, there are other variables besides experience that can affect salary. But in this case, we keep it simple. Hence the term, simple linear regression.

What the regression analysis does is take the sum of all the squared differences between the actual value and the predicted value. Conceptually, this is done for the many different lines that can “fit” through the data. The line that has the minimum sum of squared differences compared to the other lines is the best fitting line. The equation for this line represents your simple linear regression model.
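The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the data points and the candidate lines are invented, and the function names are hypothetical. The sketch scores a few hand-picked lines by their sum of squared differences. It then computes the minimizing line directly, using the standard closed-form least squares formulas for one independent variable, rather than searching over many lines.

```python
def sum_squared_differences(xs, ys, constant, coefficient):
    """Sum of squared gaps between each actual y and the line's predicted y."""
    return sum((y - (constant + coefficient * x)) ** 2 for x, y in zip(xs, ys))


def fit_simple_linear_regression(xs, ys):
    """Closed-form least squares fit; returns (constant, coefficient)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    coefficient = s_xy / s_xx
    constant = mean_y - coefficient * mean_x
    return constant, coefficient


# Hypothetical data: years of experience and salary in thousands.
experience = [1, 2, 3, 4, 5]
salary = [35, 41, 44, 52, 53]

# A few hand-picked candidate lines as (constant, coefficient) pairs.
candidates = [(30.0, 4.0), (32.0, 5.0), (28.0, 6.0)]
for constant, coefficient in candidates:
    sse = sum_squared_differences(experience, salary, constant, coefficient)
    print(f"candidate y = {constant} + {coefficient}*x: sum of squares = {sse:.2f}")

# The least squares line, computed directly from the data.
best_constant, best_coefficient = fit_simple_linear_regression(experience, salary)
best_sse = sum_squared_differences(experience, salary, best_constant, best_coefficient)
print(f"best fit  y = {best_constant:.2f} + {best_coefficient:.2f}*x: "
      f"sum of squares = {best_sse:.2f}")
```

None of the candidate lines can score lower than the fitted line, because the fitted line is by construction the one with the minimum sum of squared differences for this data.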

A detailed example of a regression model in the R programming language may help to understand this concept.

Machine Learning MOOC Instructor Shares Insights

Machine Learning MOOC instructor Hadelin de Ponteves is the primary instructor of Machine Learning A-Z on Udemy. This blog post is a quick summary of a podcast featuring Hadelin and his perspective on data science.

Before building a Machine Learning MOOC, Hadelin worked at Google and Canal+, a French competitor to Netflix. He stated that his biggest challenge in data science was to build a recommender system for Canal+. Recommender systems are based on an algorithm that suggests which movies the user should watch.

He states he was able to quickly land a job after graduating from college. Many corporations are starting a data science team, so demand is high.

What is Machine Learning?

Machine Learning is a broad field. It can be used to predict the future. It can be used to find an unknown. It covers many sub-fields, and is sometimes also referred to as Artificial Intelligence. Essentially, it involves machines that learn how to do things.

Data science and machine learning go hand in hand. Linear regression is one example of a machine learning technique used in data science. Two other important areas are classification and clustering. Other sub-fields are association rule learning, reinforcement learning, deep learning, and natural language processing.

R and Python are the most commonly used tools. They have great libraries for machine learning. Hadelin’s Machine Learning MOOC covers all sub-fields, and it gives examples in both R and Python.

R vs. Python, Which Is Better?

The debate over R vs. Python is fruitless. The fact is they are both widely used, and each one has its strengths. If you are new to data science, then you should get familiar with both languages. This is the best way to learn which tool you prefer. For example, Hadelin prefers R for visualization. This is the tool he used at Canal+. For deep learning, he prefers Python.

Finally, Hadelin states that the best way to learn is to solve challenges based on real-world problems. His favorite book on the topic is Data Science for Business. Once you have a grasp of data science concepts, this book will add value to your understanding.