Introduction to Linear Regression in Python

Adithyavegi
12 min read · Oct 21, 2020

Linear Regression is one of the most fundamental algorithms in Machine Learning. But before proceeding with the algorithm, let’s first discuss the lifecycle of any machine learning model: creating the model from scratch, then taking the same model further with hyperparameter tuning to increase its accuracy, and finally deciding the deployment strategy for that model. A typical lifecycle diagram for a machine learning model looks like this:

What is a Regression?

Regression in statistics is the process of predicting a label (dependent variable) based on features (independent variables). Regression is used for time-series modelling, for finding causal relationships between variables, and for forecasting. For example, the relationship between a company’s stock price and factors like customer reputation and annual performance can be studied using regression. Regression analyses the relationship between two or more features.

The benefits of using Regression analysis are as follows:

  • It shows the significant relationships between the label (dependent variable) and the features (independent variables).
  • It shows the extent of the impact of multiple independent variables on the dependent variable.
  • It can measure these effects even when the variables are on different scales.

Linear Regression

Linear Regression is one of the most fundamental and widely known Machine Learning Algorithms.

Building blocks of a Linear Regression Model are:

  • Discrete/continuous independent variables
  • A best-fit regression line
  • A continuous dependent variable

That is, a Linear Regression model predicts the dependent variable using a regression line based on the independent variables. The equation of Linear Regression is:

Y = m*X + c + e

where m is the slope of the line, c is the intercept, and e is the error term. The equation above is used to predict the value of the target variable based on the given predictor variable(s).
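To make the equation concrete, here is a tiny sketch; the values of m and c below are made up purely for illustration, not fitted from any data:

# predict Y on the line Y = m*X + c (the error term e is unobserved for new data)
m = 0.05   # slope: change in Y per unit change in X (illustrative value)
c = 7.0    # intercept: value of Y when X = 0 (illustrative value)
X = 100
Y = m * X + c
print(Y)   # 12.0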

The Problem statement:

This dataset records the amount spent on advertising through different channels: TV, Radio and Newspaper. The goal is to predict how the expense on each channel affects sales, and whether there is a way to optimize advertising spend to maximize those sales.


# necessary imports
import pandas as pd                # for data manipulation and analysis
import matplotlib.pyplot as plt    # plotting library for Python and its numerical extension NumPy
import numpy as np                 # high-level mathematical functions that operate on arrays
import pickle                      # for serializing and de-serializing Python object structures
%matplotlib inline


data = pd.read_csv('Advertising.csv')  # reading the data file


data.head()  # checking the first five rows of the dataset

What are the features? (Independent Variables)

  • TV: Advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
  • Radio: Advertising dollars spent on Radio
  • Newspaper: Advertising dollars spent on Newspaper

What is the response? (Dependent Variable)

  • Sales: sales of a single product in a given market (in thousands of widgets)
data.info() gives a concise overview of the dataset:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
Unnamed: 0    200 non-null int64
TV            200 non-null float64
radio         200 non-null float64
newspaper     200 non-null float64
sales         200 non-null float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB
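Alongside data.info(), a quick way to sanity-check the numeric columns is pandas’ describe():

data.describe()  # count, mean, std, min, quartiles and max for each numeric column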

Now, let’s visualize the relationship between the features and the target column.

# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='sales', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='radio', y='sales', ax=axs[1])
data.plot(kind='scatter', x='newspaper', y='sales', ax=axs[2])

Questions about the data

A generic question would be:

How should the company optimize its advertising spend to maximize sales?

This general question might lead you to more specific ones:

  1. What’s the relationship between ads and sales?
  2. How strong is that relationship?
  3. Which ad types contribute to sales?
  4. How much does each ad type contribute to sales?
  5. Can sales be predicted based on the advertising expense?

We will explore these questions below!

From the relationship diagrams above, it can be observed that the relationship between the features (TV and Radio ad spend) and sales is almost linear. A linear relationship typically looks like:

Hence, we can build a model using the Linear Regression Algorithm.

Simple Linear Regression

Simple Linear Regression is a method for predicting a quantitative response using a single feature (“input variable”). The mathematical equation is:

𝑦 = 𝛽0 + 𝛽1*𝑥

What do these terms represent?

  • y is the response or the target variable
  • x is the feature
  • 𝛽1 is the coefficient of x
  • β0 is the intercept

𝛽0 and 𝛽1 are the model coefficients. To create a model, we must “learn” the values of these coefficients. And once we have the value of these coefficients, we can use the model to predict the Sales!
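As a quick preview of what “learning” these coefficients looks like in code, here is a minimal sketch using NumPy’s least-squares polynomial fit (assuming the Advertising data loaded earlier):

# learn beta0 (intercept) and beta1 (slope) by least squares
beta1, beta0 = np.polyfit(data.TV, data.sales, deg=1)  # a degree-1 polynomial is a straight line
print(beta0, beta1)  # close to 7.03 and 0.0475, as the sklearn fit below confirms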

The mathematics involved

Take a quick look at the plot created. Consider each point, and note that each of them has a coordinate of the form (X, Y). Now, draw an imaginary line between each point and the current “best-fit” line. We’ll call the distance between each point and the current best-fit line D. To get a quick image of what we’re trying to visualize, take a look at the picture below:

What elements are present in the diagram?

  • The red points are the observed values of x and y.
  • The blue line is the least squares line.
  • The green lines are the residuals, which is the distance between the observed values and the least squares line.

Let’s look at the underlying assumptions (a short sketch for eyeballing some of them follows this list):

  • The regression model is linear in terms of the coefficients and the error term.
  • The mean of the residuals is zero.
  • The error terms are not correlated with each other, i.e. given one error value, we cannot predict the next one.
  • The independent variables (x) are uncorrelated with the residual term, also termed exogeneity. In layman’s terms, this means that the error term should not be predictable in any way from the values of the independent variables.
  • The error terms have a constant variance, i.e. homoscedasticity.
  • No multicollinearity, i.e. the independent variables should not be correlated with or affect one another. If there is multicollinearity, the precision of prediction by the OLS model decreases.
  • The error terms are normally distributed.
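Some of these assumptions can be eyeballed once a line is fitted. Here is a minimal sketch on the Advertising data, fitting sales against TV with np.polyfit; this is a quick visual diagnostic, not a formal test:

# fit a simple line and inspect its residuals
m_fit, b_fit = np.polyfit(data.TV, data.sales, deg=1)
residuals = data.sales - (m_fit * data.TV + b_fit)

print(residuals.mean())             # should be ~0 (zero-mean residuals)

fig, axs = plt.subplots(1, 2, figsize=(12, 4))
axs[0].scatter(data.TV, residuals)  # look for constant spread (homoscedasticity)
axs[0].set_xlabel('TV')
axs[0].set_ylabel('residual')
axs[1].hist(residuals, bins=20)     # look for a roughly normal shape
plt.show()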

The general equation of a straight line is:

𝑦=𝑚𝑥+𝑏

It means that if we have the values of m and b, we can predict the value of y for any corresponding x. During the construction of a Linear Regression model, the computer calculates the values of m and b that give the best straight line. But the question is:

How do you know this is the best fit line?

The best fit line is obtained by minimizing the residual. Residual is the distance between the actual Y and the predicted Y, as shown below:

Mathematically, the residual is: 𝑟 = 𝑦 − (𝑚𝑥 + 𝑏)

Hence, the sum of the squares of the residuals is:

E = Σᵢ (yᵢ − (m·xᵢ + b))²

As we can see, E is a function of both m and b, so differentiating it partially with respect to m and b gives:

∂E/∂m = −2 Σᵢ xᵢ·(yᵢ − (m·xᵢ + b))
∂E/∂b = −2 Σᵢ (yᵢ − (m·xᵢ + b))

For the best fit line, the residual sum E should be at its minimum. The minimum of a function occurs where its derivative is 0. So, equating the derivatives above to 0 and solving, we get:

m = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²,  b = ȳ − m·x̄
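These closed-form expressions are easy to verify in code. A minimal NumPy sketch, again assuming the data DataFrame loaded earlier:

# closed-form least squares for a single feature
x, y = data.TV.values, data.sales.values
x_bar, y_bar = x.mean(), y.mean()
m = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
b = y_bar - m * x_bar
print(m, b)   # ~0.0475 and ~7.03, matching the sklearn fit below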

Ideally, if we had an equation in one dependent and one independent variable, the minimum would look as follows:

But as the residual’s minimum depends on two variables, m and b, the loss surface becomes a paraboloid, and the appropriate m and b can also be found iteratively using Gradient Descent, as shown below:


The new values for ‘slope’ and ‘intercept’ are calculated as follows:

θ₀ := θ₀ − α·(1/m)·Σᵢ (h(xᵢ) − yᵢ)
θ₁ := θ₁ − α·(1/m)·Σᵢ (h(xᵢ) − yᵢ)·xᵢ

where θ₀ is the ‘intercept’, θ₁ is the ‘slope’, α is the learning rate, m here is the total number of observations, h(xᵢ) = θ₀ + θ₁·xᵢ is the prediction, and the term after the Σ sign is the gradient of the loss. Google’s TensorFlow Playground offers learning rates between 0.00001 and 10; generally, a smaller learning rate is recommended to avoid overshooting while training a model.
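To make the update rule concrete, here is a minimal gradient-descent sketch for the same one-feature problem. The learning rate, iteration count, and feature scaling are illustrative choices, not prescribed values (scaling simply speeds up convergence):

# gradient descent for sales ~ theta0 + theta1 * x
x, y = data.TV.values, data.sales.values
x_scaled = (x - x.mean()) / x.std()       # a scaled feature converges much faster

theta0, theta1 = 0.0, 0.0                 # arbitrary starting point
alpha = 0.1                               # illustrative learning rate

for _ in range(1000):
    error = (theta0 + theta1 * x_scaled) - y      # h(x) - y for every observation
    theta0 -= alpha * error.mean()                # step along -d(loss)/d(theta0)
    theta1 -= alpha * (error * x_scaled).mean()   # step along -d(loss)/d(theta1)

# convert the coefficients back to the original TV scale
slope = theta1 / x.std()
intercept = theta0 - slope * x.mean()
print(intercept, slope)   # approaches ~7.03 and ~0.0475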

R² statistic

The R-squared statistic provides a measure of fit. It takes the form of a proportion (the proportion of variance explained), and so it always takes a value between 0 and 1. In simple words, it represents how much of the variation in our data is being explained by our model. For example, an R² statistic of 0.75 says that our model explains 75% of the variance in the data. Similarly, a value of 0 means none of the variance is explained, and a value of 1 represents 100%. Mathematically, the R² statistic is calculated as:

R² = 1 − RSS/TSS

where RSS is the Residual Sum of Squares, given as:

RSS = Σᵢ (yᵢ − ŷᵢ)²

RSS is the residual (error) term we have been talking about so far. And TSS is the Total Sum of Squares, given as:

TSS = Σᵢ (yᵢ − ȳ)²

TSS is calculated by treating the horizontal line through the mean value of y as the best fit line: just as with RSS, we compute the error term against that mean line, and the result is TSS.

The closer the value of R² is to 1, the better the model fits our data. If R² comes out below 0 (which is a possibility), the model is so bad that it performs even worse than simply predicting the mean of y.
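Because R² is just 1 − RSS/TSS, it is straightforward to compute from first principles. A short sketch, reusing the m and b from the closed-form sketch above:

# R^2 from its definition: 1 - RSS/TSS
y_pred = m * data.TV + b                              # predictions from the fitted line
rss = ((data.sales - y_pred) ** 2).sum()              # residual sum of squares
tss = ((data.sales - data.sales.mean()) ** 2).sum()   # total sum of squares
print(1 - rss / tss)   # ~0.61 for sales ~ TV, matching the statsmodels value below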


# create X and y
feature_cols = ['TV']
X = data[feature_cols]
y = data.sales

# follow the usual sklearn pattern: import, instantiate, fit
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)

# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)

Out[]:
7.032593549127693
[0.04753664]

Interpreting the model

How do we interpret the coefficient for spends on TV ad (β1)?

  • A “unit” increase in spends on a TV ad is associated with a 0.047537 “unit” increase in Sales.
  • Or, an additional $1,000 spent on TV ads translates to an increase in sales of about 47.5 widgets (recall that sales are measured in thousands of widgets).

(Had an increase in TV ad expenditure been associated with a decrease in sales, 𝛽1 would have been negative.)

Prediction using the model

If the expense on TV ads is $50,000, what will be the sales prediction for that market?

𝑦= 𝛽0 + 𝛽1𝑥

𝑦=7.032594+0.047537×50

#calculate the prediction
7.032594 + 0.047537*50
Out[23]:9.409444

Thus, we would predict Sales of 9,409 widgets in that market.

Let’s do the same thing using code.

#  Let's create a DataFrame since the model expects it
X_new = pd.DataFrame({'TV': [50]})
X_new.head()
Out[24]:
   TV
0  50
In []:
# use the model to make predictions on a new value
lm.predict(X_new)
Out[25]:
array([9.40942557])

Plotting the Least Squares Line

In []:
# create a DataFrame with the minimum and maximum values of TV
X_new = pd.DataFrame({'TV': [data.TV.min(), data.TV.max()]})
X_new.head()
Out[26]:
      TV
0    0.7
1  296.4
In []:
# make predictions for those x values and store them
preds = lm.predict(X_new)
preds
Out[]: array([ 7.0658692 , 21.12245377])

In []:
# first, plot the observed data
data.plot(kind='scatter', x='TV', y='sales')
# then, plot the least squares line
plt.plot(X_new, preds, c='red', linewidth=2)


Model Confidence

Question: Is linear regression a low bias/high variance model or a high bias/low variance model?

Answer: It’s a high bias/low variance model. Even after repeated sampling, the best fit line will stay roughly in the same position (low variance), but the average of the models created after repeated sampling won’t do a great job of capturing the true relationship (high bias). Low variance is helpful when we don’t have much training data!
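One way to see the “low variance” claim for yourself is to refit the line on bootstrap resamples of the data and watch how little the slope moves (a quick illustrative sketch, not a formal procedure):

# refit the line on bootstrap resamples and inspect the spread of the slope
slopes = []
for i in range(100):
    boot = data.sample(n=len(data), replace=True, random_state=i)
    slope_i, _ = np.polyfit(boot.TV, boot.sales, deg=1)
    slopes.append(slope_i)
print(np.mean(slopes), np.std(slopes))   # the slope barely moves across resamples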

If the model has calculated 95% confidence intervals for our model coefficients, they can be interpreted as follows: if the population from which this sample was drawn were sampled 100 times, approximately 95 of those 100 confidence intervals would contain the “true” coefficients.

In []:
import statsmodels.formula.api as smf
lm = smf.ols(formula='sales ~ TV', data=data).fit()
lm.conf_int()
Out[]:
                  0         1
Intercept  6.129719  7.935468
TV         0.042231  0.052843

Keep in mind that we only have a single sample of data, and not the entire population of data. The “true” coefficient is either within this interval or it isn’t, but there’s no way actually to know. We estimate the coefficient with the data we do have, and we show uncertainty about that estimate by giving a range that the coefficient is probably within.

Note that using 95% confidence intervals is just a convention. You can create 90% confidence intervals (which will be narrower), 99% confidence intervals (which will be wider), or whatever interval you like.
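In statsmodels this is just the alpha argument of conf_int(); alpha is the total tail probability, so alpha=0.05 yields the 95% interval:

lm.conf_int(alpha=0.05)   # 95% interval (the default)
lm.conf_int(alpha=0.10)   # 90% interval: narrower
lm.conf_int(alpha=0.01)   # 99% interval: wider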

Hypothesis Testing and p-values

Hypothesis testing is closely related to confidence intervals. We start with a null hypothesis and an alternative hypothesis (the opposite of the null). Then, we check whether the data rejects the null hypothesis or fails to reject it.

(“Failing to reject” the null hypothesis does not mean “accepting” it. The alternative hypothesis may indeed be true; we may simply not have enough data to show it.)

The conventional hypothesis test is as follows:

  • Null hypothesis: No relationship exists between TV advertisements and Sales (and hence 𝛽1 equals zero).
  • Alternative hypothesis: There exists a relationship between TV advertisements and Sales (and hence, 𝛽1 is not equal to zero).

How do we test this? We reject the null hypothesis (and thus believe the alternative hypothesis) if the 95% confidence interval does not include zero. More precisely, the p-value is the probability of observing a coefficient at least as extreme as the estimated one, if the true coefficient were actually zero.

In [30]:

# print the p-values for the model coefficients
lm.pvalues
Out[30]:
Intercept    1.406300e-35
TV           1.467390e-42
dtype: float64

If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the 95% confidence interval does not include zero, the p-value will be less than 0.05.

Thus, a p-value of less than 0.05 is a way to decide whether there is any relationship between the feature in consideration and the response or not. Using 0.05 as the cutoff is just a convention.

In this case, the p-value for TV ads is way less than 0.05, and so we believe that there is a relationship between TV advertisements and Sales.

Note that we generally ignore the p-value for the intercept.

How Well Does the Model Fit the Data?

One of the most common ways to evaluate the fit of a linear model is by computing the R-squared value. R-squared is the proportion of variance explained, i.e., the proportion of variance in the observed data that the model explains, or the reduction in error over the null model. (A null model only predicts the mean of all the observed responses, and thus has only an intercept and no slope.)

The value of R-squared lies between 0 and 1. A value closer to 1 is better as it means that more variance is explained by the model.

In []:
# print the R-squared value for the model
lm.rsquared

Out[]: 0.611875050850071

Is it a “good” R-squared value? That’s hard to say in isolation. In reality, the domain the data belongs to plays a significant role in deciding the threshold for a good R-squared value; therefore, R² is most useful as a tool for comparing different models, as in the sketch below.
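For example, we can compare the TV-only model with one that uses all three channels, using the same statsmodels formula API as above; since OLS R² cannot decrease when features are added, the richer model explains a larger share of the variance:

# compare R^2 across candidate models
lm_tv  = smf.ols(formula='sales ~ TV', data=data).fit()
lm_all = smf.ols(formula='sales ~ TV + radio + newspaper', data=data).fit()
print(lm_tv.rsquared, lm_all.rsquared)   # the multi-feature model scores higher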
