The simple linear regression model is presented along with examples, problems, and their solutions.
Examples of simple linear regression with real-life data, as well as of multiple linear regression, are also included.
Let us assume that we have a set of ordered pairs \( (x_i , y_i) \), where \( x_i \) is the observed independent variable and \( y_i \) is the corresponding observed dependent variable, with a scatter plot as shown below.
The correlation between \( x \) and \( y \) tells us about the strength of the relationship between \( x \) and \( y \); however, in order to make predictions, we sometimes need to go further and establish a relationship in the form of an equation between \( x \) and \( y \).
If the relationship between the independent observed variable \( x \) and the dependent observed variable \( y \) is close to a linear one, then the simple theoretical linear model may be written as
\[ y = \beta_0 + \beta_1 x + \epsilon \]
\( y \) is the dependent variable that we wish to predict for values of \( x \) not included in the observed data values. \( \epsilon \) is the error, or difference, between the observed (or measured) dependent variable \( y_i \) at some value \( x_i \) and the predicted value of \( y \).
The line in the graph below is that of the linear equation \( y = \beta_0 + \beta_1 x \) whose \( y \) intercept is \( \beta_0 \) and slope \( \beta_1 \).
The graph also shows that \( \epsilon_i \) is the difference between the observed value of the dependent variable \( y_i \) and the value of \( y \) given by the equation \( y = \beta_0 + \beta_1 x \) at \( x = x_i \).
The simple theoretical linear model is valid if:
1) the relationship between \( x \) and \( y \) is approximately linear,
2) the errors \( \epsilon_i \) are independent of each other,
3) the errors \( \epsilon_i \) are normally distributed with mean zero and constant variance.
So far we have dealt with a theoretical model.
Question: Given a set of observed values \( y_i \) and \( x_i \), what are the values of the y intercept \( \beta_0 \) and the slope \( \beta_1 \) that would give a good linear model as described above?
For \( m \) data points \( (x_i , y_i) \), the sum of the squares of all the errors \( \epsilon_i \) is given by
\( SSE = \sum_{i=1}^{m} \epsilon_i^2 \)
with \( \epsilon_i = y_i - \hat y_i \)
where \( \hat y_i = \beta_0 + \beta_1 x_i \)
and therefore
\( SSE = \sum_{i=1}^{m} (y_i - \hat y_i )^2 = \sum_{i=1}^{m} (y_i - \beta_0 - \beta_1 x_i )^2 \)
The method of least squares [1] is used to find the coefficients \( \beta_0 \) and \( \beta_1 \).
From calculus [2], \( SSE \) has a minimum value when
\( \dfrac{\partial (SSE)}{\partial \beta_0} = 0 \quad \) and \( \quad \dfrac{\partial (SSE) }{\partial \beta_1} = 0 \)
\( \dfrac{\partial (SSE)}{\partial \beta_0} = - 2 \sum_{i=1}^{m} (y_i - \beta_0 - \beta_1 x_i) \)
\(\dfrac{\partial (SSE) }{\partial \beta_1} = - 2 \sum_{i=1}^{m} x_i (y_i - \beta_0 - \beta_1 x_i) \)
Setting these derivatives equal to zero gives two equations in the unknowns \( \beta_0 \) and \( \beta_1 \)
\( - 2 \sum_{i=1}^{m} (y_i - \beta_0 - \beta_1 x_i) = 0 \quad (I) \)
\(- 2 \sum_{i=1}^{m} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \quad (II) \)
Divide both sides of equations (I) and (II) by \( -2 \) and rewrite them with terms containing the unknowns \( \beta_0 \) and \( \beta_1 \) on the left.
\( \sum_{i=1}^{m} ( \beta_0 + \beta_1 x_i ) = \sum_{i=1}^{m} y_i \)
\( \sum_{i=1}^{m} (\beta_0 x_i + \beta_1 x_i^2) = \sum_{i=1}^{m} x_i y_i \)
Distribute the sums to obtain
\( \sum_{i=1}^{m} \beta_0 + \beta_1 \sum_{i=1}^{m} x_i = \sum_{i=1}^{m} y_i \)
\( \beta_0 \sum_{i=1}^{m} x_i + \beta_1 \sum_{i=1}^{m} x_i^2 = \sum_{i=1}^{m} x_i y_i \)
The above equations may be simplified to
\( m \beta_0 + \beta_1 \sum_{i=1}^{m} x_i = \sum_{i=1}^{m} y_i \quad (I')\)
\( \beta_0 \sum_{i=1}^{m} x_i + \beta_1 \sum_{i=1}^{m} x_i^2 = \sum_{i=1}^{m} x_i y_i \quad (II') \)
Use Cramer's rule on equations \( (I') \) and \( (II') \) to find \( \beta_1 \).
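In determinant form, Cramer's rule gives
\[ \beta_1 = \dfrac{\begin{vmatrix} m & \sum_{i=1}^{m} y_i \\ \sum_{i=1}^{m} x_i & \sum_{i=1}^{m} x_i y_i \end{vmatrix}}{\begin{vmatrix} m & \sum_{i=1}^{m} x_i \\ \sum_{i=1}^{m} x_i & \sum_{i=1}^{m} x_i^2 \end{vmatrix}} \]
Expanding the determinants, we obtain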
\( \beta_1 = \dfrac{m \sum_{i=1}^{m} x_i y_i - \sum_{i=1}^{m} y_i \sum_{i=1}^{m} x_i }{m \sum_{i=1}^{m} x_i^2 - (\sum_{i=1}^{m} x_i )^2} \)
Divide the numerator and denominator of the above rational expression by \( m \) to obtain
\( \beta_1 = \dfrac{\sum_{i=1}^{m} x_i y_i - \dfrac{\sum_{i=1}^{m} y_i \sum_{i=1}^{m} x_i }{m}} { \sum_{i=1}^{m} x_i^2 - \dfrac{(\sum_{i=1}^{m} x_i )^2}{m} } \)
Use equation \( (I')\) to write
\( \beta_0 = \dfrac{\sum_{i=1}^{m} y_i}{m} - \beta_1 \dfrac{ \sum_{i=1}^{m} x_i}{m} \)
Let \( \bar x = \dfrac {\sum x_i}{m} \) and \( \bar y = \dfrac {\sum y_i}{m} \) and write
\( \beta_0 = \bar y - \beta_1 \bar x \)
We define the sums of squares as
\( SS_x = \sum (x_i - \bar x)^2 \)
\( SS_y = \sum (y_i - \bar y)^2 \)
and the sum of cross product as
\( SS_{xy} = \sum (x_i - \bar x) (y_i - \bar y) \)
Expand and simplify \( SS_x \):
\( SS_x = \sum (x_i - \bar x)^2 = \sum (x_i^2 + \bar x^2 - 2 x_i \bar x) \\
= \sum x_i^2 + m \bar x^2 - 2 \bar x \sum x_i \\
= \sum x_i^2 + m \bar x^2 - 2 m \bar x^2 \\
= \sum x_i^2 - m \bar x^2 \\
= \sum x_i^2 - \dfrac{(\sum x_i)^2}{m} \)
Similarly, it can also be proved that
\( SS_y = \sum y_i^2 - \dfrac{(\sum y_i)^2}{m} \)
\( SS_{xy} = \sum x_i y_i - \dfrac{\sum x_i \sum y_i}{m} \)
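For instance, expanding \( SS_{xy} \) in the same way as \( SS_x \):
\( SS_{xy} = \sum (x_i - \bar x)(y_i - \bar y) = \sum (x_i y_i - \bar y x_i - \bar x y_i + \bar x \bar y) \\
= \sum x_i y_i - \bar y \sum x_i - \bar x \sum y_i + m \bar x \bar y \\
= \sum x_i y_i - m \bar x \bar y - m \bar x \bar y + m \bar x \bar y \\
= \sum x_i y_i - m \bar x \bar y = \sum x_i y_i - \dfrac{\sum x_i \sum y_i}{m} \)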
We can now rewrite \( \beta_1 \) and \( \beta_0 \) as
\[ \hat \beta_1 = \dfrac{SS_{xy}}{SS_{x}} \]
\[ \hat \beta_0 = \bar y - \hat \beta_1 \bar x \]
Note that the "hat" symbol is used to indicate that \( \hat \beta_1 \) and \( \hat \beta_0 \) are the values that minimize the sum of the squared errors \( \sum_{i=1}^{m} \epsilon_i^2 \).
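To check these formulas numerically, here is a minimal Python sketch (not part of the original article; the function name least_squares_fit is our own) that computes \( \hat \beta_1 = SS_{xy} / SS_x \) and \( \hat \beta_0 = \bar y - \hat \beta_1 \bar x \) directly from the data:

```python
def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for the fitted line y = beta0 + beta1 * x."""
    m = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    ss_x = sum_x2 - sum_x ** 2 / m      # SS_x  = sum x_i^2   - (sum x_i)^2 / m
    ss_xy = sum_xy - sum_x * sum_y / m  # SS_xy = sum x_i y_i - (sum x_i)(sum y_i) / m

    beta1 = ss_xy / ss_x                   # slope:     beta1_hat = SS_xy / SS_x
    beta0 = sum_y / m - beta1 * sum_x / m  # intercept: beta0_hat = y_bar - beta1_hat * x_bar
    return beta0, beta1

# Quick check on a perfectly linear data set: should recover the line y = 2x + 1
b0, b1 = least_squares_fit([1, 2, 3], [3, 5, 7])
print(round(b0, 6), round(b1, 6))  # 1.0 2.0
```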
Example 1
Given the data in the table below, find the estimates \( \hat \beta_1 \) and \( \hat \beta_0 \) and write the equation of the regression line.
x | y |
---|---|
0 | 1 |
2 | 5 |
4 | 9 |
5 | 11 |
7 | 15 |
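Applying the formulas derived above to this data set, we have \( m = 5 \), \( \sum x_i = 18 \), \( \sum y_i = 41 \), \( \sum x_i y_i = 206 \) and \( \sum x_i^2 = 94 \), hence
\( SS_{xy} = 206 - \dfrac{18 \times 41}{5} = 58.4 \quad , \quad SS_x = 94 - \dfrac{18^2}{5} = 29.2 \)
\( \hat \beta_1 = \dfrac{58.4}{29.2} = 2 \quad , \quad \hat \beta_0 = \bar y - \hat \beta_1 \bar x = 8.2 - 2 \times 3.6 = 1 \)
so the regression line is \( \hat y = 2x + 1 \), which passes through all five data points exactly.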
Example 2
Given the data in the table below, find the estimates \( \hat \beta_1 \) and \( \hat \beta_0 \) and write the equation of the regression line.
x | y |
---|---|
-2 | 3 |
1 | 1 |
5 | -2 |
6 | -5 |
8 | -6 |
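Running the least_squares_fit sketch from above on this data set (our quick check; output rounded):

```python
x = [-2, 1, 5, 6, 8]
y = [3, 1, -2, -5, -6]
b0, b1 = least_squares_fit(x, y)
print(round(b0, 3), round(b1, 3))  # 1.546 -0.929
```

so the regression line is approximately \( \hat y = -0.929 x + 1.546 \); the negative slope matches the downward trend in the table.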
Example 3
Given the data in the table below, find the estimates \( \hat \beta_1 \) and \( \hat \beta_0 \) and write the equation of the regression line.
x | y |
---|---|
0 | 3 |
1 | -2 |
5 | 1 |
6 | -5 |
8 | 4 |
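For this data set, using the standard formula for the correlation coefficient \( r = \dfrac{SS_{xy}}{\sqrt{SS_x SS_y}} \) (stated here without derivation), we have \( SS_{xy} = 1 \), \( SS_x = 46 \) and \( SS_y = 54.8 \), so \( r = \dfrac{1}{\sqrt{46 \times 54.8}} \approx 0.02 \). Such a weak correlation suggests that a linear model would not be appropriate for this data set.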
Example 4
The data sets in Examples 1 and 2 above are shown in the tables below.
a)
x | y |
---|---|
0 | 1 |
2 | 5 |
4 | 9 |
5 | 11 |
7 | 15 |
b)
x | y |
---|---|
-2 | 3 |
1 | 1 |
5 | -2 |
6 | -5 |
8 | -6 |
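To compare the two data sets and their fitted lines visually, here is one possible sketch (it assumes matplotlib is installed and reuses the least_squares_fit function defined earlier):

```python
import matplotlib.pyplot as plt

datasets = {
    "a)": ([0, 2, 4, 5, 7], [1, 5, 9, 11, 15]),
    "b)": ([-2, 1, 5, 6, 8], [3, 1, -2, -5, -6]),
}

for label, (x, y) in datasets.items():
    b0, b1 = least_squares_fit(x, y)
    plt.scatter(x, y, label=f"{label} data")
    # Draw the fitted line between the smallest and largest x values
    plt.plot([min(x), max(x)],
             [b0 + b1 * min(x), b0 + b1 * max(x)],
             label=f"{label} fit: y = {b1:.3f}x + {b0:.3f}")

plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```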
Problem 1
For each data set below,
A) make a scatter plot,
B) calculate the correlation coefficient using Excel or any other software application such as Google Sheets, LibreOffice ...
C) calculate the coefficient of determination using Excel or any other software application,
D) decide which data set(s) may be appropriately modeled using the simple linear regression model, and find \( \hat \beta_0 \) and \( \hat \beta_1 \) using Excel or any other software application such as Google Sheets, LibreOffice ...
Problem 2
Given the data set below,
A) make a scatter plot,
B) calculate the correlation coefficient using Excel or any other software application such as Google Sheets, LibreOffice ...
C) use Excel to determine the coefficients \( \hat \beta_0 \) and \( \hat \beta_1 \) of the linear regression,
D) use the linear regression model to predict the value of \( y \) for \( x = 1.02 \).
Solution to Problem 1
A)
The scatter plots of the data sets are shown below.
B)
The correlation of each data set, calculated using the Excel "Correlation" function (red) and the Excel "Data Analysis" tool (blue), is shown below.
C)
Using Excel to run a simple linear regression on each data set, we obtain the following results, where \( r^2 \) is the coefficient of determination.
D)
In part B) above we found the correlation coefficients, and we can see that the absolute values of the correlations of data sets a) and b) are close to 1. Their coefficients of determination \( r^2 \) are also close to 1. In fact we can check that, for each data set, the square of the correlation coefficient is equal to the coefficient of determination.
Hence both data sets a) and b) may be modeled using a linear regression model. Both the correlation and the coefficient of determination of data set c) are close to zero, and therefore a linear regression model for this data set would not be a good one.
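The relationship between the correlation coefficient and the coefficient of determination can be checked numerically. A minimal Python sketch (the function name correlation is our own), using the standard formula \( r = SS_{xy} / \sqrt{SS_x SS_y} \):

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation coefficient r = SS_xy / sqrt(SS_x * SS_y)."""
    m = len(x)
    ss_x = sum(xi ** 2 for xi in x) - sum(x) ** 2 / m
    ss_y = sum(yi ** 2 for yi in y) - sum(y) ** 2 / m
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / m
    return ss_xy / sqrt(ss_x * ss_y)

# For the data of Example 2: r is about -0.979, so r**2 is about 0.958,
# which matches the coefficient of determination of the corresponding fit.
r = correlation([-2, 1, 5, 6, 8], [3, 1, -2, -5, -6])
print(round(r, 3), round(r ** 2, 3))  # -0.979 0.958
```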
Solution to Problem 2
A)
The scatter plot of the data set is shown below.
B)
Using Excel, the correlation is given by
C)
The use of Excel gives the following results
D)
\( \hat y = \hat \beta_1 x + \hat \beta_0 = 1.596447 x + 0.758926 \)
Substitute \( x \) with its numerical value \( 1.02 \):
\( \hat y = 1.596447 \times 1.02 + 0.758926 = 2.38730194 \)
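As a final sanity check of this prediction in Python (coefficients copied from the regression output above):

```python
beta0, beta1 = 0.758926, 1.596447  # coefficients from the Excel output above
print(beta1 * 1.02 + beta0)        # prints approximately 2.38730194
```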