2.3. Extensions/limitations of linear regression

Read then Launch

This content is best viewed in HTML because Jupyter Notebook cannot display some content (e.g. figures and equations) properly. You should finish reading this page first and then launch it as an interactive notebook in Google Colab (faster, Google account needed) or Binder by clicking the rocket symbol at the top.

This section studies how to extend linear regression to more complicated types of variables and relationships, and the limitations of linear regression. We will use three datasets to illustrate these concepts:

  1. The Advertising dataset that has been used in the previous sections.

  2. The Credit dataset contains information about credit card holders. The goal is to predict the Balance (average credit card debt) of each card holder.

  3. The Auto dataset contains information about the characteristics of cars. The goal is to predict the mpg (miles per gallon) of a car based on features such as horsepower, weight, and acceleration. This dataset was also used in the Prerequisites tutorial.

You are encouraged to click the links to the datasets and explore the variables and understand their relationships.

2.3.1. Import libraries and load data

Get ready by importing the APIs needed from respective libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import scale
from sklearn.linear_model import LinearRegression

from statsmodels.formula.api import ols

%matplotlib inline

Load the Advertising dataset, which we are already familiar with.

data_url = "https://github.com/pykale/transparentML/raw/main/data/Advertising.csv"
advertising_df = pd.read_csv(data_url, header=0, index_col=0)

Load the Credit dataset, map the Student category to a new numeric column Student2 (0 and 1), and inspect the first three rows.

credit_url = "https://github.com/pykale/transparentML/raw/main/data/Credit.csv"

credit_df = pd.read_csv(credit_url)
credit_df["Student2"] = credit_df.Student.map({"No": 0, "Yes": 1})
credit_df.head(3)
    Income  Limit  Rating  Cards  Age  Education  Own Student Married Region  Balance  Student2
0   14.891   3606     283      2   34         11   No      No     Yes  South      333         0
1  106.025   6645     483      3   82         15  Yes     Yes     Yes   West      903         1
2  104.593   7075     514      4   71         11   No      No      No   West      580         0

Load the Auto dataset, drop rows with missing values, and inspect the columns, counts, and data types.

auto_url = "https://github.com/pykale/transparentML/raw/main/data/Auto.csv"

auto_df = pd.read_csv(auto_url, na_values="?").dropna()
auto_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 392 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 30.6+ KB

2.3.2. Qualitative predictors

In the previous sections, the predictor variables are quantitative. Here, in the Credit dataset, the response is Balance (average credit card debt for each individual) and there are six quantitative predictors: Age, Cards (number of credit cards), Education (years of education), Income (in thousands of dollars), Limit (credit limit), and Rating (credit rating). The following code displays the pairwise relationships between these variables (Figure 3.6 in the textbook).

sns.pairplot(
    credit_df[["Balance", "Age", "Cards", "Education", "Income", "Limit", "Rating"]]
)
plt.show()
../_images/extension-limitation_9_0.png

However, in practice, predictor variables are not always quantitative. For example, the Credit dataset also contains four qualitative (categorical) variables: Student (Yes or No), Own (Yes or No), Married (Yes or No), and Region (South, East, or West).

Predictors with only two levels

If a qualitative predictor has only two levels (possible values), we can represent it with a single indicator or dummy variable. For example, the Student variable has two levels, Yes and No, which can be coded as 1 and 0 and then treated as a quantitative variable. This coding of qualitative variables is done automatically by the statsmodels library.
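
As a minimal check, the sketch below hand-codes Own as 0/1 and refits; under this assumption it should reproduce the automatic coding shown in the table that follows (the own01 name is only for illustration).

# Sketch: hand-coding Own's two levels as 0/1 gives the same fit that the
# automatic coding produces below (own01 is an illustrative column name).
own01 = credit_df.Own.map({"No": 0, "Yes": 1})
print(ols("Balance ~ own01", credit_df.assign(own01=own01)).fit().params)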

Run the following code to get the least squares coefficient estimates associated with the regression of Balance onto Own in the Credit dataset (Table 3.7 in the textbook).

est = ols("Balance ~ Own", credit_df).fit()
est.summary().tables[1]
                 coef   std err        t   P>|t|    [0.025    0.975]
Intercept    509.8031    33.128   15.389   0.000   444.675   574.931
Own[T.Yes]    19.7331    46.051    0.429   0.669   -70.801   110.267

The large \(p\)-value above suggests that there is no evidence of a relationship between Balance and Own.

Predictors with more than two levels

When a qualitative predictor has more than two levels, we need to create additional dummy variables. For example, the Region variable has three levels: South, East, and West. We can create two dummy variables, South and West, and use them in the regression model. There is always one fewer dummy variable than the number of levels of the qualitative variable, because the omitted level can be inferred from the others: if South is 0 and West is 0, the individual must be from the East region, so East serves as the baseline. Again, such coding is done automatically by the statsmodels library.
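
To see this coding explicitly (it is not required for the regression itself), a short sketch using pandas' get_dummies on the credit_df loaded above:

# Sketch: the dummy coding that statsmodels applies automatically, made
# explicit with pandas' get_dummies (drop_first drops the baseline level).
region_dummies = pd.get_dummies(
    credit_df["Region"], prefix="Region", drop_first=True, dtype=int
)
print(region_dummies.head(3))  # Region_South and Region_West; East is the baseline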

Run the following code to get the least squares coefficient estimates associated with the regression of Balance onto Region in the Credit dataset (Table 3.8 in the textbook).

est = ols("Balance ~ Region", credit_df).fit()
est.summary().tables[1]
                     coef   std err        t   P>|t|    [0.025    0.975]
Intercept        531.0000    46.319   11.464   0.000   439.939   622.061
Region[T.South]  -12.5025    56.681   -0.221   0.826  -123.935    98.930
Region[T.West]   -18.6863    65.021   -0.287   0.774  -146.515   109.142

The large \(p\)-values above suggest that there is no evidence of a relationship between Balance and Region.

2.3.3. Extensions of linear regression

2.3.3.1. Interaction between variables

In the previous analysis of the Advertising dataset, the predictor variables were assumed to affect the response independently of each other. However, this assumption may be incorrect. For example, radio advertising can increase the effectiveness of TV advertising, which is known as an interaction effect in statistics. Consider the standard multiple linear regression with two variables

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon. \]

This model can be extended to include an interaction term between \(x_1\) and \(x_2\):

(2.22)\[\begin{align} y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \\ &= \beta_0 + (\beta_1 + \beta_3 x_2) x_1 + \beta_2 x_2 + \epsilon \\ &= \beta_0 + \tilde{\beta}_1 x_1 + \beta_2 x_2 + \epsilon, \end{align}\]

where \(\tilde{\beta}_1 = \beta_1 + \beta_3 x_2\) is the effective coefficient on \(x_1\), itself a linear function of \(x_2\). The association between \(y\) and \(x_1\) is therefore no longer constant: it depends on the value of \(x_2\). Similarly, the association between \(y\) and \(x_2\) depends on the value of \(x_1\).

Run the following code for the Advertising data to get the least squares coefficient estimates associated with the regression of Sales onto TV and Radio, with an interaction term TV \(\times\) Radio (Table 3.9 of the textbook).

est = ols("Sales ~ TV + Radio + TV*Radio", advertising_df).fit()
est.summary().tables[1]
               coef   std err        t   P>|t|   [0.025   0.975]
Intercept    6.7502     0.248   27.233   0.000    6.261    7.239
TV           0.0191     0.002   12.699   0.000    0.016    0.022
Radio        0.0289     0.009    3.241   0.001    0.011    0.046
TV:Radio     0.0011  5.24e-05   20.727   0.000    0.001    0.001

The small \(p\)-value for the interaction term suggests that the true relationship is not additive: there is strong evidence of an association between Sales and the interaction between TV and Radio. Without the interaction term, our understanding of the relationships among Sales, TV, and Radio would be incomplete.
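
To make the effective coefficient \(\tilde{\beta}_1 = \beta_1 + \beta_3 x_2\) concrete, the sketch below reuses the fitted model est from the previous cell and evaluates the TV slope at two illustrative Radio budgets (10 and 40, chosen arbitrarily).

# Sketch: effective TV coefficient (beta_1 + beta_3 * Radio) from the
# interaction model `est` fitted above, at two illustrative Radio budgets.
for radio in (10, 40):
    tv_slope = est.params["TV"] + est.params["TV:Radio"] * radio
    print(f"Radio = {radio}: estimated change in Sales per unit of TV = {tv_slope:.4f}")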

2.3.3.1.1. Interaction between qualitative and quantitative variables

The concept of interactions also applies to qualitative variables. In fact, sometimes an interaction between a qualitative variable and a quantitative variable has a particularly nice interpretation. For example, run the following code to predict Balance using the Income (quantitative) and Student (qualitative) variables, and compare the differences between including and excluding the interaction term between Income and Student.

est1 = ols("Balance ~ Income + Student2", credit_df).fit()
est2 = ols("Balance ~ Income + Income*Student2", credit_df).fit()

print("Regression 1 - without interaction term")
print(est1.params)
print("\nRegression 2 - with interaction term")
print(est2.params)
Regression 1 - without interaction term
Intercept    211.142964
Income         5.984336
Student2     382.670539
dtype: float64

Regression 2 - with interaction term
Intercept          200.623153
Income               6.218169
Student2           476.675843
Income:Student2     -1.999151
dtype: float64

Comparing the two fits, adding the interaction term changes the Student2 coefficient substantially (from about 383 to 477) and introduces a negative Income:Student2 term, so the Income slope now differs between students and non-students. To take a closer look, let us visualise the relationship between Balance and Income for students and non-students separately (Figure 3.7 of the textbook).

# Income (x-axis)
regr1 = est1.params
regr2 = est2.params

income = np.linspace(0, 150)

# Balance without interaction term (y-axis)
student1 = np.linspace(
    regr1["Intercept"] + regr1["Student2"],
    regr1["Intercept"] + regr1["Student2"] + 150 * regr1["Income"],
)
non_student1 = np.linspace(
    regr1["Intercept"], regr1["Intercept"] + 150 * regr1["Income"]
)

# Balance with interaction term (y-axis)
student2 = np.linspace(
    regr2["Intercept"] + regr2["Student2"],
    regr2["Intercept"]
    + regr2["Student2"]
    + 150 * (regr2["Income"] + regr2["Income:Student2"]),
)
non_student2 = np.linspace(
    regr2["Intercept"], regr2["Intercept"] + 150 * regr2["Income"]
)

# Create plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(income, student1, "r-", income, non_student1, "k-")
ax2.plot(income, student2, "r--", income, non_student2, "k--")

for ax in fig.axes:
    ax.legend(["student", "non-student"], loc=2)
    ax.set_xlabel("Income")
    ax.set_ylabel("Balance")
    ax.set_ylim(ymax=1550)
../_images/extension-limitation_22_0.png

In the left figure without the interaction term between Income and Student, we see two (solid) parallel lines, one for students and one for non-students. The two lines have different intercepts but the same slope, indicating that a one-unit increase in Income will have the same effect on Balance for students and non-students, which is counterintuitive.

In the right figure with the interaction term between Income and Student, the two (dashed) lines are no longer parallel. We see that the slope of the line for students is flatter than that for non-students, indicating that a one-unit increase in Income will have a much smaller effect on credit card Balance for students than for non-students. This is consistent with our intuition.
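
To read these two slopes directly from the fitted parameters, a short sketch using est2 (the interaction model fitted above):

# Sketch: the Income slopes implied by the interaction model `est2` above.
student_slope = est2.params["Income"] + est2.params["Income:Student2"]
non_student_slope = est2.params["Income"]
print(f"Income slope for students:     {student_slope:.3f}")
print(f"Income slope for non-students: {non_student_slope:.3f}")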

2.3.3.2. Nonlinear relationships

In some cases, the true relationship between the response and the predictors may be nonlinear. Here we present a very simple way to directly extend the linear model to accommodate nonlinear relationships, using polynomial regression.

Run the following code to observe the relationship between mpg (gas mileage in miles per gallon) and horsepower.

plt.scatter(
    auto_df.horsepower, auto_df.mpg, facecolors="None", edgecolors="k", alpha=0.5
)
plt.xlabel("horsepower")
plt.ylabel("mpg")
../_images/extension-limitation_25_1.png

The figure shows a strong relationship between mpg and horsepower, but it is clearly nonlinear. We can accommodate this with polynomial regression, for example by fitting a quadratic (degree-2) model:

(2.23)\[\begin{equation} \textrm{mpg} = \beta_0 + \beta_1 \times \textrm{horsepower} + \beta_2 \times \textrm{horsepower}^2 + \epsilon. \end{equation}\]

Run the following code to display the relationship between mpg (gas mileage in miles per gallon) and \(\textrm{horsepower}^2\) as well as \(\textrm{horsepower}^5\) (Figure 3.8 of the textbook) for cars in the Auto dataset.

# With Seaborn's regplot() you can easily plot higher order polynomials.
plt.scatter(
    auto_df.horsepower, auto_df.mpg, facecolors="None", edgecolors="k", alpha=0.5
)

sns.regplot(
    x=auto_df.horsepower,
    y=auto_df.mpg,
    ci=None,
    label="Linear",
    scatter=False,
    color="orange",
)
sns.regplot(
    x=auto_df.horsepower,
    y=auto_df.mpg,
    ci=None,
    label="Degree 2",
    order=2,
    scatter=False,
    color="lightblue",
)
sns.regplot(
    x=auto_df.horsepower,
    y=auto_df.mpg,
    ci=None,
    label="Degree 5",
    order=5,
    scatter=False,
    color="g",
)
plt.legend()
plt.ylim(5, 55)
plt.xlim(40, 240);
../_images/extension-limitation_27_0.png

The above figure seems to suggest that a quadratic relationship between mpg and horsepower is more appropriate than a linear relationship.

In fact, this is still a linear model with \(x_1 = \textrm{horsepower}\) and \(x_2 = \textrm{horsepower}^2\). Run the following code to create a new predictor horsepower2 and inspect the Auto dataset.

auto_df["horsepower2"] = auto_df.horsepower**2
auto_df.head(3)
    mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name  horsepower2
0  18.0          8         307.0       130.0    3504          12.0    70       1  chevrolet chevelle malibu      16900.0
1  15.0          8         350.0       165.0    3693          11.5    70       1          buick skylark 320      27225.0
2  18.0          8         318.0       150.0    3436          11.0    70       1         plymouth satellite      22500.0

Then run the following code to fit a linear regression model with mpg as the response and horsepower and horsepower2 as the predictors (Table 3.10 of the textbook).

est = ols("mpg ~ horsepower + horsepower2", auto_df).fit()
est.summary().tables[1]
                 coef   std err        t   P>|t|   [0.025   0.975]
Intercept     56.9001     1.800   31.604   0.000   53.360   60.440
horsepower    -0.4662     0.031  -14.978   0.000   -0.527   -0.405
horsepower2    0.0012     0.000   10.080   0.000    0.001    0.001
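
The same quadratic model can also be fitted without creating the horsepower2 column, by writing the squared term directly in the formula (this pattern is used again in the exercise solutions later). A sketch, with est_poly as an illustrative name:

# Sketch: quadratic term written directly in the patsy formula,
# equivalent to the horsepower + horsepower2 fit above.
est_poly = ols("mpg ~ horsepower + np.power(horsepower, 2)", auto_df).fit()
print(est_poly.params)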

2.3.3.3. Limitations of linear regression

Every model has limitations, and linear regression is no exception. Here we briefly examine two limitations of linear regression to illustrate the complexities of real-world data. Further limitations are discussed in the textbook.

2.3.3.3.1. Nonlinearity of the data

The linear regression model assumes that there is a “straight-line” relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are questionable. In addition, the prediction accuracy of the model can be significantly reduced.

Run the following code to compare a linear regression model and a quadratic regression model for regressing mpg onto horsepower.

First, fit a linear regression model and a quadratic regression model regressing mpg onto horsepower, and compute the residuals of each model.

regr = LinearRegression()

# Linear fit
X = auto_df.horsepower.values.reshape(-1, 1)
y = auto_df.mpg
regr.fit(X, y)

auto_df["pred1"] = regr.predict(X)
auto_df["resid1"] = auto_df.mpg - auto_df.pred1

# Quadratic fit
X2 = auto_df[["horsepower", "horsepower2"]].values
regr.fit(X2, y)

auto_df["pred2"] = regr.predict(X2)
auto_df["resid2"] = auto_df.mpg - auto_df.pred2

Create plots of the residuals for each model. What do you observe?

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left plot
sns.regplot(
    x=auto_df.pred1,
    y=auto_df.resid1,
    lowess=True,
    ax=ax1,
    line_kws={"color": "r", "lw": 1},
    scatter_kws={"facecolors": "None", "edgecolors": "k", "alpha": 0.5},
)
ax1.hlines(
    0,
    xmin=ax1.xaxis.get_data_interval()[0],
    xmax=ax1.xaxis.get_data_interval()[1],
    linestyles="dotted",
)
ax1.set_title("Residual Plot for Linear Fit")

# Right plot
sns.regplot(
    x=auto_df.pred2,
    y=auto_df.resid2,
    lowess=True,
    line_kws={"color": "r", "lw": 1},
    ax=ax2,
    scatter_kws={"facecolors": "None", "edgecolors": "k", "alpha": 0.5},
)
ax2.hlines(
    0,
    xmin=ax2.xaxis.get_data_interval()[0],
    xmax=ax2.xaxis.get_data_interval()[1],
    linestyles="dotted",
)
ax2.set_title("Residual Plot for Quadratic Fit")

for ax in fig.axes:
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
../_images/extension-limitation_35_0.png
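
To put a number on the pattern seen in the residual plots, the sketch below (reusing X, X2, and y from the cells above; lin_fit and quad_fit are illustrative names) compares the \(R^2\) of the two fits.

# Sketch: compare the R^2 of the linear and quadratic fits
# (X, X2 and y are defined in the cells above).
from sklearn.linear_model import LinearRegression

lin_fit = LinearRegression().fit(X, y)
quad_fit = LinearRegression().fit(X2, y)
print(f"R^2, linear fit:    {lin_fit.score(X, y):.3f}")
print(f"R^2, quadratic fit: {quad_fit.score(X2, y):.3f}")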

2.3.3.4. Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another.
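
A quick first check for collinearity, sketched below using the credit_df loaded earlier, is the pairwise correlation matrix of the predictors examined in this subsection; Limit and Rating should appear very highly correlated, consistent with the right panel of the figure below.

# Sketch: pairwise correlations of the predictors examined below.
print(credit_df[["Age", "Limit", "Rating"]].corr())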

Run the following code to illustrate the problem of collinearity in the Credit dataset (Figure 3.14 of the textbook).

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left plot
ax1.scatter(credit_df.Limit, credit_df.Age, facecolor="None", edgecolor="r")
ax1.set_ylabel("Age")

# Right plot
ax2.scatter(credit_df.Limit, credit_df.Rating, facecolor="None", edgecolor="r")
ax2.set_ylabel("Rating")

for ax in fig.axes:
    ax.set_xlabel("Limit")
    ax.set_xticks([2000, 4000, 6000, 8000, 12000])
../_images/extension-limitation_38_0.png

In the left panel, the two predictors Limit and Age appear to have no obvious relationship. In contrast, in the right panel, the predictors Limit and Rating are very highly correlated with each other, and we say that they are collinear. Because Limit and Rating tend to increase or decrease together, it can be difficult to determine how each one separately is associated with the response Balance.

Run the following code to see the difficulties that can result from collinearity in the context of the Credit dataset.

y = credit_df.Balance

# Regression for left plot
X = credit_df[["Age", "Limit"]].values
regr1 = LinearRegression()
regr1.fit(scale(X.astype("float"), with_std=False), y)
print("Age/Limit\n", regr1.intercept_)
print(regr1.coef_)

# Regression for right plot
X2 = credit_df[["Rating", "Limit"]].values
regr2 = LinearRegression()
regr2.fit(scale(X2.astype("float"), with_std=False), y)
print("\nRating/Limit\n", regr2.intercept_)
print(regr2.coef_)
Age/Limit
 520.0150000000001
[-2.29148553  0.17336497]

Rating/Limit
 520.015
[2.20167217 0.02451438]

Create grid coordinates for plotting and then calculate the RSS over the grid of coefficient values.

# grid coordinates
beta_age = np.linspace(regr1.coef_[0] - 3, regr1.coef_[0] + 3, 100)
beta_limit = np.linspace(regr1.coef_[1] - 0.02, regr1.coef_[1] + 0.02, 100)

beta_rating = np.linspace(regr2.coef_[0] - 3, regr2.coef_[0] + 3, 100)
beta_limit2 = np.linspace(regr2.coef_[1] - 0.2, regr2.coef_[1] + 0.2, 100)

X1, Y1 = np.meshgrid(beta_limit, beta_age, indexing="xy")
X2, Y2 = np.meshgrid(beta_limit2, beta_rating, indexing="xy")
Z1 = np.zeros((beta_age.size, beta_limit.size))
Z2 = np.zeros((beta_rating.size, beta_limit2.size))

limit_scaled = scale(credit_df.Limit.astype("float"), with_std=False)
age_scaled = scale(credit_df.Age.astype("float"), with_std=False)
rating_scaled = scale(credit_df.Rating.astype("float"), with_std=False)

# calculate RSS
for (i, j), v in np.ndenumerate(Z1):
    Z1[i, j] = (
        (y - (regr1.intercept_ + X1[i, j] * limit_scaled + Y1[i, j] * age_scaled)) ** 2
    ).sum() / 1000000

for (i, j), v in np.ndenumerate(Z2):
    Z2[i, j] = (
        (y - (regr2.intercept_ + X2[i, j] * limit_scaled + Y2[i, j] * rating_scaled))
        ** 2
    ).sum() / 1000000

Plot the contours of the RSS with respect to the coefficients of the two linear regression models in 2D.


fig = plt.figure(figsize=(12, 5))
fig.suptitle("RSS - Regression coefficients", fontsize=20)

ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

min_RSS = r"$\beta_0$, $\beta_1$ for minimized RSS"

# Left plot
contour1 = ax1.contour(X1, Y1, Z1, cmap=plt.cm.Set1, levels=[21.25, 21.5, 21.8])
ax1.scatter(regr1.coef_[1], regr1.coef_[0], c="r", label=min_RSS)
ax1.clabel(contour1, inline=True, fontsize=10, fmt="%1.1f")
ax1.set_ylabel(r"$\beta_{Age}$", fontsize=17)

# Right plot
contour2 = ax2.contour(X2, Y2, Z2, cmap=plt.cm.Set1, levels=[21.5, 21.8])
ax2.scatter(regr2.coef_[1], regr2.coef_[0], c="r", label=min_RSS)
ax2.clabel(contour2, inline=True, fontsize=10, fmt="%1.1f")
ax2.set_ylabel(r"$\beta_{Rating}$", fontsize=17)
ax2.set_xticks([-0.1, 0, 0.1, 0.2])

for ax in fig.axes:
    ax.set_xlabel(r"$\beta_{Limit}$", fontsize=17)
    ax.legend()
../_images/extension-limitation_44_0.png

Left: A contour plot of RSS for the regression of balance onto age and limit. The minimum value is well defined.

Right: A contour plot of RSS for the regression of balance onto rating and limit. Because of the collinearity, there are many pairs (\(\beta_{\text{Limit}}\) , \(\beta_{\text{Rating}}\)) with a similar value for RSS.

Collinearity reduces the accuracy of the estimates of the regression coefficients: it causes the standard error of \(\hat{\beta}_j\) to grow. As a result, in the presence of collinearity we may fail to reject the null hypothesis \(H_0: \beta_j = 0\). This means that the power of the hypothesis test, i.e. the probability of correctly detecting a non-zero coefficient, is reduced by collinearity.

Run the following code to compare the coefficient estimates obtained from two separate multiple regression models using statsmodels, where the first model is a regression of balance on age and limit, and the second model is a regression of balance on rating and limit.

est_age_limit = ols("Balance ~ Age + Limit", credit_df).fit()
est_age_limit.summary().tables[1]
                coef   std err        t   P>|t|    [0.025    0.975]
Intercept  -173.4109    43.828   -3.957   0.000  -259.576   -87.246
Age          -2.2915     0.672   -3.407   0.001    -3.614    -0.969
Limit         0.1734     0.005   34.496   0.000     0.163     0.183
est_rating_limit = ols("Balance ~ Rating + Limit", credit_df).fit()
est_rating_limit.summary().tables[1]
                coef   std err        t   P>|t|    [0.025    0.975]
Intercept  -377.5368    45.254   -8.343   0.000  -466.505  -288.569
Rating        2.2017     0.952    2.312   0.021     0.330     4.074
Limit         0.0245     0.064    0.384   0.701    -0.101     0.150

The above generates Table 3.11 of the textbook.
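
A common numerical diagnostic for collinearity is the variance inflation factor (VIF). The sketch below, assuming credit_df from above, computes the VIF for each predictor in the two models using statsmodels' variance_inflation_factor; values far above 1 (rules of thumb often use 5 or 10) indicate problematic collinearity.

# Sketch: VIFs for the two predictor pairs compared above.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

for cols in (["Age", "Limit"], ["Rating", "Limit"]):
    exog = sm.add_constant(credit_df[cols])
    vifs = [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])]
    print(cols, [round(v, 2) for v in vifs])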

2.3.4. Exercise

1. All the following exercises involve the use of the Boston dataset. We will now try to predict per capita crime rate using the other variables in this dataset. In other words, per capita crime rate is the response, and the other variables are the predictors.

For each predictor, fit a simple linear regression model to predict the response. Don’t forget to handle any NULL values. Hint: See section 2.2.2.

# Write your code below to answer the question

Compare your answer with the reference solution below

from statsmodels.formula.api import ols
import pandas as pd

boston_url = "https://github.com/pykale/transparentML/raw/main/data/Boston.csv"

boston_df = pd.read_csv(boston_url).dropna()


models_a = [
    ols(formula="crim ~ {}".format(f), data=boston_df).fit()
    for f in boston_df.columns[1:]
]

for model in models_a:
    print(model.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.4537      0.417     10.675      0.000       3.634       5.273
zn            -0.0739      0.016     -4.594      0.000      -0.106      -0.042
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.0637      0.667     -3.093      0.002      -3.375      -0.753
indus          0.5098      0.051      9.991      0.000       0.410       0.610
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.7444      0.396      9.453      0.000       2.966       4.523
chas          -1.8928      1.506     -1.257      0.209      -4.852       1.066
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -13.7199      1.699     -8.073      0.000     -17.059     -10.381
nox           31.2485      2.999     10.419      0.000      25.356      37.141
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     20.4818      3.364      6.088      0.000      13.872      27.092
rm            -2.6841      0.532     -5.045      0.000      -3.729      -1.639
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.7779      0.944     -4.002      0.000      -5.633      -1.923
age            0.1078      0.013      8.463      0.000       0.083       0.133
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      9.4993      0.730     13.006      0.000       8.064      10.934
dis           -1.5509      0.168     -9.213      0.000      -1.882      -1.220
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.2872      0.443     -5.157      0.000      -3.158      -1.416
rad            0.6179      0.034     17.998      0.000       0.550       0.685
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -8.5284      0.816    -10.454      0.000     -10.131      -6.926
tax            0.0297      0.002     16.099      0.000       0.026       0.033
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -17.6469      3.147     -5.607      0.000     -23.830     -11.464
ptratio        1.1520      0.169      6.801      0.000       0.819       1.485
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     16.5535      1.426     11.609      0.000      13.752      19.355
black         -0.0363      0.004     -9.367      0.000      -0.044      -0.029
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.3305      0.694     -4.801      0.000      -4.694      -1.968
lstat          0.5488      0.048     11.491      0.000       0.455       0.643
==============================================================================
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     11.7965      0.934     12.628      0.000       9.961      13.632
medv          -0.3632      0.038     -9.460      0.000      -0.439      -0.288
==============================================================================

2. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

# Write your code below to answer the question

Compare your answer with the reference solution below

print("p < 0.05\n")

for model in models_a:
    if model.pvalues[1] < 0.05:
        print(model.params[1:].index[0])

print("\np > 0.05\n")

for model in models_a:
    if model.pvalues[1] > 0.05:
        print(model.params[1:].index[0])

#  zn, indu,s nox, rm, ag, dis, rad, tax, ptratio, black, lstat, medv  are higly significant with very small p values.


def plot_grid(df, response, cols):
    variables = df.columns.drop(response)
    for i in range(0, len(variables), cols):
        g = sns.pairplot(
            df, y_vars=[response], x_vars=variables[i : i + cols], height=4
        )
        g.map(sns.regplot)
    return


plot_grid(boston_df, "crim", 5)
p < 0.05

zn
indus
nox
rm
age
dis
rad
tax
ptratio
black
lstat
medv

p > 0.05

chas
../_images/extension-limitation_58_1.png ../_images/extension-limitation_58_2.png ../_images/extension-limitation_58_3.png

3. Fit a multiple regression model to predict the response using all of the predictors. Hint: See section 2.3.3.

# Write your code below to answer the question

Compare your answer with the reference solution below

response = "crim"
predictors = boston_df.columns.drop(response)
f = "{} ~ {}".format(response, "+".join(predictors))

model_b = ols(formula=f, data=boston_df).fit()
model_b.summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.454
Model: OLS Adj. R-squared: 0.440
Method: Least Squares F-statistic: 31.47
Date: Tue, 06 Feb 2024 Prob (F-statistic): 1.57e-56
Time: 20:20:16 Log-Likelihood: -1653.3
No. Observations: 506 AIC: 3335.
Df Residuals: 492 BIC: 3394.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 17.0332 7.235 2.354 0.019 2.818 31.248
zn 0.0449 0.019 2.394 0.017 0.008 0.082
indus -0.0639 0.083 -0.766 0.444 -0.228 0.100
chas -0.7491 1.180 -0.635 0.526 -3.068 1.570
nox -10.3135 5.276 -1.955 0.051 -20.679 0.052
rm 0.4301 0.613 0.702 0.483 -0.774 1.634
age 0.0015 0.018 0.081 0.935 -0.034 0.037
dis -0.9872 0.282 -3.503 0.001 -1.541 -0.433
rad 0.5882 0.088 6.680 0.000 0.415 0.761
tax -0.0038 0.005 -0.733 0.464 -0.014 0.006
ptratio -0.2711 0.186 -1.454 0.147 -0.637 0.095
black -0.0075 0.004 -2.052 0.041 -0.015 -0.000
lstat 0.1262 0.076 1.667 0.096 -0.023 0.275
medv -0.1989 0.061 -3.287 0.001 -0.318 -0.080
Omnibus: 666.613 Durbin-Watson: 1.519
Prob(Omnibus): 0.000 Jarque-Bera (JB): 84887.625
Skew: 6.617 Prob(JB): 0.00
Kurtosis: 65.058 Cond. No. 1.58e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.58e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

4. Which features seem to be statistically significant in this model?

# Write your code below to answer the question

Compare your answer with the reference solution below

print("p < 0.05\n")
model_b.pvalues[model_b.pvalues < 0.05]

# Only the features listed below are statistically significant in this model.
p < 0.05
Intercept    1.894909e-02
zn           1.702489e-02
dis          5.022039e-04
rad          6.460451e-11
black        4.070233e-02
medv         1.086810e-03
dtype: float64

5. How do your results from Exercise 1 compare to your results from Exercise 3? Create a plot displaying the univariate regression coefficients from Exercise 1 on the \(x\)-axis, and the multiple regression coefficients from Exercise 3 on the \(y\)-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the \(x\)-axis, and its coefficient estimate in the multiple linear regression model is shown on the \(y\)-axis.

# Write your code below to answer the question

Compare your answer with the reference solution below

# Get coefficients
univariate_params = pd.concat([m.params[1:] for m in models_a])
multivariate_params = model_b.params[1:]
df = pd.DataFrame(
    {
        "Univariate_coef": univariate_params,
        "Multivariate_coef": multivariate_params,
    }
)
print(df)

plt.figure(figsize=(20, 10))
ax = sns.scatterplot(x="Univariate_coef", y="Multivariate_coef", data=df)

# Multiple regression finds 5 of 13 predictors to be significant, whereas the univariate
# regressions found 12 of 13 significant, i.e. considerably fewer predictors are significant.
         Univariate_coef  Multivariate_coef
zn             -0.073935           0.044855
indus           0.509776          -0.063855
chas           -1.892777          -0.749134
nox            31.248531         -10.313535
rm             -2.684051           0.430131
age             0.107786           0.001452
dis            -1.550902          -0.987176
rad             0.617911           0.588209
tax             0.029742          -0.003780
ptratio         1.151983          -0.271081
black          -0.036280          -0.007538
lstat           0.548805           0.126211
medv           -0.363160          -0.198887
../_images/extension-limitation_70_1.png

6. Is there evidence of nonlinear association between any of the predictors and the response? To answer this question, for each predictor \(X\), fit a model of the form

\[ y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon. \]

Hint: See section 2.3.3.2.

# Write your code below to answer the question

Compare your answer with the reference solution below

models_d = [
    ols(
        formula="crim ~ {0} + np.power({0}, 2) + np.power({0}, 3)".format(f),
        data=boston_df,
    ).fit()
    for f in boston_df.columns[1:]
]

for model in models_d:
    print(model.summary().tables[1])


print("\nFeatures with p < 0.05\n")

sig = pd.concat([model.pvalues[model.pvalues < 0.05] for model in models_d])

print(pd.DataFrame({"P>|t|": sig.drop("Intercept")}))

# There is evidence of a nonlinear association between the response and indus, nox, age, dis, ptratio, and medv.
# There is evidence of a linear association between the response and zn.
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           4.8461      0.433     11.192      0.000       3.995       5.697
zn                 -0.3322      0.110     -3.025      0.003      -0.548      -0.116
np.power(zn, 2)     0.0065      0.004      1.679      0.094      -0.001       0.014
np.power(zn, 3) -3.776e-05   3.14e-05     -1.203      0.230   -9.94e-05    2.39e-05
===================================================================================
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              3.6626      1.574      2.327      0.020       0.570       6.755
indus                 -1.9652      0.482     -4.077      0.000      -2.912      -1.018
np.power(indus, 2)     0.2519      0.039      6.407      0.000       0.175       0.329
np.power(indus, 3)    -0.0070      0.001     -7.292      0.000      -0.009      -0.005
======================================================================================
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             3.7444      0.396      9.453      0.000       2.966       4.523
chas                 -0.6309      0.502     -1.257      0.209      -1.617       0.355
np.power(chas, 2)    -0.6309      0.502     -1.257      0.209      -1.617       0.355
np.power(chas, 3)    -0.6309      0.502     -1.257      0.209      -1.617       0.355
=====================================================================================
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept          233.0866     33.643      6.928      0.000     166.988     299.185
nox              -1279.3713    170.397     -7.508      0.000   -1614.151    -944.591
np.power(nox, 2)  2248.5441    279.899      8.033      0.000    1698.626    2798.462
np.power(nox, 3) -1245.7029    149.282     -8.345      0.000   -1538.997    -952.409
====================================================================================
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         112.6246     64.517      1.746      0.081     -14.132     239.382
rm                -39.1501     31.311     -1.250      0.212    -100.668      22.368
np.power(rm, 2)     4.5509      5.010      0.908      0.364      -5.292      14.394
np.power(rm, 3)    -0.1745      0.264     -0.662      0.509      -0.693       0.344
===================================================================================
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           -2.5488      2.769     -0.920      0.358      -7.989       2.892
age                  0.2737      0.186      1.468      0.143      -0.093       0.640
np.power(age, 2)    -0.0072      0.004     -1.988      0.047      -0.014    -8.4e-05
np.power(age, 3)  5.745e-05   2.11e-05      2.724      0.007     1.6e-05    9.89e-05
====================================================================================
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           30.0476      2.446     12.285      0.000      25.242      34.853
dis                -15.5544      1.736     -8.960      0.000     -18.965     -12.144
np.power(dis, 2)     2.4521      0.346      7.078      0.000       1.771       3.133
np.power(dis, 3)    -0.1186      0.020     -5.814      0.000      -0.159      -0.079
====================================================================================
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           -0.6055      2.050     -0.295      0.768      -4.633       3.422
rad                  0.5127      1.044      0.491      0.623      -1.538       2.563
np.power(rad, 2)    -0.0752      0.149     -0.506      0.613      -0.367       0.217
np.power(rad, 3)     0.0032      0.005      0.703      0.482      -0.006       0.012
====================================================================================
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           19.1836     11.796      1.626      0.105      -3.991      42.358
tax                 -0.1533      0.096     -1.602      0.110      -0.341       0.035
np.power(tax, 2)     0.0004      0.000      1.488      0.137      -0.000       0.001
np.power(tax, 3) -2.204e-07   1.89e-07     -1.167      0.244   -5.91e-07    1.51e-07
====================================================================================
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept              477.1840    156.795      3.043      0.002     169.129     785.239
ptratio                -82.3605     27.644     -2.979      0.003    -136.673     -28.048
np.power(ptratio, 2)     4.6353      1.608      2.882      0.004       1.475       7.795
np.power(ptratio, 3)    -0.0848      0.031     -2.743      0.006      -0.145      -0.024
========================================================================================
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             18.2637      2.305      7.924      0.000      13.735      22.792
black                 -0.0836      0.056     -1.483      0.139      -0.194       0.027
np.power(black, 2)     0.0002      0.000      0.716      0.474      -0.000       0.001
np.power(black, 3) -2.652e-07   4.36e-07     -0.608      0.544   -1.12e-06    5.92e-07
======================================================================================
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              1.2010      2.029      0.592      0.554      -2.785       5.187
lstat                 -0.4491      0.465     -0.966      0.335      -1.362       0.464
np.power(lstat, 2)     0.0558      0.030      1.852      0.065      -0.003       0.115
np.power(lstat, 3)    -0.0009      0.001     -1.517      0.130      -0.002       0.000
======================================================================================
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            53.1655      3.356     15.840      0.000      46.571      59.760
medv                 -5.0948      0.434    -11.744      0.000      -5.947      -4.242
np.power(medv, 2)     0.1555      0.017      9.046      0.000       0.122       0.189
np.power(medv, 3)    -0.0015      0.000     -7.312      0.000      -0.002      -0.001
=====================================================================================

Features with p < 0.05

                             P>|t|
zn                    2.612296e-03
indus                 5.297064e-05
np.power(indus, 2)    3.420187e-10
np.power(indus, 3)    1.196405e-12
nox                   2.758372e-13
np.power(nox, 2)      6.811300e-15
np.power(nox, 3)      6.961110e-16
np.power(age, 2)      4.737733e-02
np.power(age, 3)      6.679915e-03
dis                   6.374792e-18
np.power(dis, 2)      4.941214e-12
np.power(dis, 3)      1.088832e-08
ptratio               3.028663e-03
np.power(ptratio, 2)  4.119552e-03
np.power(ptratio, 3)  6.300514e-03
medv                  2.637707e-28
np.power(medv, 2)     3.260523e-18
np.power(medv, 3)     1.046510e-12

7. Compare the differences between a linear regression model and a quadratic regression model for regressing crim onto ptratio. Does the quadratic term improve the fit to the data? Hint: See section 2.3.3.3.

# Write your code below to answer the question

Compare your answer with the reference solution below

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns

regr = LinearRegression()

# Linear fit
X = boston_df.ptratio.values.reshape(-1, 1)
y = boston_df.crim
regr.fit(X, y)

boston_df["pred1"] = regr.predict(X)
boston_df["resid1"] = boston_df.crim - boston_df.pred1


boston_df["ptratio2"] = boston_df.ptratio**2
# Quadratic fit
X2 = boston_df[["ptratio", "ptratio2"]].values
regr.fit(X2, y)

boston_df["pred2"] = regr.predict(X2)
boston_df["resid2"] = boston_df.crim - boston_df.pred2


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left plot
sns.regplot(
    x=boston_df.pred1,
    y=boston_df.resid1,
    lowess=True,
    ax=ax1,
    line_kws={"color": "r", "lw": 1},
    scatter_kws={"facecolors": "None", "edgecolors": "k", "alpha": 0.5},
)
ax1.hlines(
    0,
    xmin=ax1.xaxis.get_data_interval()[0],
    xmax=ax1.xaxis.get_data_interval()[1],
    linestyles="dotted",
)
ax1.set_title("Residual Plot for Linear Fit")

# Right plot
sns.regplot(
    x=boston_df.pred2,
    y=boston_df.resid2,
    lowess=True,
    line_kws={"color": "r", "lw": 1},
    ax=ax2,
    scatter_kws={"facecolors": "None", "edgecolors": "k", "alpha": 0.5},
)
ax2.hlines(
    0,
    xmin=ax2.xaxis.get_data_interval()[0],
    xmax=ax2.xaxis.get_data_interval()[1],
    linestyles="dotted",
)
ax2.set_title("Residual Plot for Quadratic Fit")

for ax in fig.axes:
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")

# The left plot displays the residual plot from the linear regression of crim onto ptratio on the Boston dataset.
# The red line is a smooth fit to the residuals, intended to reveal any trend.
# Here, the residuals exhibit a clear U-shape, which is a strong indication of nonlinearity in the data.

# The right panel displays the residual plot from the model that contains a quadratic term.
# There appears to be little pattern in the residuals, suggesting that the quadratic term improves the fit to the data.
../_images/extension-limitation_78_0.png