An Introduction to Statistical Learning with Applications in Python (Gareth James et al., 2023, 1st edition)

My answers to the exercises in ISLP

Answers
Books
Author

Natnael Getahun

Published

February 21, 2026

2.4 Exercises

Conceptual

1.

  (a) More flexible will be better.

Justification: as the number of observations increases, we can be less concerned about variance, since new data points won’t change the estimated f by much. This lets us focus on decreasing bias, which favors more flexible statistical learning methods.

  (b) Less flexible will be better.

Justification: with a large number of predictors and a small number of observations, new data points can change our estimated f by a lot, forcing us to be more careful about the effects of variance. We can’t afford flexibility, as it will come with high variance even if it decreases bias.

  (c) More flexible will be better.

Justification: with a highly non-linear underlying model, our concern is that the estimated model will be far from it, i.e., large bias. We can decrease bias by using more flexible methods.

  (d) Less flexible will be better.

Justification: with high variance of the error terms, a highly flexible statistical learning method risks following the noise rather than the pattern in the data. New data points would be able to change the estimated f significantly. We can reduce this by using less flexible methods.

2.

  (a) regression, inference
  (b) classification, prediction
  (c) regression, prediction

3.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.1, 10, 100)
bias_sq = 10 * np.exp(-0.5 * x) 
variance = 0.05 * (x**2)
bayes_error = np.full_like(x, 1.5)
training_error = 8 * np.exp(-0.6 * x) + 0.5
test_error = bias_sq + variance + bayes_error

plt.figure(figsize=(10, 6))
plt.plot(x, bias_sq, label='Squared Bias', color='#e74c3c', lw=2)
plt.plot(x, variance, label='Variance', color='#f1c40f', lw=2)
plt.plot(x, training_error, label='Training Error', color='#2ecc71', lw=2)
plt.plot(x, test_error, label='Test Error (MSE)', color='#3498db', lw=3)
plt.axhline(y=1.5, color='black', linestyle=':', label='Bayes Error', alpha=0.6)

optimal_x = x[np.argmin(test_error)]
plt.axvline(x=optimal_x, color='gray', linestyle='--', alpha=0.4, label='Optimal Flexibility')

plt.title('Bias-Variance Trade-off', fontsize=14)
plt.xlabel(r'Model Flexibility (High $\rightarrow$ Overfitting)', fontsize=12)
plt.ylabel('Error / Loss', fontsize=12)
plt.ylim(0, 12)
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=3)
plt.grid(alpha=0.2)
plt.tight_layout()

plt.show()

  • Squared bias: as flexibility increases, the estimated function has a better chance of approximating the underlying function, reducing bias.
  • Variance: as flexibility increases, the function starts mirroring the noise in the data rather than the underlying pattern. This makes it sensitive to changes in the data, increasing variance.
  • Bayes (irreducible) error: this is the limiting factor. It can be error caused by unaccounted-for predictors or by underlying, unexplainable noise. The test error cannot drop below this line.
  • Training error: as flexibility increases, the statistical method fits the training data better, whether or not it generalizes well to the test set.
  • Test error: this measures the accuracy of the model on previously unseen data. It is the sum of the Bayes error, variance, and squared bias. As flexibility increases from none to its optimal value (where the sum of variance and squared bias is lowest), the test error drops. Beyond this optimal value, however, more flexibility increases the variance significantly and the test error rises again.
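Since the Bayes error curve is flat, the flexibility that minimizes the test error is exactly where squared bias plus variance bottoms out. A quick numerical check, reusing the same illustrative curves as the plot above:

```python
import numpy as np

# Same illustrative curves as in the plot above.
x = np.linspace(0.1, 10, 100)
bias_sq = 10 * np.exp(-0.5 * x)
variance = 0.05 * (x ** 2)
bayes_error = np.full_like(x, 1.5)
test_error = bias_sq + variance + bayes_error

# Adding a constant (the Bayes error) does not move the minimizer,
# and the test error can never drop below the Bayes error.
assert np.argmin(test_error) == np.argmin(bias_sq + variance)
assert test_error.min() > 1.5
print(f'Optimal flexibility: {x[np.argmin(test_error)]:.2f}')
```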

4.

  (a) Classification:
      i. Spam email detection, response = spam or not spam, predictors = words and phrases used in the email, goal = prediction
      ii. Investment portfolio recommendation, response = selection from predefined portfolios, predictors = income, investment goal, age, etc., goal = inference
      iii. Next word prediction, response = probability of the next word, predictors = previous words, goal = prediction
  (b) Regression:
      i. Stock market price movement prediction, response = future price, predictors = past closing prices, volume, moving averages, etc., goal = prediction
      ii. Crop yield optimization, response = crop yield in a given time, predictors = type of seeds used, soil type, time of the year, etc., goal = inference
      iii. Factors affecting employee salary, response = current employee salary, predictors = starting salary, education level, age, etc., goal = inference
  (c) Cluster analysis:
      i. Optimal health facility location
      ii. Study of species spread
      iii. Clustering based on attributes for better sampling

5.

Advantages: a very flexible approach can explain the data better when flexibility is needed. Especially if the relationships are non-linear and rugged, a flexible approach can be useful. Flexible approaches result in models with lower bias.

Disadvantages: it can easily overfit the data, failing to generalize to unseen samples. It usually needs larger amounts of data than its less flexible alternatives to work well.

A less flexible approach will be preferred when there is less data, and in cases where newly introduced data could otherwise change the model significantly.

6.

Parametric statistical learning approaches focus on finding parameters that minimize certain loss functions (or maximize the likelihood/conditional probability of y given x). In other words, parametric methods assume a functional form. This makes parametric learning approaches simpler to build and easier to interpret. In cases where parameters are associated with predictors, we can infer and explain the effect of the predictors on the response variable. Non-parametric approaches don’t rely on finding values of parameters. This makes them harder to understand and difficult to interpret: they are primarily used for prediction, since inference is almost impossible.
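A minimal synthetic sketch of the contrast: a parametric method (a least-squares fit of an assumed linear form) versus a non-parametric one (a small hand-rolled KNN regression). The data and the `knn_predict` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + rng.normal(0, 0.1, 200)   # non-linear truth plus noise

# Parametric: assume y = b0 + b1*x and estimate just two parameters.
b1, b0 = np.polyfit(x, y, 1)
linear_pred = b0 + b1 * x

# Non-parametric: KNN regression, no functional form assumed.
def knn_predict(x0, k=10):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

knn_pred = np.array([knn_predict(x0) for x0 in x])

mse_linear = np.mean((y - linear_pred) ** 2)
mse_knn = np.mean((y - knn_pred) ** 2)
print(f'linear MSE: {mse_linear:.3f}, KNN MSE: {mse_knn:.3f}')
```

On this curved data the non-parametric fit wins on training MSE, but the two coefficients of the linear fit are directly interpretable, while the KNN fit offers no parameters to inspect.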

7.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X1' : [0, 2, 0, 0, -1, 1],
    'X2' : [3, 0, 1, 1, 0, 1],
    'X3' : [0, 0, 3, 2, 1, 1],
    'Y' : ['Red', 'Red', 'Red', 'Green', 'Green', 'Red']
    })
df
X1 X2 X3 Y
0 0 3 0 Red
1 2 0 0 Red
2 0 1 3 Red
3 0 1 2 Green
4 -1 0 1 Green
5 1 1 1 Red
x0 = np.zeros(3)
distance_list = list()
for i in df.index:
    squared_norm = np.square(x0 - df.iloc[i, :-1]).sum()
    euclidean = np.sqrt(squared_norm)
    distance_list.append({
        'obs' : i,
        'distance' : euclidean,
        'Y' : df.iloc[i, -1]
        })
distance_df = pd.DataFrame(distance_list)
distance_df
obs distance Y
0 0 3.000000 Red
1 1 2.000000 Red
2 2 3.162278 Red
3 3 2.236068 Green
4 4 1.414214 Green
5 5 1.732051 Red
def simple_knn(k=1):
    df = distance_df.sort_values(by='distance', ascending=True)
    df_of_interest = df.iloc[:k]
    return df_of_interest['Y'].mode().item()
prediction = simple_knn() 
print(f'For K = 1 our prediction is: {prediction}')
For K = 1 our prediction is: Green
prediction = simple_knn(k=3)
print(f'For K = 3 our prediction is: {prediction}')
For K = 3 our prediction is: Red

We need to choose K so that the KNN method becomes flexible enough to capture the non-linearity of the Bayes decision boundary. The lower the K, the more flexible our method becomes. We would expect the best value for K to be small.
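The distance table above makes this easy to check; a sketch of how the vote changes with K (neighbors hard-coded from the computed distances, ties broken by first occurrence):

```python
from collections import Counter

# (distance, label) pairs from the table above, sorted by distance.
neighbors = [(1.414, 'Green'), (1.732, 'Red'), (2.000, 'Red'),
             (2.236, 'Green'), (3.000, 'Red'), (3.162, 'Red')]

votes = {}
for k in (1, 3, 5):
    labels = [label for _, label in neighbors[:k]]
    votes[k] = Counter(labels).most_common(1)[0][0]
    print(f'K = {k}: {votes[k]}')
# A single nearby Green decides K = 1; larger K smooths toward the Red majority.
```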

Applied

8.

import pandas as pd
college = pd.read_csv('data/College.csv')
college
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
1 Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
2 Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
3 Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
4 Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
772 Worcester State College No 2197 1515 543 4 26 3089 2029 6797 3900 500 1200 60 60 21.0 14 4469 40
773 Xavier University Yes 1959 1805 695 24 47 2849 1107 11520 4960 600 1250 73 75 13.3 31 9189 83
774 Xavier University of Louisiana Yes 2097 1915 695 34 61 2793 166 6900 4200 617 781 67 75 14.4 20 8323 49
775 Yale University Yes 10705 2453 1317 95 99 5217 83 19840 6510 630 2115 96 96 5.8 49 40386 99
776 York College of Pennsylvania Yes 2989 1855 691 28 63 2988 1726 4990 3560 500 1250 75 75 18.1 28 4509 99

777 rows × 19 columns

college2 = pd.read_csv('data/College.csv', index_col=0)
college3 = college.rename({'Unnamed: 0': 'College'}, axis = 1)
college3 = college3.set_index('College')
college2
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Worcester State College No 2197 1515 543 4 26 3089 2029 6797 3900 500 1200 60 60 21.0 14 4469 40
Xavier University Yes 1959 1805 695 24 47 2849 1107 11520 4960 600 1250 73 75 13.3 31 9189 83
Xavier University of Louisiana Yes 2097 1915 695 34 61 2793 166 6900 4200 617 781 67 75 14.4 20 8323 49
Yale University Yes 10705 2453 1317 95 99 5217 83 19840 6510 630 2115 96 96 5.8 49 40386 99
York College of Pennsylvania Yes 2989 1855 691 28 63 2988 1726 4990 3560 500 1250 75 75 18.1 28 4509 99

777 rows × 18 columns

college3
Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
College
Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Worcester State College No 2197 1515 543 4 26 3089 2029 6797 3900 500 1200 60 60 21.0 14 4469 40
Xavier University Yes 1959 1805 695 24 47 2849 1107 11520 4960 600 1250 73 75 13.3 31 9189 83
Xavier University of Louisiana Yes 2097 1915 695 34 61 2793 166 6900 4200 617 781 67 75 14.4 20 8323 49
Yale University Yes 10705 2453 1317 95 99 5217 83 19840 6510 630 2115 96 96 5.8 49 40386 99
York College of Pennsylvania Yes 2989 1855 691 28 63 2988 1726 4990 3560 500 1250 75 75 18.1 28 4509 99

777 rows × 18 columns

college = college3
college.describe()
Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
count 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.00000
mean 3001.638353 2018.804376 779.972973 27.558559 55.796654 3699.907336 855.298584 10440.669241 4357.526384 549.380952 1340.642214 72.660232 79.702703 14.089704 22.743887 9660.171171 65.46332
std 3870.201484 2451.113971 929.176190 17.640364 19.804778 4850.420531 1522.431887 4023.016484 1096.696416 165.105360 677.071454 16.328155 14.722359 3.958349 12.391801 5221.768440 17.17771
min 81.000000 72.000000 35.000000 1.000000 9.000000 139.000000 1.000000 2340.000000 1780.000000 96.000000 250.000000 8.000000 24.000000 2.500000 0.000000 3186.000000 10.00000
25% 776.000000 604.000000 242.000000 15.000000 41.000000 992.000000 95.000000 7320.000000 3597.000000 470.000000 850.000000 62.000000 71.000000 11.500000 13.000000 6751.000000 53.00000
50% 1558.000000 1110.000000 434.000000 23.000000 54.000000 1707.000000 353.000000 9990.000000 4200.000000 500.000000 1200.000000 75.000000 82.000000 13.600000 21.000000 8377.000000 65.00000
75% 3624.000000 2424.000000 902.000000 35.000000 69.000000 4005.000000 967.000000 12925.000000 5050.000000 600.000000 1700.000000 85.000000 92.000000 16.500000 31.000000 10830.000000 78.00000
max 48094.000000 26330.000000 6392.000000 96.000000 100.000000 31643.000000 21836.000000 21700.000000 8124.000000 2340.000000 6800.000000 103.000000 100.000000 39.800000 64.000000 56233.000000 118.00000
pd.plotting.scatter_matrix(college[['Top10perc', 'Apps', 'Enroll']])
array([[<Axes: xlabel='Top10perc', ylabel='Top10perc'>,
        <Axes: xlabel='Apps', ylabel='Top10perc'>,
        <Axes: xlabel='Enroll', ylabel='Top10perc'>],
       [<Axes: xlabel='Top10perc', ylabel='Apps'>,
        <Axes: xlabel='Apps', ylabel='Apps'>,
        <Axes: xlabel='Enroll', ylabel='Apps'>],
       [<Axes: xlabel='Top10perc', ylabel='Enroll'>,
        <Axes: xlabel='Apps', ylabel='Enroll'>,
        <Axes: xlabel='Enroll', ylabel='Enroll'>]], dtype=object)

college.boxplot('Outstate','Private')

college['Elite'] = pd.cut(college['Top10perc'], [0, 50, 100], labels = ['No', 'Yes'])
print(college['Elite'].value_counts())
college.boxplot('Outstate', 'Elite')
Elite
No     699
Yes     78
Name: count, dtype: int64
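Note that `Top10perc` is on a 0-100 scale, so the bin edges passed to `pd.cut` must be `[0, 50, 100]`; with edges like `[0, 0.5, 1]`, nearly every value falls outside the outermost bins and gets labeled NaN. A quick sketch with made-up values:

```python
import pandas as pd

pct = pd.Series([12, 49, 50, 88])   # Top10perc-style values on a 0-100 scale
bad = pd.cut(pct, [0, 0.5, 1], labels=['No', 'Yes'])    # everything outside (0, 1]
good = pd.cut(pct, [0, 50, 100], labels=['No', 'Yes'])  # bins (0, 50] and (50, 100]
print(bad.tolist())
print(good.tolist())
```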

import matplotlib.pyplot as plt
fig, ax = plt.subplots(2, 2)
college.hist('Enroll', bins = 3, ax=ax[0,0])
college.hist('PhD', bins = 5, ax=ax[0,1])
college.hist('Books', bins = 8, ax=ax[1,0])
college.hist('Expend', bins = 14,  ax = ax[1,1])
array([<Axes: title={'center': 'Expend'}>], dtype=object)

9.

import pandas as pd

auto = pd.read_csv('data/Auto.csv')
auto.head()
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           397 non-null    float64
 1   cylinders     397 non-null    int64  
 2   displacement  397 non-null    float64
 3   horsepower    397 non-null    object 
 4   weight        397 non-null    int64  
 5   acceleration  397 non-null    float64
 6   year          397 non-null    int64  
 7   origin        397 non-null    int64  
 8   name          397 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
auto.nunique().sort_values(ascending=False)
weight          350
name            304
mpg             129
acceleration     95
horsepower       94
displacement     82
year             13
cylinders         5
origin            3
dtype: int64
print('Name, year, cylinders, and origin can be seen as categorical variables. \n"Year", however, can be treated as discrete numerical data. \n"Horsepower" has its type misclassified.')
Name, year, cylinders, and origin can be seen as categorical variables. 
"Year", however, can be treated as discrete numerical data. 
"Horsepower" has its type misclassified.
auto['horsepower'] = pd.to_numeric(auto['horsepower'], errors='coerce')
auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           397 non-null    float64
 1   cylinders     397 non-null    int64  
 2   displacement  397 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        397 non-null    int64  
 5   acceleration  397 non-null    float64
 6   year          397 non-null    int64  
 7   origin        397 non-null    int64  
 8   name          397 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 28.0+ KB
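`Auto.csv` marks missing horsepower values with `'?'`, and `errors='coerce'` silently turns any non-numeric entry into NaN rather than raising. In isolation:

```python
import pandas as pd

raw = pd.Series(['130', '165', '?', '150'])   # '?' mimics the missing-value marker
hp = pd.to_numeric(raw, errors='coerce')
print(hp.tolist())                            # the '?' becomes NaN
```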
for col in auto.columns:
    if 'int' in str(auto[col].dtype) or 'float' in str(auto[col].dtype):
        col_range = auto[col].max() - auto[col].min()  # avoid shadowing the built-in range
        print(col, '_range = ', col_range, sep='')
mpg_range = 37.6
cylinders_range = 5
displacement_range = 387.0
horsepower_range = 184.0
weight_range = 3527
acceleration_range = 16.8
year_range = 12
origin_range = 2
for col in auto.columns:
    if 'int' in str(auto[col].dtype) or 'float' in str(auto[col].dtype):
        col_std = auto[col].std().round(2)
        col_mean = auto[col].mean().round(2)
        print(f'{col}_std = {col_std}, {col}_mean = {col_mean}')
mpg_std = 7.83, mpg_mean = 23.52
cylinders_std = 1.7, cylinders_mean = 5.46
displacement_std = 104.38, displacement_mean = 193.53
horsepower_std = 38.49, horsepower_mean = 104.47
weight_std = 847.9, weight_mean = 2970.26
acceleration_std = 2.75, acceleration_mean = 15.56
year_std = 3.69, year_mean = 75.99
origin_std = 0.8, origin_mean = 1.57
auto_truncated = auto.drop(auto.iloc[9:85].index)

for col in auto_truncated.columns:
    if 'int' in str(auto_truncated[col].dtype) or 'float' in str(auto_truncated[col].dtype):
        col_std = auto_truncated[col].std().round(2)
        col_mean = auto_truncated[col].mean().round(2)
        print(f'{col}_std_truncated = {col_std}, {col}_mean_truncated = {col_mean}')
mpg_std_truncated = 7.91, mpg_mean_truncated = 24.44
cylinders_std_truncated = 1.65, cylinders_mean_truncated = 5.37
displacement_std_truncated = 99.64, displacement_mean_truncated = 187.05
horsepower_std_truncated = 35.9, horsepower_mean_truncated = 100.96
weight_std_truncated = 810.64, weight_mean_truncated = 2933.96
acceleration_std_truncated = 2.68, acceleration_mean_truncated = 15.72
year_std_truncated = 3.11, year_mean_truncated = 77.15
origin_std_truncated = 0.82, origin_mean_truncated = 1.6
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(auto, height=1, plot_kws= {'linewidth':0, 's':5})

sns.heatmap(auto.corr(numeric_only=True), annot=True)

print('Notice while mpg is negatively correlated with most variables, a non-linear fit might better explain the data.\nIf we are using learning methods that make assumptions on the variance of the error term, we need to be careful about multicollinearity.')
Notice while mpg is negatively correlated with most variables, a non-linear fit might better explain the data.
If we are using learning methods that make assumptions on the variance of the error term, we need to be careful about multicollinearity.
print('We see strong correlations between "mpg" and other variables, whether positive or negative. \nWhile our correlation mapping suggests that a linear regression might do well, \nour scatter plots suggest we will benefit from exploring more flexible methods that can follow the slight non-linearity.')
We see strong correlations between "mpg" and other variables, whether positive or negative. 
While our correlation mapping suggests that a linear regression might do well, 
our scatter plots suggest we will benefit from exploring more flexible methods that can follow the slight non-linearity.

10.

from ISLP import load_data
boston = load_data('Boston')
boston.head()
crim zn indus chas nox rm age dis rad tax ptratio lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
n_row = boston.shape[0]
n_col = boston.shape[1]
cols = ''
for col in boston.columns:
    cols += col + ' '
print(f'There are {n_row} rows.')
print(f'There are {n_col} cols.')
print(f'The rows represent each instance/observation.\nThe columns represent the following variables: {cols}.')
There are 506 rows.
There are 13 cols.
The rows represent each instance/observation.
The columns represent the following variables: crim zn indus chas nox rm age dis rad tax ptratio lstat medv .
sns.pairplot(boston, height=1, plot_kws= {'linewidth':0, 's':5})
print('We can suspect "rad", "tax", "ptratio" variables have been capped or had their missing values filled or have a strong "mode".\nTransformation of per capita crime rate can help out our model.\nTransformation of other variables as well might not be a bad idea.\nMight need to be careful with multicollinearity.')
We can suspect "rad", "tax", "ptratio" variables have been capped or had their missing values filled or have a strong "mode".
Transformation of per capita crime rate can help out our model.
Transformation of other variables as well might not be a bad idea.
Might need to be careful with multicollinearity.

sns.heatmap(boston.corr(numeric_only=True), annot=True)
print('Many variables promise to be associated with per capita crime rate. \n"nox", "age", and "lstat" appear to be positively correlated, \nwhile we should be careful about claiming "rad" and "tax" as good predictors, since the Pearson correlations and our scatter plots tell different stories.\n"medv", "dis", and "rm" appear to be negatively correlated with per capita crime rate.')
Many variables promise to be associated with per capita crime rate. 
"nox", "age", and "lstat" appear to be positively correlated, 
while we should be careful about claiming "rad" and "tax" as good predictors, since the Pearson correlations and our scatter plots tell different stories.
"medv", "dis", and "rm" appear to be negatively correlated with per capita crime rate.

boston.sort_values('crim', ascending=False)['crim']
380    88.97620
418    73.53410
405    67.92080
410    51.13580
414    45.74610
         ...   
55      0.01311
341     0.01301
285     0.01096
284     0.00906
0       0.00632
Name: crim, Length: 506, dtype: float64
boston.sort_values('tax', ascending=False)['tax']
492    711
491    711
489    711
490    711
488    711
      ... 
121    188
120    188
124    188
126    188
353    187
Name: tax, Length: 506, dtype: int64
boston.sort_values('ptratio', ascending=False)['ptratio']
355    22.0
354    22.0
129    21.2
128    21.2
133    21.2
       ... 
263    13.0
257    13.0
198    12.6
197    12.6
196    12.6
Name: ptratio, Length: 506, dtype: float64
for col in boston.columns:
    if 'int' in str(boston[col].dtype) or 'float' in str(boston[col].dtype):
        col_range = boston[col].max() - boston[col].min()  # avoid shadowing the built-in range
        print(col, '_range = ', col_range, sep='')
crim_range = 88.96988
zn_range = 100.0
indus_range = 27.279999999999998
chas_range = 1
nox_range = 0.486
rm_range = 5.218999999999999
age_range = 97.1
dis_range = 10.9969
rad_range = 23
tax_range = 524
ptratio_range = 9.4
lstat_range = 36.24
medv_range = 45.0
boston.value_counts('chas')
chas
0    471
1     35
Name: count, dtype: int64
int(boston['ptratio'].median())
19
min_medv = boston.medv.min()
boston[boston['medv'] == min_medv]
crim zn indus chas nox rm age dis rad tax ptratio lstat medv
398 38.3518 0.0 18.1 0 0.693 5.453 100.0 1.4896 24 666 20.2 30.59 5.0
405 67.9208 0.0 18.1 0 0.693 5.683 100.0 1.4254 24 666 20.2 22.98 5.0
len(boston[boston['rm'] > 7].index)
64
len(boston[boston['rm'] > 8].index)
13

3.7 Exercises

Conceptual

1.

Intercept: \[\begin{aligned} H_0 &: \beta_0 = 0 \\ H_1 &: \beta_0 \neq 0 \end{aligned}\] Decision: Reject the null hypothesis. The p-value suggests that the intercept is significant. When all advertising budgets are zero, there will be \(2.939 \times 1000\) units of sales.

TV: \[\begin{aligned} H_0 &: \beta_1 = 0 \\ H_1 &: \beta_1 \neq 0 \end{aligned}\] Decision: Reject the null hypothesis. The p-value suggests TV advertising is a significant predictor of sales. Holding all other forms of advertising constant, a one unit increase in TV advertising results in an average increase of sales by \(0.046 \times 1000\) units.

Radio: \[\begin{aligned} H_0 &: \beta_2 = 0 \\ H_1 &: \beta_2 \neq 0 \end{aligned}\] Decision: Reject the null hypothesis. The p-value suggests radio advertising is a significant predictor of sales. Holding all other forms of advertising constant, a one unit increase in radio advertising results in an average increase in sales of \(0.189 \times 1000\) units.

Newspaper: \[\begin{aligned} H_0 &: \beta_3 =0 \\ H_1 &: \beta_3 \neq 0 \end{aligned}\] Decision: Fail to reject the null hypothesis. The p-value suggests newspaper advertising is not a significant predictor of sales at the 95% confidence level. There is no point in interpreting a non-significant predictor’s coefficient.

2.

They are both based on the k nearest observations, assigning a value to a new point based on the values of the observations nearest to it, yet the KNN classifier can only assign one of the provided discrete categorical labels to the newcomer. In contrast, KNN regression can assign any value along a continuous scale, as long as it is the mean of the k nearest observations (assuming we are using the mean).

KNN regression uses the mean to assign a value to the newcomer. The KNN classifier uses conditional probability (the mode).

Of course, the KNN classifier and KNN regression also differ in the common ways regression differs from classification. Classification produces decision boundaries, while regression produces curves. Classification is evaluated with precision, recall, etc. Regression is evaluated with RMSE, MAE, etc.
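A tiny sketch of that difference, with a made-up set of five nearest-neighbor responses:

```python
from collections import Counter
import numpy as np

# Hypothetical responses of the same K = 5 nearest neighbors,
# once numeric (regression) and once categorical (classification).
neighbor_y_numeric = np.array([3.1, 2.8, 3.4, 2.9, 3.0])
neighbor_y_class = ['Red', 'Green', 'Red', 'Red', 'Green']

knn_regression = neighbor_y_numeric.mean()                       # average
knn_classifier = Counter(neighbor_y_class).most_common(1)[0][0]  # majority vote

print(knn_regression, knn_classifier)
```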

3.

  (a) The correct answer: iii.

Reason: for fixed values of IQ and GPA, the average starting salary (in thousands) changes by \(35 + (-10 \times \text{GPA})\) for college graduates, and by 0 for high school graduates. If the fixed GPA is below 3.5, college graduates earn more, on average, than high school graduates. If the fixed GPA is exactly 3.5, both groups earn the same on average. If the fixed GPA is above 3.5, college graduates earn less, on average, than high school graduates.

  (b) \[\begin{aligned} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 \text{GPA} + \hat{\beta}_2 \text{IQ} + \hat{\beta}_3 \text{Level} + \hat{\beta}_4 (\text{IQ} \times \text{GPA}) + \hat{\beta}_5 (\text{GPA} \times \text{Level}) \\ &= 30 + 20(4.0) + 0.07(110) + 35(1) + 0.01(110 \times 4.0) - 10(4.0 \times 1) \\ &= 30 + 80 + 7.7 + 35 + 4.4 - 40 \\ &= 117.1 \text{ (in thousands)} \end{aligned}\]

  (c) False.

Reason: We can’t deduce about the presence of an interaction effect using only the coefficient values. We need to test significance of the coefficient properly. If the standard error of the coefficient is really small, even small coefficient values can be deemed significant.

4.

For the training set, I expect the RSS of the cubic regression to be lower. The cubic regression has more flexibility, and with a truly linear model underneath, the cubic and quadratic terms will most likely start following the noise in the data rather than the true pattern. This moves some of the error that would have been in the RSS into the ESS (explained sum of squares, the variation explained by the regression), thus reducing the RSS.

I expect the opposite for the test set. The linear model is most likely to result in a lower RSS than the cubic regression. This comes as a consequence of the cubic regression following the noise of the training data rather than the linearity of the underlying model.

Even though one can argue that we don’t have enough information since we don’t know the actual form of non-linearity in the underlying model, I think there is enough information for an educated guess. As a result of its flexibility, we can expect the cubic model to have a better chance of approximating the non-linearity in the underlying pattern, resulting in a lower training RSS.

Same answer as (c): expected lower test RSS for the cubic regression.
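Parts (a) and (b) can be checked by simulation. Below is a minimal sketch with made-up linear data: because the cubic model nests the linear one, its training RSS can never be higher; the test comparison would need fresh data and typically favors the linear fit.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 100)
y = 1 + 2 * x + rng.normal(0, 1, 100)   # truly linear underlying model

def train_rss(deg):
    coefs = np.polyfit(x, y, deg)
    return np.sum((y - np.polyval(coefs, x)) ** 2)

# The cubic fit can always reproduce the linear one (set the extra
# coefficients to zero), so its training RSS is at most the linear RSS.
print(f'train RSS linear: {train_rss(1):.2f}, cubic: {train_rss(3):.2f}')
```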

5.

\[\begin{aligned} \hat{y}_i &= x_i \hat{\beta} \\ \hat{\beta} &= \frac{\sum_{j=1}^n x_j y_j}{\sum_{k=1}^n x_k^2} \quad \text{(with indexing changed to avoid confusion)} \end{aligned}\] \[\begin{aligned} \hat{y}_i &= x_i \cdot \frac{\sum_{j=1}^n x_j y_j}{\sum_{k=1}^n x_k^2} \\ &= \frac{\sum_{j=1}^n x_i x_j y_j}{\sum_{k=1}^n x_k^2}\\ &= \sum_{i'=1}^n \left(\frac{x_i x_{i'}}{\sum_{k=1}^n x_k^2}\right) \cdot y_{i'}\\ &= \sum_{i'=1}^n a_{i'}y_{i'} \quad \text{where } a_{i'} = \frac{x_i x_{i'}}{\sum_{k=1}^n x_k^2} \end{aligned}\]
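The weights \(a_{i'}\) can be verified numerically; a sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2 * x + rng.normal(size=20)

beta_hat = (x @ y) / (x @ x)       # least squares without an intercept
i = 5                              # any observation works
a = x[i] * x / (x @ x)             # the weights a_{i'} from the derivation
assert np.isclose(x[i] * beta_hat, a @ y)   # y_hat_i is a linear combo of the y's
```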

6.

\[\begin{aligned} \hat{y}_i &= \hat{\beta}_0 + x_i \hat{\beta}_1 \\ &= \left(\bar{y} - \hat{\beta}_1 \bar{x} \right) + x_i \hat{\beta}_1 \end{aligned}\]

Taking \(x_i = \bar{x}\): \[\begin{aligned} \hat{y}_i &= \bar{y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 \bar{x} \\ &= \bar{y} \end{aligned}\]
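A numerical spot check, on simulated data, that the least squares line always passes through \((\bar{x}, \bar{y})\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3 - 2 * x + rng.normal(size=50)

b1, b0 = np.polyfit(x, y, 1)       # simple least squares with intercept
assert np.isclose(b0 + b1 * x.mean(), y.mean())   # line hits (x_bar, y_bar)
```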

7.

\[\begin{aligned} R^2 &= 1 - \frac{\text{RSS}}{\text{TSS}} \\ &= \frac{\text{ESS}}{\text{TSS}} \\ &= \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \\[0.8em] &\text{since } \bar{y} = 0, \\[0.8em] &= \frac{\sum_{i=1}^n \hat{y}_i^2}{\sum_{i=1}^n y_i^2} \\[0.8em] &\text{and } \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \\ &\text{but } \bar{x} = 0 \;\; \Rightarrow \;\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 0, \\ &\therefore \hat{y}_i = \hat{\beta}_1 x_i \\[0.8em] &\text{so} \\[0.6em] R^2 &= \frac{\sum_{i=1}^n (\hat{\beta}_1 x_i)^2}{\sum_{i=1}^n y_i^2} \\ &= \frac{\hat{\beta}_1^2 \sum_{i=1}^n x_i^2}{\sum_{i=1}^n y_i^2} \\[0.8em] &\text{and since (when }\bar{x}=0, \: \bar{y}=0\text{)} \\[0.8em] \hat{\beta}_1 &= \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}, \\[0.8em] &\text{therefore} \\[0.8em] R^2 &= \frac{\left( \frac{\sum x_i y_i}{\sum x_i^2} \right)^2 \sum x_i^2}{\sum y_i^2} \\ &= \frac{\left( \sum x_i y_i \right)^2}{\left( \sum x_i^2 \right) \left( \sum y_i^2 \right)} \\[0.8em] \sqrt{R^2} &= \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2} \: \sqrt{\sum_{i=1}^n y_i^2}} \\ &= \frac{\sum_{i=1}^n \left(x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{\sqrt{\sum_{i=1}^n \left(x_i - \bar{x}\right)^2} \: \sqrt{\sum_{i=1}^n \left(y_i - \bar{y}\right)^2}} = \text{Cor}(X, Y) \end{aligned}\]
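In simple linear regression the identity \(R^2 = \mathrm{Cor}(X, Y)^2\) holds even without centering; a quick numerical confirmation on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1 + 0.5 * x + rng.normal(size=100)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
assert np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2)   # R^2 == r^2
```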

Applied

8.

from ISLP import load_data
from ISLP.models import ModelSpec as MS
from ISLP.models import summarize
import statsmodels.api as sm

auto = load_data('Auto')
y = auto['mpg']
X = MS(['horsepower']).fit_transform(auto)
model_8 = sm.OLS(y, X)
results = model_8.fit()
summarize(results)
coef std err t P>|t|
intercept 39.9359 0.717 55.660 0.0
horsepower -0.1578 0.006 -24.489 0.0
  1. Yes. The p-value for horsepower is less than 0.05, indicating a significant relationship at a 95% confidence level.

  2. A one unit increase in horsepower results in a 0.1578 unit decrease in average mpg.

  3. The relationship is negative, as we can see from the coefficient.

import pandas as pd

new_data = pd.DataFrame({'horsepower' : [98]})
X_new = MS(['horsepower']).fit_transform(new_data)
prediction = results.get_prediction(X_new)
print(f"prediction: {prediction.predicted[0]}")

conf_interval = prediction.conf_int(alpha = 0.05)
pred_interval = prediction.conf_int(alpha = 0.05, obs=True)
print(f'Confidence Interval: {conf_interval} \nPrediction Interval: {pred_interval}')
prediction: 24.467077152512424
Confidence Interval: [[23.97307896 24.96107534]] 
Prediction Interval: [[14.80939607 34.12475823]]
import matplotlib.pyplot as plt

parameters = results.params
fig, ax = plt.subplots(figsize = (5,5))
ax.scatter(auto['horsepower'], auto['mpg'])
ax.axline(xy1=(0, parameters['intercept']), slope=parameters['horsepower'], color = 'red', linewidth=2)
ax.set_xlabel('horsepower')
ax.set_ylabel('mpg')
plt.show()

fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(results.predict(), results.resid)
ax.axhline(0, linestyle='--')
ax.set_title('Fitted Values vs. Residuals')
ax.set_ylabel('Residuals')
ax.set_xlabel('Fitted Values')
plt.show()

We can see clear non-linearity in the data here; the relationship looks quadratic.
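One way to check the quadratic suspicion is to add a squared term and compare R². ISLP's poly() helper does this on the real data; the sketch below is a numpy-only stand-in on synthetic curved data, using lstsq for both fits (the helper r2 and the data-generating coefficients are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(40, 230, 300)
y = 55 - 0.4 * x + 0.001 * x**2 + rng.normal(0, 2, 300)  # curved truth

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x**2])

# the quadratic fit should explain more variance on curved data
assert r2(X_quad, y) > r2(X_lin, y)
```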

fig, ax = plt.subplots(figsize=(5,5))
influence = results.get_influence()
leverage = influence.hat_matrix_diag
studentized_resid = influence.resid_studentized_internal
ax.scatter(leverage, studentized_resid)
ax.axhline(0, linestyle='--') 
ax.set_title('Leverage Plot')
ax.set_xlabel('Leverage')
ax.set_ylabel('Studentized Residuals')
plt.show()

We should keep leverage in mind here; some of this high leverage may itself be a consequence of the non-linearity we have not yet modeled.
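The leverage values plotted above are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, which is what statsmodels' hat_matrix_diag returns. A small self-contained sketch (synthetic design matrix) confirms the two facts used later: the leverages sum to the number of parameters p + 1, so the average leverage is (p + 1)/n, and each leverage is at least 1/n when an intercept is included:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
# intercept column plus p random predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# trace(H) = rank(X) = p + 1, so average leverage is (p + 1) / n
assert np.isclose(leverage.sum(), p + 1)
# with an intercept, every leverage is at least 1/n
assert np.all(leverage >= 1.0 / n - 1e-12)
```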

9.

import seaborn as sns

fig, ax = plt.subplots()
sns.heatmap(auto.corr(), annot=True, cmap='RdBu', ax=ax)
ax.set_title('Correlation Matrix')
plt.show()

auto.corr()
mpg cylinders displacement horsepower weight acceleration year origin
mpg 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329 0.580541 0.565209
cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683 -0.345647 -0.568932
displacement -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800 -0.369855 -0.614535
horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196 -0.416361 -0.455171
weight -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839 -0.309120 -0.585005
acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000 0.290316 0.212746
year 0.580541 -0.345647 -0.369855 -0.416361 -0.309120 0.290316 1.000000 0.181528
origin 0.565209 -0.568932 -0.614535 -0.455171 -0.585005 0.212746 0.181528 1.000000
# the Auto data loaded from ISLP has no 'name' column, so nothing needs dropping
predictors = auto.columns.drop(['mpg'])
X_9 = MS(predictors).fit_transform(auto)
model_9 = sm.OLS(y, X_9)
results_9 = model_9.fit()
summarize(results_9)
coef std err t P>|t|
intercept -17.2184 4.644 -3.707 0.000
cylinders -0.4934 0.323 -1.526 0.128
displacement 0.0199 0.008 2.647 0.008
horsepower -0.0170 0.014 -1.230 0.220
weight -0.0065 0.001 -9.929 0.000
acceleration 0.0806 0.099 0.815 0.415
year 0.7508 0.051 14.729 0.000
origin 1.4261 0.278 5.127 0.000
print(f'F-value: {results_9.fvalue} \nP-value: {results_9.f_pvalue}')
F-value: 252.42804529131908 
P-value: 2.037105930754976e-139

Yes, there is a relationship between the predictors and the response.

Even with suspected strong multicollinearity, which tends to hide significant variables, all predictors except cylinders, horsepower, and acceleration appear significant in the model.
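The multicollinearity suspected above can be quantified with variance inflation factors, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The sketch below (synthetic data with two deliberately near-collinear columns; the vif helper is mine, not a library function) shows the pattern:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n),   # nearly collinear pair
                     z + 0.1 * rng.normal(size=n),
                     rng.normal(size=n)])             # independent column

def vif(X, j):
    # regress column j on the other columns (with intercept)
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2 = 1 - np.sum(resid**2) / np.sum((X[:, j] - X[:, j].mean())**2)
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
# the collinear pair should have large VIFs; the independent column, near 1
assert vifs[0] > 5 and vifs[1] > 5 and vifs[2] < 2
```

Running the same helper on the Auto predictors (e.g. displacement, cylinders, weight, whose pairwise correlations above exceed 0.89) would make the inflated standard errors concrete.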

A one year increase, holding all other predictors constant, is associated with a 0.7508 unit increase in average mpg.

fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(results_9.predict(), results_9.resid)
ax.axhline(0, linestyle='--')
ax.set_title('Fitted Values vs. Residuals')
ax.set_ylabel('Residuals')
ax.set_xlabel('Fitted Values')
plt.show()

The residuals vs. fitted plot suggests that, similar to the simple linear regression fitted earlier, there is some non-linearity we haven't captured yet.

fig, ax = plt.subplots(figsize=(5,5))
influence_9 = results_9.get_influence()
leverage_9 = influence_9.hat_matrix_diag
average_leverage = len(X_9.columns) / len(X_9)
warning_leverage = 2 * average_leverage
studentized_resid_9 = influence_9.resid_studentized_internal

ax.scatter(leverage_9, studentized_resid_9)
ax.axhline(0, linestyle='--') 
ax.axvline(average_leverage, linestyle='--')
ax.axvline(warning_leverage, linestyle = '--', color='r')
ax.set_title('Leverage Plot')
ax.set_xlabel('Leverage')
ax.set_ylabel('Studentized Residuals')
plt.show()

Several points exceed both the average leverage and |2| studentized residuals, and a few of those reach past twice the average leverage. One point has very high leverage while staying within |2| studentized residuals. Outliers clearly exist, most of them below twice the average leverage, and there is definitely at least one unusually high-leverage observation here.

sns.pairplot(auto, height=1, plot_kws= {'linewidth':0, 's':5})
plt.show()