An Introduction to Statistical Learning: with Applications in Python (Gareth James et al., 2023, 1st edition)
My answers to the exercises in ISLP
Tags: Answers, Books
Author: Natnael Getahun
Published: February 21, 2026
2.4 Exercises
Conceptual
1.
(a) More flexible will be better.
Justification: as the number of observations increases, we can be less concerned about variance, since new data points wouldn't change the estimated f by much. This means we can focus on decreasing bias, which favours more flexible statistical learning methods.
(b) Less flexible will be better.
Justification: with a large number of predictors and a small number of observations, new data points can change our estimated f by a lot, forcing us to be more careful about the effects of variance. We can't afford flexibility, as it will come with high variance, even if it decreases bias.
(c) More flexible will be better.
Justification: with a highly non-linear relationship, our concern is that the estimated model will be far from the underlying model, i.e., huge bias. We can decrease bias by using more flexible methods.
(d) Less flexible will be better.
Justification: with high variance of the error terms, a highly flexible statistical learning method risks following the noise rather than the pattern in the data. New data points would have the ability to change the estimated f significantly. We can reduce this by using less flexible methods.
3.
Squared bias: as flexibility increases, the estimated function has a better chance of approximating the underlying function, reducing bias.
Variance: as flexibility increases, the function starts mirroring the noise in the data rather than the underlying pattern. This makes it sensitive to changes in the data, increasing variance.
Bayes (irreducible) error: this is the limiting factor. It can be error caused by unaccounted-for predictors or by underlying, unexplainable noise. The test error cannot drop below this line.
Training error: as flexibility increases, the statistical method fits the training data ever more closely, whether or not it generalizes well to the test set, so training error decreases monotonically.
Test error: this measures the accuracy of the model on previously unseen data. It is the sum of the Bayes error, the variance, and the squared bias. As flexibility increases from none to its optimal value (where the sum of variance and squared bias is at its lowest), the test error drops. Past this optimal value, however, more flexibility increases the variance significantly and the test error rises again, giving the characteristic U shape.
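The shape of the training-error curve can be sanity-checked numerically: least-squares polynomial fits of increasing degree are nested models, so the training MSE can never go up as flexibility grows. A small simulation sketch (made-up data, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 1, n)
y = 2 + 3 * x + rng.normal(scale=0.5, size=n)  # linear truth plus noise

def train_mse(degree):
    # Least-squares polynomial fit; higher degrees nest lower ones.
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    return np.mean(resid ** 2)

mses = [train_mse(d) for d in range(1, 6)]
# Training error is non-increasing in flexibility (up to tiny numerical noise).
assert all(mses[i + 1] <= mses[i] + 1e-9 for i in range(len(mses) - 1))
print(mses)
```

The test error, by contrast, would need fresh draws of y to evaluate, which is exactly why it can rise once the fit starts chasing noise.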
4.
Classification example: spam email detection; response = spam / not spam; predictors = words and phrases used in the email; goal = prediction.
Clustering example: grouping observations based on their attributes for better sampling.
5.
Advantages: a very flexible approach can explain the data better when that is needed. Especially if the relationships are non-linear and rugged, a flexible approach can be useful. Flexible approaches result in models with lower bias.
Disadvantages: it can easily overfit the data, failing to generalize to unseen samples. It usually needs larger amounts of data than its less flexible alternatives to work well.
A less flexible approach will be preferred when we have fewer data points, or in cases where newly introduced data can change the model significantly.
6.
Parametric statistical learning approaches focus on finding parameters that minimize a certain loss function (or maximize the likelihood / conditional probability of y given x). In other words, parametric methods assume a functional form. This makes parametric approaches simpler to build and easier to interpret: in cases where parameters are associated with predictors, we can infer and explain the effect of the predictors on the response variable. Non-parametric approaches don't rely on finding values of parameters. This makes them harder to understand and more difficult to interpret; they are primarily used for prediction, since inference is almost impossible.
7.

```python
def simple_knn(k=1):
    df = distance_df.sort_values(by='distance', ascending=True)
    df_of_interest = df.iloc[:k]
    return df_of_interest['Y'].mode().item()

prediction = simple_knn()
print(f'For K = 1 our prediction is: {prediction}')
```
For K = 1 our prediction is: Green
```python
prediction = simple_knn(k=3)
print(f'For K = 3 our prediction is: {prediction}')
```
For K = 3 our prediction is: Red
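The snippets above depend on a `distance_df` that isn't shown; a self-contained reconstruction from the Exercise 7 table (observations, labels, and test point (0, 0, 0) as given in the book) might look like:

```python
import numpy as np
import pandas as pd

# Exercise 7 data: three predictors and a label per observation.
data = pd.DataFrame({
    'X1': [0, 2, 0, 0, -1, 1],
    'X2': [3, 0, 1, 1, 0, 1],
    'X3': [0, 0, 3, 2, 1, 1],
    'Y':  ['Red', 'Red', 'Red', 'Green', 'Green', 'Red'],
})

# Euclidean distance from the test point (0, 0, 0).
distance_df = data.copy()
distance_df['distance'] = np.sqrt(data[['X1', 'X2', 'X3']].pow(2).sum(axis=1))

def simple_knn(k=1):
    nearest = distance_df.sort_values('distance').iloc[:k]
    return nearest['Y'].mode().item()

print(simple_knn(k=1))  # Green
print(simple_knn(k=3))  # Red
```

The single nearest neighbour is observation 5 (distance √2, Green); the three nearest are observations 5, 6, and 2 (Green, Red, Red), giving Red.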
We need to choose K so that the KNN method becomes flexible enough to capture the non-linearity of the Bayes decision boundary. The lower K is, the more flexible our method becomes. We would therefore expect the best value of K to be small.
Applied
8.
```python
import pandas as pd

college = pd.read_csv('data/College.csv')
```
9.
weight 350
name 304
mpg 129
acceleration 95
horsepower 94
displacement 82
year 13
cylinders 5
origin 3
dtype: int64
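The counts above look like the output of `nunique()` sorted in descending order on the Auto frame; on a tiny made-up frame (not the real Auto data) the same call behaves like this:

```python
import pandas as pd

# Toy frame standing in for Auto (made-up values).
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0],
    'cylinders': [8, 8, 8, 6],
    'origin': [1, 1, 2, 3],
})

# Number of distinct values per column, most varied first.
unique_counts = toy.nunique().sort_values(ascending=False)
print(unique_counts)
```

Columns with very few unique values relative to the row count (like cylinders and origin above) are the natural candidates for categorical treatment.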
```python
print('Name, year, cylinders, and origin can be seen as categorical variables. \n"Year", however, can be treated as discrete numerical data. \n"Horsepower" has its type misclassified.')
```

Name, year, cylinders, and origin can be seen as categorical variables.
"Year", however, can be treated as discrete numerical data.
"Horsepower" has its type misclassified.
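In the Auto data, the "horsepower" misclassification usually comes from '?' placeholders for missing values, which force the column to dtype object. A hedged sketch of the fix, on toy values rather than the real file:

```python
import pandas as pd

# '?' placeholders force the column to dtype object.
hp = pd.Series(['130', '165', '?', '150'], name='horsepower')

# Coerce non-numeric entries to NaN so the column becomes float.
hp_numeric = pd.to_numeric(hp, errors='coerce')
print(hp_numeric.dtype)
print(hp_numeric.isna().sum())
```

After coercion, the resulting NaNs can be dropped or imputed before modelling.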
```python
for col in auto.columns:
    if 'int' in str(auto[col].dtype) or 'float' in str(auto[col].dtype):
        col_range = auto[col].max() - auto[col].min()
        print(f'{col}_range = {col_range}')
```
```python
print('Notice that while mpg is negatively correlated with most variables, a non-linear fit might better explain the data.\nIf we use learning methods that make assumptions about the variance of the error term, we need to be careful about multicollinearity.')
```

Notice that while mpg is negatively correlated with most variables, a non-linear fit might better explain the data.
If we use learning methods that make assumptions about the variance of the error term, we need to be careful about multicollinearity.
```python
print('We see strong correlations between "mpg" and other variables, whether positive or negative. \nWhile our correlation mapping suggests that a linear regression might do well, \nour scatter plots suggest we will benefit from exploring more flexible methods that can follow the slight non-linearity.')
```

We see strong correlations between "mpg" and other variables, whether positive or negative.
While our correlation mapping suggests that a linear regression might do well,
our scatter plots suggest we will benefit from exploring more flexible methods that can follow the slight non-linearity.
10.
```python
from ISLP import load_data

boston = load_data('Boston')
boston.head()
```
|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv |
|---|---------|------|------|---|-------|-------|------|--------|---|-----|------|------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |
```python
n_row = boston.shape[0]
n_col = boston.shape[1]
cols = ''
for col in boston.columns:
    cols += col + ' '
print(f'There are {n_row} rows.')
print(f'There are {n_col} cols.')
print(f'The rows represent each instance/observation.\nThe columns represent the following variables: {cols}.')
```
There are 506 rows.
There are 13 cols.
The rows represent each instance/observation.
The columns represent the following variables: crim zn indus chas nox rm age dis rad tax ptratio lstat medv .
```python
sns.pairplot(boston, height=1, plot_kws={'linewidth': 0, 's': 5})
print('We can suspect the "rad", "tax", and "ptratio" variables have been capped, had their missing values filled, or have a strong "mode".\nTransformation of the per capita crime rate can help our model.\nTransforming other variables might not be a bad idea either.\nWe might need to be careful with multicollinearity.')
```

We can suspect the "rad", "tax", and "ptratio" variables have been capped, had their missing values filled, or have a strong "mode".
Transformation of the per capita crime rate can help our model.
Transforming other variables might not be a bad idea either.
We might need to be careful with multicollinearity.
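One concrete transformation for the heavily right-skewed per capita crime rate is a log. A sketch on simulated lognormal values standing in for "crim" (not the actual Boston column):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Lognormal values mimic a right-skewed rate like 'crim'.
crim_like = rng.lognormal(mean=0.0, sigma=1.5, size=1000)

raw_skew = skew(crim_like)
log_skew = skew(np.log1p(crim_like))  # log1p is safe when values hit zero

print(raw_skew, log_skew)
```

The log pulls in the long right tail, which tends to help models that assume roughly symmetric inputs or residuals.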
```python
sns.heatmap(boston.corr(numeric_only=True), annot=True)
print('Many variables promise to be associated with per capita crime rate. \n"nox", "age", and "lstat" appear to be positively correlated, \nwhile we should be careful about claiming "rad" and "tax" as good predictors, since the Pearson correlations and our scatter plots tell different stories.\n"medv", "dis", and "rm" appear to be negatively correlated with per capita crime rate.')
```

Many variables promise to be associated with per capita crime rate.
"nox", "age", and "lstat" appear to be positively correlated,
while we should be careful about claiming "rad" and "tax" as good predictors, since the Pearson correlations and our scatter plots tell different stories.
"medv", "dis", and "rm" appear to be negatively correlated with per capita crime rate.
```python
for col in boston.columns:
    if 'int' in str(boston[col].dtype) or 'float' in str(boston[col].dtype):
        col_range = boston[col].max() - boston[col].min()
        print(f'{col}_range = {col_range}')
```
3.7 Exercises
Conceptual
1.
Intercept: \[\begin{aligned}
H_0 &: \beta_0 = 0 \\
H_1 &: \beta_0 \neq 0
\end{aligned}\] Decision: Reject the null hypothesis. The p-value suggests that the intercept is significant. When all advertising budgets are zero there will be \(2.939 \times 1000\) units of sales.
TV: \[\begin{aligned}
H_0 &: \beta_1 = 0 \\
H_1 &: \beta_1 \neq 0
\end{aligned}\] Decision: Reject the null hypothesis. The p-value suggests TV advertising is a significant predictor of sales. Holding all other forms of advertising constant, a one unit increase in TV advertising results in an average increase of sales by \(0.046 \times 1000\) units.
Radio: \[\begin{aligned}
H_0 &: \beta_2 = 0 \\
H_1 &: \beta_2 \neq 0
\end{aligned}\] Decision: Reject the null hypothesis. The p-value suggests radio advertising is a significant predictor of sales. Holding all other forms of advertising constant, a one unit increase in radio advertising results in an average increase of sales by \(0.189 \times 1000\) units.
Newspaper: \[\begin{aligned}
H_0 &: \beta_3 =0 \\
H_1 &: \beta_3 \neq 0
\end{aligned}\] Decision: Fail to reject the null hypothesis. The p-value suggests newspaper advertising is not a significant predictor of sales at the 95% confidence level. There is no point in interpreting a non-significant predictor's coefficient.
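These decisions follow from the t-statistics implied by the coefficient estimates and standard errors in the Advertising regression (numbers reproduced here from memory of ISLP's Table 3.4, so treat them as approximate):

```python
from scipy import stats

# Coefficient estimate and standard error per term (approximate, from memory).
coefs = {
    'Intercept': (2.939, 0.3119),
    'TV': (0.046, 0.0014),
    'radio': (0.189, 0.0086),
    'newspaper': (-0.001, 0.0059),
}
df = 200 - 4  # 200 observations minus 4 estimated coefficients

p_values = {}
for name, (b, se) in coefs.items():
    t_stat = b / se
    # Two-sided p-value from the t distribution.
    p_values[name] = 2 * stats.t.sf(abs(t_stat), df)
    print(f'{name}: t = {t_stat:.2f}, p = {p_values[name]:.4f}')
```

Only the newspaper p-value lands above 0.05, which is exactly why that null hypothesis is the only one we fail to reject.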
2.
They are both based on the k nearest observations and the assignment of a value based on the values of the observations nearest to a new point, yet the KNN classifier can only designate one of the provided discrete categorical labels to the newcomer. In contrast, KNN regression can assign any value along a continuous scale, as long as it is the mean of the k nearest observations (assuming we are using the mean).
KNN regression uses the mean to assign values to a newcomer. The KNN classifier uses conditional probability (the mode).
Of course, the KNN classifier and KNN regression also differ in the common ways regression differs from classification. Classification produces decision boundaries, while regression produces curves. Classification is evaluated with precision, recall, etc. Regression is evaluated by RMSE, MAE, etc.
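The mean-versus-mode distinction can be made concrete with a toy neighbourhood (hypothetical responses for the k = 3 nearest neighbours):

```python
import numpy as np
from collections import Counter

# Responses of the k = 3 nearest neighbours (hypothetical values).
numeric_y = np.array([10.0, 12.0, 20.0])
class_y = ['Red', 'Red', 'Green']

# Regression averages the neighbours' responses.
knn_regression = numeric_y.mean()  # 14.0
# Classification takes the most common label among the neighbours.
knn_classifier = Counter(class_y).most_common(1)[0][0]  # 'Red'

print(knn_regression, knn_classifier)
```

The same neighbourhood yields a continuous prediction in one setting and a discrete label in the other.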
3.
The correct answer: iii.
Reason: for fixed values of IQ and GPA, the average starting salary (in thousands) changes by \(35 + (-10 \times GPA)\) for college graduates, and by 0 for the high school level. If the fixed GPA is below 3.5, college graduates earn more, on average, than high school graduates. If the fixed GPA is 3.5, both categories, on average, earn the same. If the fixed GPA is above 3.5, college graduates earn less, on average, than high school graduates.
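Plugging the exercise's fitted coefficients (50, 20 for GPA, 0.07 for IQ, 35 for Level, 0.01 for GPA×IQ, -10 for GPA×Level) into the college-minus-high-school salary gap makes the 3.5 GPA cutoff explicit:

```python
def salary(gpa, iq, college):
    # Fitted model from the exercise; college = 1 for college graduates.
    return (50 + 20 * gpa + 0.07 * iq + 35 * college
            + 0.01 * gpa * iq - 10 * gpa * college)

def college_gap(gpa, iq=100):
    # The IQ terms cancel out of the difference, so its value is irrelevant.
    return salary(gpa, iq, 1) - salary(gpa, iq, 0)

print(college_gap(3.0))  # 5.0  (college earns more)
print(college_gap(3.5))  # 0.0  (break-even)
print(college_gap(4.0))  # -5.0 (college earns less)
```

The gap reduces to \(35 - 10 \times GPA\), crossing zero exactly at GPA = 3.5.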
Reason: we can't deduce the presence of an interaction effect using only the coefficient values. We need to properly test the significance of the coefficient: if its standard error is very small, even a small coefficient value can be deemed significant.
4.
For the training set, I expect the RSS of the cubic regression to be lower. The cubic regression has more flexibility, and with a truly linear model underneath, the cubic and quadratic terms will most likely start following the noise in the data rather than the true pattern. This moves some of the error that should have been in the RSS to the ESS (the explained sum of squares, the variation explained by the regression), thus reducing the RSS.
I expect the opposite for the test set. The linear model is most likely to result in a lower RSS than the cubic regression. This comes as a consequence of the cubic regression following the noise of the training data rather than the linearity of the underlying model.
Even though one can argue that we don't have enough information, since we don't know the actual form of the non-linearity in the underlying model, I think there is enough to make an educated guess. As a result of its flexibility, we can expect the cubic model to have a better chance of approximating the non-linearity in the underlying pattern, resulting in a lower training RSS.
Same answer as (c): expected lower test RSS for the cubic regression.
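The training-set claim in (a) can be checked by simulation: the cubic model nests the linear one, so its training RSS can never be higher. A sketch on simulated data with a truly linear pattern underneath:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.linspace(-2, 2, n)
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)  # truly linear underneath

def train_rss(degree):
    # Least-squares polynomial fit and its residual sum of squares.
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    return np.sum(resid ** 2)

rss_linear = train_rss(1)
rss_cubic = train_rss(3)
print(rss_linear, rss_cubic)
# Nested models: the cubic fit is never worse on the training data.
assert rss_cubic <= rss_linear + 1e-9
```

The test-set comparison would require fresh draws of y; there the extra cubic terms, having partly fit the noise, are what we expect to hurt.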
```python
# there is no attribute called 'name' in this 'auto' dataset
predictors = auto.columns.drop(['mpg'])
X_9 = MS(predictors).fit_transform(auto)
model_9 = sm.OLS(y, X_9)
results_9 = model_9.fit()
summarize(results_9)
```
Yes, there is a relationship between the predictors and the response.
Even with suspected strong multicollinearity, which tends to hide significant variables, all predictors except cylinders, horsepower, and acceleration appear significant in the model.
A one-year increase in model year, keeping all other factors constant, is associated with a 0.7508 increase in mpg on average.
```python
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(results_9.predict(), results_9.resid)
ax.axhline(0, linestyle='--')
ax.set_title('Fitted Values vs. Residuals')
ax.set_ylabel('Residuals')
ax.set_xlabel('Fitted Values')
plt.show()
```
The residuals vs. fitted plot suggests that, similar to the simple linear regression we fitted earlier, there is some non-linearity we haven't captured yet.
There are some points that both exceed the average leverage and lie beyond |2| in studentized residuals. A few points exceed similar studentized-residual limits while reaching past twice the average leverage. One point has very high leverage but stays within |2| studentized residuals. Outliers clearly exist, most with leverage below twice the average. There is definitely an unusually high-leverage point here.