```python
import scipy
import numpy as np
import pandas as pd
```

```python
data = pd.read_csv('./data/prostate.data', delimiter='\t', index_col=0)
data.head()
```
| | lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa | train |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.579818 | 2.769459 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.430783 | T |
| 2 | -0.994252 | 3.319626 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |
| 3 | -0.510826 | 2.691243 | 74 | -1.386294 | 0 | -1.386294 | 7 | 20 | -0.162519 | T |
| 4 | -1.203973 | 3.282789 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.162519 | T |
| 5 | 0.751416 | 3.432373 | 62 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.371564 | T |
The correlation matrix of the predictors (with the response lpsa included as the last row and column):
```python
data.corr()
```
| | lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa |
|---|---|---|---|---|---|---|---|---|---|
| lcavol | 1.000000 | 0.280521 | 0.225000 | 0.027350 | 0.538845 | 0.675310 | 0.432417 | 0.433652 | 0.734460 |
| lweight | 0.280521 | 1.000000 | 0.347969 | 0.442264 | 0.155385 | 0.164537 | 0.056882 | 0.107354 | 0.433319 |
| age | 0.225000 | 0.347969 | 1.000000 | 0.350186 | 0.117658 | 0.127668 | 0.268892 | 0.276112 | 0.169593 |
| lbph | 0.027350 | 0.442264 | 0.350186 | 1.000000 | -0.085843 | -0.006999 | 0.077820 | 0.078460 | 0.179809 |
| svi | 0.538845 | 0.155385 | 0.117658 | -0.085843 | 1.000000 | 0.673111 | 0.320412 | 0.457648 | 0.566218 |
| lcp | 0.675310 | 0.164537 | 0.127668 | -0.006999 | 0.673111 | 1.000000 | 0.514830 | 0.631528 | 0.548813 |
| gleason | 0.432417 | 0.056882 | 0.268892 | 0.077820 | 0.320412 | 0.514830 | 1.000000 | 0.751905 | 0.368987 |
| pgg45 | 0.433652 | 0.107354 | 0.276112 | 0.078460 | 0.457648 | 0.631528 | 0.751905 | 1.000000 | 0.422316 |
| lpsa | 0.734460 | 0.433319 | 0.169593 | 0.179809 | 0.566218 | 0.548813 | 0.368987 | 0.422316 | 1.000000 |
Scatterplot matrix (showing every pairwise plot between the variables): we see that svi is a binary variable and that gleason is an ordered categorical variable.
```python
pd.plotting.scatter_matrix(data, figsize=(12, 12))
```

*(Figure: scatterplot matrix of the prostate data.)*
Multiple linear regression
```python
y = data['lpsa']
```
(67, 9)
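The rest of the preprocessing in the cell above is truncated. A minimal sketch that reproduces a 67 × 9 training design matrix, assuming the eight predictors are standardized to unit variance, an intercept column is prepended, and the train/test split follows the `train` column; the name `train_mask` is my own, while `X`, `X_train`, `X_test`, `y_train`, and `y_test` are the names used in later cells:

```python
X = data.drop(['lpsa', 'train'], axis=1)      # the 8 predictors only
X = (X - X.mean()) / X.std()                  # standardize to zero mean, unit variance (assumption: full-data statistics)

train_mask = data['train'] == 'T'

X_train = X[train_mask].copy()
X_train.insert(0, 'intercept', 1.0)           # prepend a column of ones
y_train = y[train_mask]

X_test = X[~train_mask].copy()
X_test.insert(0, 'intercept', 1.0)
y_test = y[~train_mask]

X_train.shape                                 # (67, 9): 67 training cases, intercept + 8 predictors
```

Note that `X` keeps only the eight predictor columns, which is why `len(X.columns)` serves as p in the variance estimate below.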
$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Fitted values at the training inputs:

$$\hat{y} = X (X^T X)^{-1} X^T y$$
```python
def ols(X_, y_):
    return np.linalg.inv(X_.T.dot(X_)).dot(X_.T).dot(y_)  # beta_hat = (X'X)^{-1} X'y
```
(67,)
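With `ols` defined, the coefficient vector and the fitted values at the training inputs follow directly; a short sketch using the names that appear in the later cells (`w`, `y_hat`):

```python
w = ols(X_train, y_train)   # least squares coefficients, one per column of X_train
y_hat = X_train.dot(w)      # fitted values at the training inputs
y_hat.shape                 # (67,)
```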
Unbiased estimate of $\sigma^2$:

$$\hat{\sigma}^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
```python
# N - p - 1 with p = len(X.columns) = 8 predictors (the intercept is not counted in p)
sigma2_hat = sum((y_train - y_hat)**2) / (len(y_train) - 1 - len(X.columns))
```
0.5073514562053173
The variance–covariance matrix of the least squares parameter estimates:

$$\mathrm{Var}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$$
```python
std_w = np.sqrt(np.diag(np.linalg.inv(X_train.T.dot(X_train))) * sigma2_hat)
```
array([ 0.08931498, 0.12597461, 0.095134 , 0.10081871, 0.10169077, 0.1229615 , 0.15373073, 0.14449659, 0.1528197 ])
```python
table = pd.DataFrame(columns=['Term', 'Coefficient', 'Std. Error', 'Z Score'])
```
| | Term | Coefficient | Std. Error | Z Score |
|---|---|---|---|---|
| 0 | Intercept | 2.464933 | 0.089315 | 27.598203 |
| 1 | lcavol | 0.676016 | 0.125975 | 5.366290 |
| 2 | lweight | 0.261694 | 0.095134 | 2.750789 |
| 3 | age | -0.140734 | 0.100819 | -1.395909 |
| 4 | lbph | 0.209061 | 0.101691 | 2.055846 |
| 5 | svi | 0.303623 | 0.122962 | 2.469255 |
| 6 | lcp | -0.287002 | 0.153731 | -1.866913 |
| 7 | gleason | -0.021195 | 0.144497 | -0.146681 |
| 8 | pgg45 | 0.265576 | 0.152820 | 1.737840 |
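The cell that fills the table is not shown; a sketch of one way to assemble it from `w` and `std_w`. The Z score is simply each coefficient divided by its standard error, and a Z score greater than about 2 in absolute value is roughly significant at the 5% level:

```python
terms = ['Intercept'] + list(X.columns)   # same order as the columns of X_train
table = pd.DataFrame({
    'Term': terms,
    'Coefficient': w,
    'Std. Error': std_w,
    'Z Score': w / std_w,
})
table
```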
We can also test for the exclusion of a number of terms at once, using the F-statistic

$$F = \frac{(\mathrm{RSS}_2 - \mathrm{RSS}_1)/(p_1 - p_2)}{\mathrm{RSS}_1/(N - p_1)},$$

where model 1 is the larger model with $p_1$ parameters and residual sum of squares $\mathrm{RSS}_1$, model 2 is the smaller nested model with $p_2$ parameters and $\mathrm{RSS}_2$, and

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2 = \sum_{i=1}^{N} \Bigl(y_i - \beta_0 - \sum_{j=1}^{p} X_{ij}\beta_j\Bigr)^2.$$
We consider dropping all the non-significant terms, namely age, lcp, gleason, and pgg45.
```python
X2 = X.drop(['age', 'lcp', 'gleason', 'pgg45'], axis=1)
```
F: 1.67, Pr(F(4, 58) > 1.67) = 0.17
```python
RSS1, RSS2
```
(29.4263844599084, 32.81499474881556)
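A sketch of the F computation behind these numbers, reusing `train_mask`, `ols`, and `w` from the sketches above; `X2_train` is my own name for the reduced training design, and `scipy.stats.f` supplies the tail probability (the original cell is not shown):

```python
from scipy import stats

X2_train = X2[train_mask].copy()
X2_train.insert(0, 'intercept', 1.0)                                # reduced model: intercept + 4 predictors

RSS1 = sum((y_train - X_train.dot(w)) ** 2)                         # full model, p1 = 9 parameters
RSS2 = sum((y_train - X2_train.dot(ols(X2_train, y_train))) ** 2)   # reduced model, p2 = 5 parameters

p1, p2, N = X_train.shape[1], X2_train.shape[1], len(y_train)
F = ((RSS2 - RSS1) / (p1 - p2)) / (RSS1 / (N - p1))
p_value = stats.f.sf(F, p1 - p2, N - p1)                            # Pr(F(4, 58) > F)
print(f'F: {F:.2f}, Pr(F({p1 - p2},{N - p1}) > {F:.2f}) = {p_value:.2f}')
```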
The mean prediction error on the test data is 0.521. In contrast, prediction using the mean training value of lpsa has a test error of 1.057, which is called the “base error rate.” Hence the linear model reduces the base error rate by about 50%. We will return to this example later to compare various selection and shrinkage methods.
```python
y_test_fit = X_test.dot(w)
```
base_error: 1.057 prediction_error: 0.521
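A sketch of how these two figures can be computed: the mean squared error over the 30 held-out test observations, and the "base error rate" obtained by predicting the mean training value of lpsa for every test case (the original cell is truncated):

```python
prediction_error = np.mean((y_test - y_test_fit) ** 2)      # test MSE of the least squares fit
base_error = np.mean((y_test - y_train.mean()) ** 2)        # test MSE of predicting mean training lpsa
print(f'base_error: {base_error:.3f} prediction_error: {prediction_error:.3f}')
```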
Best-Subset Selection
```python
import math
```

*(Figure: best-subset selection plot.)*
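The code that produced the plot is truncated above. A minimal sketch of best-subset selection under the same setup: for every subset of the eight predictors, fit least squares on the training data and record the residual sum of squares (`itertools` and `matplotlib` are assumptions here, not shown in the original):

```python
from itertools import combinations
import matplotlib.pyplot as plt

predictors = list(X.columns)                  # the 8 standardized predictors
sizes, rss_values = [], []
for k in range(len(predictors) + 1):
    for subset in combinations(predictors, k):
        Xk = X_train[['intercept'] + list(subset)]
        wk = ols(Xk, y_train)
        sizes.append(k)
        rss_values.append(sum((y_train - Xk.dot(wk)) ** 2))

plt.scatter(sizes, rss_values, s=10)
plt.xlabel('Subset size k')
plt.ylabel('Residual sum of squares (training)')
plt.show()
```

The lower envelope of this scatter is the best RSS attainable for each subset size; the size itself is usually chosen by cross-validation or a criterion such as AIC rather than by training RSS.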
Ridge regression
```python
U, D, V = np.linalg.svd(X_train, full_matrices=False)
```
```python
def df(D, lam):
    return np.sum(D**2 / (D**2 + lam))  # effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lambda)
```

*(Figure: ridge regression plot.)*
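The ridge solution is $\hat{\beta}^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y$, and the effective degrees of freedom follow from the singular values, $\mathrm{df}(\lambda) = \sum_j d_j^2/(d_j^2 + \lambda)$, which is what `df` above computes. A sketch that traces the coefficient profiles over a grid of $\lambda$ values and plots them against df(λ); the `ridge` helper is my own, `matplotlib` is an assumption, and the intercept is left out of the penalized fit, which differs slightly from taking the SVD of `X_train` as in the cell above:

```python
import matplotlib.pyplot as plt

def ridge(X_, y_, lam):
    # beta_hat(lambda) = (X'X + lambda I)^{-1} X'y
    p = X_.shape[1]
    return np.linalg.solve(X_.T.dot(X_) + lam * np.eye(p), X_.T.dot(y_))

Xp = X_train.drop('intercept', axis=1)        # penalize only the 8 standardized predictors
U, D, V = np.linalg.svd(Xp, full_matrices=False)

lambdas = np.logspace(-2, 4, 100)
coefs = np.array([ridge(Xp, y_train, lam) for lam in lambdas])
dfs = [df(D, lam) for lam in lambdas]

for j, name in enumerate(Xp.columns):
    plt.plot(dfs, coefs[:, j], label=name)    # one coefficient profile per predictor
plt.xlabel('df(lambda)')
plt.ylabel('Coefficient')
plt.legend()
plt.show()
```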
Ref:
James, Gareth, et al. *An Introduction to Statistical Learning*. Vol. 112. New York: Springer, 2013.
Hastie, Trevor, et al. "The Elements of Statistical Learning: Data Mining, Inference and Prediction." *The Mathematical Intelligencer* 27.2 (2005): 83–85.