Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.985
Model: OLS Adj. R-squared: 0.984
Method: Least Squares F-statistic: 984.9
Date: Sun, 20 Feb 2022 Prob (F-statistic): 1.01e-41
Time: 20:03:58 Log-Likelihood: 2.9617
No. Observations: 50 AIC: 2.077
Df Residuals: 46 BIC: 9.725
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0797 0.081 62.681 0.000 4.917 5.243
x1 0.4844 0.012 38.754 0.000 0.459 0.510
x2 0.4564 0.049 9.289 0.000 0.357 0.555
x3 -0.0185 0.001 -16.833 0.000 -0.021 -0.016
==============================================================================
Omnibus: 3.355 Durbin-Watson: 2.189
Prob(Omnibus): 0.187 Jarque-Bera (JB): 2.665
Skew: -0.230 Prob(JB): 0.264
Kurtosis: 4.033 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
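The printed summary is only one view of the fit; the same quantities are available as attributes and methods of the results object. A minimal sketch of the results API used above:
print(olsres.params)      # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)         # standard errors of the coefficients
print(olsres.conf_int())  # 95% confidence intervals by default
print(olsres.rsquared)    # coefficient of determination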
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.6178953 5.06906139 5.4843106 5.83877051 6.11654502 6.31332605
6.43710133 6.50684205 6.54938593 6.5950279 6.67254268 6.80445719
7.00334947 7.26978209 7.59220977 7.94887622 8.31138862 8.64938252
8.93551057 9.14993568 9.28359017 9.33966554 9.33308815 9.28806685
9.23411553 9.20120409 9.21483442 9.29185058 9.43767133 9.64539983
9.8969568 10.1660509 10.42249828 10.63718175 10.78683462 10.85786335
10.84858097 10.76948419 10.64152886 10.49268697 10.35335091 10.25134014
10.20733047 10.23145713 10.32164954 10.46397155 10.63491107 10.80524303
10.94483104 11.02757864]
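For a linear model, predict simply evaluates the linear predictor, so the in-sample predictions above are the fitted values X @ params. A minimal sanity check using the arrays defined earlier:
# In-sample predictions equal the design matrix times the estimated coefficients
assert np.allclose(ypred, X @ olsres.params)
assert np.allclose(ypred, olsres.fittedvalues)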
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.02635524 10.90451208 10.67994673 10.39344552 10.0986976 9.84914999
9.68492184 9.62298191 9.65299399 9.73984761]
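Point predictions carry no uncertainty information. If interval estimates are wanted, get_prediction returns an object whose summary_frame holds confidence and prediction intervals; a sketch along these lines:
# Confidence and prediction intervals for the out-of-sample points
pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # mean, mean_ci_*, obs_ci_* columns
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])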
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7ffa0c666620>
[Figure: data points, true curve, and OLS in- and out-of-sample predictions]
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any formula expansion magic applied to **2.
[9]:
res.params
[9]:
Intercept 5.079682
x1 0.484366
np.sin(x1) 0.456382
I((x1 - 5) ** 2) -0.018471
dtype: float64
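These estimates match the array-based fit from above; the formula interface only changes how the design matrix is built. A quick check, assuming both models were fit on the same y and x1:
# Formula-based and array-based coefficient estimates agree
assert np.allclose(res.params.values, olsres.params)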
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.026355
1 10.904512
2 10.679947
3 10.393446
4 10.098698
5 9.849150
6 9.684922
7 9.622982
8 9.652994
9 9.739848
dtype: float64
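The formula-based predictions can be overlaid on the earlier comparison figure in the same way; a minimal sketch reusing the variables defined above:
# Overlay the formula-based out-of-sample predictions on the comparison plot
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(x1n, res.predict(exog=dict(x1=x1n)), "rs", label="Formula OLS prediction")
ax.legend(loc="best")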