Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     984.9
Date:                Sun, 20 Feb 2022   Prob (F-statistic):           1.01e-41
Time:                        20:03:58   Log-Likelihood:                 2.9617
No. Observations:                  50   AIC:                             2.077
Df Residuals:                      46   BIC:                             9.725
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0797      0.081     62.681      0.000       4.917       5.243
x1             0.4844      0.012     38.754      0.000       0.459       0.510
x2             0.4564      0.049      9.289      0.000       0.357       0.555
x3            -0.0185      0.001    -16.833      0.000      -0.021      -0.016
==============================================================================
Omnibus:                        3.355   Durbin-Watson:                   2.189
Prob(Omnibus):                  0.187   Jarque-Bera (JB):                2.665
Skew:                          -0.230   Prob(JB):                        0.264
Kurtosis:                       4.033   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
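
The quantities in the summary table are also available as attributes of the results object, for example:

olsres.params      # estimated coefficients
olsres.bse         # standard errors of the coefficients
olsres.rsquared    # R-squared of the fit
olsres.conf_int()  # confidence intervals for the coefficients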

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.6178953   5.06906139  5.4843106   5.83877051  6.11654502  6.31332605
  6.43710133  6.50684205  6.54938593  6.5950279   6.67254268  6.80445719
  7.00334947  7.26978209  7.59220977  7.94887622  8.31138862  8.64938252
  8.93551057  9.14993568  9.28359017  9.33966554  9.33308815  9.28806685
  9.23411553  9.20120409  9.21483442  9.29185058  9.43767133  9.64539983
  9.8969568  10.1660509  10.42249828 10.63718175 10.78683462 10.85786335
 10.84858097 10.76948419 10.64152886 10.49268697 10.35335091 10.25134014
 10.20733047 10.23145713 10.32164954 10.46397155 10.63491107 10.80524303
 10.94483104 11.02757864]
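
In-sample predictions computed this way coincide with the fitted values stored on the results object; a quick check:

np.allclose(ypred, olsres.fittedvalues)  # expected to be True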

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.02635524 10.90451208 10.67994673 10.39344552 10.0986976   9.84914999
  9.68492184  9.62298191  9.65299399  9.73984761]
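
If interval estimates are needed alongside the point predictions, get_prediction returns a prediction object whose summary_frame contains confidence and prediction intervals, for example:

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_lower/upper, obs_ci_lower/upper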

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7ffa0c666620>
[Plot: data, true values, and OLS in-sample/out-of-sample predictions]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2 in the formula.
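
The names of the constructed right-hand-side columns can be inspected on the fitted model; the I((x1 - 5) ** 2) term enters as a single numeric column:

res.model.exog_names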

[9]:
res.params
[9]:
Intercept           5.079682
x1                  0.484366
np.sin(x1)          0.456382
I((x1 - 5) ** 2)   -0.018471
dtype: float64

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.026355
1    10.904512
2    10.679947
3    10.393446
4    10.098698
5     9.849150
6     9.684922
7     9.622982
8     9.652994
9     9.739848
dtype: float64
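
The formula-based predictions agree with the array-based out-of-sample predictions computed earlier; a quick check:

np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)  # expected to be True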