# Exploring a milk production Time Series

Time series models are used in a wide range of applications, particularly for forecasting, which is the goal of this example, performed in four steps:

– Explore the characteristics of the time series data.

– Decompose the time series into trend, seasonal components, and remainder components.

– Apply time series models.

– Forecast the production for a 12 month period.

Part I of this mini-series explored the basic terms of time series analysis.

# Load and clean the data

The dataset is the **production** amount of several **diary products** in California, month by month, for 18 years.

Our goal: forecast the next year production for one of those products: **milk.**

You can follow along with the associated notebook in GitHub.

As usual, first we load the data into a Pandas dataframe.

import pandas as pd X = pd.read_csv('../../datasets/cadairydata.csv', index_col=0) X.head()

X.info()

Int64Index: 228 entries, 1 to 228 Data columns (total 8 columns): Year.Month 228 non-null float64 Month.Number 228 non-null int64 Year 228 non-null int64 Month 228 non-null object Cotagecheese.Prod 228 non-null float64 Icecream.Prod 228 non-null float64 Milk.Prod 228 non-null float64 N.CA.Fat.Price 228 non-null float64 dtypes: float64(5), int64(2), object(1) memory usage: 16.0+ KB

The first step is to have a Pandas date column which could be used as the time index.

## Add Pandas date-time column

The fastest way is to convert year and month to strings, then create a new Pandas *datetime *column that combine them.

X["Year"] = X["Year"].astype(str) X["Month.Number"] = X["Month.Number"].astype(str) X["YearMonth"] = pd.to_datetime(X["Year"] + X["Month.Number"], format="%Y%m") # new column

Now can drop the other date columns:

X.drop(["Year","Month", "Year.Month"], axis=1, inplace=True)

and I prefer to rename some of the other columns, for clarity:

X.rename(columns={'Cotagecheese.Prod':'CottageCheeseProd', 'Icecream.Prod':'IceCreamProd', 'Milk.Prod':'MilkProd', 'N.CA.Fat.Price':'Price', 'Month.Number' : 'MonthNumber'}, inplace=True) X.head()

X.tail()

X.info()

Int64Index: 228 entries, 1 to 228 Data columns (total 6 columns): MonthNumber 228 non-null object CottageCheeseProd 228 non-null float64 IceCreamProd 228 non-null float64 MilkProd 228 non-null float64 Price 228 non-null float64 YearMonth 228 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(4), object(1) memory usage: 12.5+ KB

Now, we can perform both arithmetic and logical operations on datetime objects (“*YearMonth*“).

As an example, here is how to select the last 12 months of dairy data:

X[X.YearMonth > "2012-12-01"]

The Milk production should be in million of pounds.

X.isnull().values.any()

False

There are no NaNs. The dataset is ready to be explored.

# Explore the data

We concentrate now on the milk production (the *MilkProd* column).

## Create a Time Series Plot

First useful thing to do is always to plot the data:

import matplotlib.pyplot as plt %matplotlib inline

fig,ax = plt.subplots() ax.plot(X.YearMonth, X.MilkProd) ax.grid(True) ax.set_xlabel("Year") ax.set_ylabel("Milk") fig.suptitle("Time series of milk production");

For most of the time period shown, the production of milk increased year over year. However, there is a decline in milk production starting in 2009 as a results of the recession. Also, notice that this time series exhibits a strong seasonal component with an annual cycle.

Since the time series has both a visible trend and a seasonality it is clearly a **non-stationary** series.

## Statistical Properties of the Time Series

Autocorrelation is a fundamental property of time series. The **Autocorrelation Function** or ACF provides information on the dependency of the time series values of previous values.

import statsmodels.graphics.tsaplots as sm_pl sm_pl.plot_acf(X.MilkProd, lags=24);

Lags are the months in this case.

Note that the values of the ACF at the various lags decays only slowly. This indicates **there is considerable auto correlation** between the time series values at the various lags, mostly likely from the trend. Plotting a histogram provides information on the distribution of values of the time series.

fig,ax = plt.subplots() ax.hist(X.MilkProd, bins=40) ax.grid(True) ax.set_xlabel("Production") ax.set_ylabel("Frequency") fig.suptitle("Distribution of Milk Production");

The histogram of the full milk production time series shows considerable dispersion. Again this behaviour is likely **the result of the trend**.

## Stationarity: are mean and standard deviation constant over time?

From the plot is pretty clear that there is trend and seasonality, therefore is not stationary. Anyway, can be checked, via the *rolling()* function from *Pandas*:

rollingMean = X.MilkProd.rolling(window=12).mean() rollingStd = X.MilkProd.rolling(window=12).std() fig,ax1 = plt.subplots() ax1.plot(X.MilkProd, color="black", label="Milk production") ax1.plot(rollingMean, color="red", label="Average") ax1.grid(True) ax1.set_xlabel("Time in months") ax1.set_ylabel("Milk") ax1.legend(loc="upper left") ax2 = ax1.twinx() # Create a twin Axes sharing the xaxis ax2.plot(rollingStd, color = "blue", label = "StD") ax2.set_ylabel("Standard Deviation") ax2.legend(loc="lower right") ax2.set_ylim(0,0.5) fig.suptitle("Milk - Average and Standard Deviation");

The mean is clearly increasing over time (the production is rising) but also the standard deviation of milk production increased in the time frame.

# Decomposition of the Time Series

Time series are typically decomposed into three components: trend, seasonal, and the remainders, or residuals.

Trend can be modelled by several methods. We will start by decomposing the time series using a simple moving average model.

## Trend

The code in the cell below uses moving window method to compute the average of the time series over specified span, or order of the operator. As the moving window operator moves over the data, the average of the values in the windows is calculated.

import numpy as np def movingAverage(values, order): end = len(values) out = np.zeros(len(values)) out[0] = values[1] for i in range(1,end): if (i - order <= 1): j = 1 else: j += 1 out[i] = sum(values[j-1:i]) / (i-j+1) return out

One of the best ways to make a series stationary on variance is through transforming the original series through log transform. We go back to our original data series and add a new column, the logarithm of the milk production:

X['LogMilk'] = pd.Series(np.log(X.MilkProd), index = X.index)

Now we perform a 12 months moving average, to extract the trend:

trendM = movingAverage(X.LogMilk, 12)

fig,ax = plt.subplots()

ax.plot(X.LogMilk, color="black", label="Milk production")

ax.plot(trendM, color="blue", label="Trend")

ax.grid(True)

ax.set_xlabel("Time in months")

ax.set_ylabel("Log(Milk)")

ax.legend(loc="upper left")

fig.suptitle("Milk & trend – log");

### Get the Trend with Lowess

Having tried a simple moving average decomposition, we will now use a Lowess model to determine the trend. Lowess is a sophisticated non-linear regression. The *statsmodels* package includes a function to calculate the Lowess regression:

import statsmodels.nonparametric.smoothers_lowess as sm trendL = sm.lowess(X.LogMilk, X.YearMonth, frac=0.25,return_sorted=False) fig,ax = plt.subplots() ax.plot(X.YearMonth, X.LogMilk) ax.plot(X.YearMonth, trendM, label='Trend from Moving Average') ax.plot(X.YearMonth, trendL, label="Trend from Lowess") ax.set_xlabel("Year") ax.set_ylabel("Log(Milk)") ax.legend(loc="lower right") fig.suptitle("Trends extracted from milk production");

The time series charts show the original time series along with the two trends. The trend from Lowess regression is a bit smoother than the one was obtained with the simple moving average decomposition.

## Seasonality

Now we apply a linear regression on the Milk production using the months as variables to discover the seasonality. But first need to convert the months into categorical variables using the One Hot Encoder (OHE) approach.

monthsOHE = pd.get_dummies(X.MonthNumber)

The order doesn’t matter in this case (being numerical, October and November come before February but it’s ok).

prodWithoutTrend = X.LogMilk - trendM # the difference

fig,ax = plt.subplots() ax.plot(prodWithoutTrend) ax.grid(True) ax.set_xlabel("Time in months") ax.set_ylabel("Milk production without trend") fig.suptitle("Difference looks more stationary");

The milk production – when the trend is removed – seems more stationary, without any immediate pattern. We create now a temporary data frame to perform the regression:

# a new tmp data frame for decomposition decX = pd.DataFrame({'LogMilkNoTrend': prodWithoutTrend}) decX = decX.join(monthsOHE)

The regression is applied to the production without trend:

from sklearn import linear_model model = linear_model.LinearRegression(fit_intercept = False) model.fit(monthsOHE, decX.LogMilkNoTrend)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

It’s a model with 12 variables (one for each month, keep in mind that they are not in chronological order):

model.coef_

array([ 0.02471208, 0.01099585, -0.02349834, 0.01714495, -0.04996016, 0.06783954, 0.03872409, 0.06733682, 0.01530435, 0.02160618, 0.01572798, -0.02670753])

We get the seasonality by predicting the value month by month:

seasonalM = model.predict(monthsOHE)

## Residuals

If we remove the seasonality from the difference between milk production and the trend, we will get the residuals or remainders:

remainderM = prodWithoutTrend - seasonalM

Finally, we put all together in a data frame that contains all decomposition elements: trend, seasonality and residuals

# a data frame with all decompositions decomposed = pd.DataFrame({'t': trendM, 's': seasonalM, 'r': remainderM}) decomposed.head()

r | s | t | |
---|---|---|---|

1 | -0.024712 | 0.024712 | 0.747635 |

2 | -0.039119 | -0.049960 | 0.747635 |

3 | 0.000099 | 0.067840 | 0.703096 |

4 | -0.008344 | 0.038724 | 0.725742 |

5 | -0.000018 | 0.067337 | 0.733337 |

fig,ax = plt.subplots(4, sharex=True, figsize=(15,15)) ax[0].plot(X.MilkProd) ax[0].set_title('Original') ax[1].plot(decomposed.t) ax[1].set_title('Trend') ax[2].plot(decomposed.s) ax[2].set_title('Seasonality') ax[3].plot(decomposed.r) ax[3].set_title('Residuals');

As we have seen, the trend is of an increasing milk production, beside the last months when it is stable; while there is a clear yearly seasonality.

The remainders in an ideal situation are like white noise, i.e. every pattern has been removed and what is left is only the random variability.

## Decomposing using statsmodels

The *statsmodels* package has also a function to decompose a time series:

from statsmodels.tsa.seasonal import seasonal_decompose # need to pass the time series as array (values) because of error # in statsmodels, should be fixed in next version 0.9 decomposition = seasonal_decompose(X.MilkProd.values, freq=12) trend = decomposition.trend seasonal = decomposition.seasonal residual = decomposition.resid

fig,ax = plt.subplots(4, sharex=True, figsize=(15,15)) ax[0].plot(X.MilkProd) ax[0].set_title('Original') ax[1].plot(trend) ax[1].set_title('Trend') ax[2].plot(seasonal) ax[2].set_title('Seasonality') ax[3].plot(residual) ax[3].set_title('Residuals');

You can see the trend and seasonal components clearly separated in the above plots. The remainder plot looks fairly random, as expected.

## Is now stationary?

The question remains, is the remainder from this decomposition stationary? To find out, we plot its *ACF*:

sm_pl.plot_acf(decomposed.r, lags=24);

sm_pl.plot_pacf(decomposed.r, lags=24);

The ACF has at least 5 significant lag values, indicating the remainder is NOT, in fact, stationary.

Compared to the behavior of the ACF for the simple moving average decomposition, the behavior of the remainder is improved.

Now we plot the histogram of the remainder:

fig,ax = plt.subplots() ax.hist(decomposed.r, bins=40) ax.grid(True) ax.set_xlabel("Residuals") ax.set_ylabel("Frequency") fig.suptitle("Distribution of Residuals");

The distribution of the remainder values is much closer to a Normal distribution than for the original time series we created earlier.

This result, combined with the ACF plot, indicates that the decomposition is effective.

We can further investigate the remainder (non-seasonal residual) component by making a box plot by month of the year:

remaindersByMonth = pd.DataFrame({'resid': decomposed.r, 'month': X["MonthNumber"]}) remaindersByMonth.boxplot(by='month');

The remainder component shows only limited variation from month to month. The differences are within the interquartile range, indicating that the seasonal model is a reasonably good fit.

# Model a tme series

Now that we have explored the decomposition of the time series we will now construct and test Autoregressive Moving Average (ARMA) models for the remainder of the time series.

We will create and test these models in three steps, first creating a moving average (MA) model, then creating an autoregressive (AR) model and finally an autoregressive moving average (ARMA) model.

## Moving Average Models

The ACF of the remainder from the decomposition of the milk production time series had at least 5 significant lag values. As an initial model, we therefore now create an MA model of order 5 and examine the model summary.

from statsmodels.tsa.arima_model import ARIMA modelMA5 = ARIMA(decomposed.r.values, order=(0,0,5)) resultsMA5 = modelMA5.fit(disp=0) # disp: do not display covergence info resultsMA5.summary()

Dep. Variable: | y | No. Observations: | 228 |
---|---|---|---|

Model: | ARMA(0, 5) | Log Likelihood | 629.357 |

Method: | css-mle | S.D. of innovations | 0.015 |

Date: | Sun, 17 Jun 2018 | AIC | -1244.713 |

Time: | 14:24:03 | BIC | -1220.708 |

Sample: | 0 | HQIC | -1235.028 |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

const | -6.083e-05 | 0.003 | -0.018 | 0.985 | -0.007 | 0.006 |

ma.L1.y | 0.7585 | 0.074 | 10.264 | 0.000 | 0.614 | 0.903 |

ma.L2.y | 0.6238 | 0.086 | 7.278 | 0.000 | 0.456 | 0.792 |

ma.L3.y | 0.4338 | 0.070 | 6.205 | 0.000 | 0.297 | 0.571 |

ma.L4.y | 0.2543 | 0.073 | 3.488 | 0.001 | 0.111 | 0.397 |

ma.L5.y | 0.2421 | 0.077 | 3.139 | 0.002 | 0.091 | 0.393 |

Examining the values of the model coefficients and their standard errors (SE), we notice that the SE of the last two coefficients is the same order than the value of the coefficient itself. This may indicate that the value of this coefficient is poorly determined and could likely be set to zero.

The foregoing result indicates that the order of the MA model could be reduced. Generally, the order of an MA model is reduced in unit steps until all the coefficients appear to be significant.

We try with 3 coefficients and examine the model summary.

modelMA3 = ARIMA(decomposed.r.values, order=(0,0,3)) resultsMA3 = modelMA3.fit(disp=0) resultsMA3.summary()

Dep. Variable: | y | No. Observations: | 228 |
---|---|---|---|

Model: | ARMA(0, 3) | Log Likelihood | 623.018 |

Method: | css-mle | S.D. of innovations | 0.016 |

Date: | Sun, 17 Jun 2018 | AIC | -1236.036 |

Time: | 14:24:53 | BIC | -1218.889 |

Sample: | 0 | HQIC | -1229.118 |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

const | -2.526e-05 | 0.003 | -0.009 | 0.993 | -0.006 | 0.006 |

ma.L1.y | 0.8000 | 0.063 | 12.608 | 0.000 | 0.676 | 0.924 |

ma.L2.y | 0.6155 | 0.061 | 10.074 | 0.000 | 0.496 | 0.735 |

ma.L3.y | 0.3563 | 0.065 | 5.459 | 0.000 | 0.228 | 0.484 |

The AIC is slightly lower and the standard error compared to the magnitude of the coefficients is also slightly improved; indicates that the order of the model is more reasonable but probably not yet perfect.

To test how well this model fits the data, and results in a stationary result, we plot the ACF of the residuals of the MA(3) model:

sm_pl.plot_acf(resultsMA3.resid, lags=24);

sm_pl.plot_pacf(resultsMA3.resid, lags=24);

## Autoregressive Models

Next, we test an autoregressive model. The PACF of the reminder indicates that an AR model might not be the best choice. None the less, a low order AR model might fit these data.

In general, AR orders will tend to present themselves by a sharp cutoff in the PACF plot and a slow trending or sinusoidal degradation in the ACF plot. The opposite is usually true for MA orders.

modelAR2 = ARIMA(decomposed.r.values, order=(2,0,0)) resultsAR2 = modelAR2.fit(disp=0) resultsAR2.summary()

Dep. Variable: | y | No. Observations: | 228 |
---|---|---|---|

Model: | ARMA(2, 0) | Log Likelihood | 632.334 |

Method: | css-mle | S.D. of innovations | 0.015 |

Date: | Sun, 17 Jun 2018 | AIC | -1256.668 |

Time: | 14:25:50 | BIC | -1242.951 |

Sample: | 0 | HQIC | -1251.133 |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

const | -0.0005 | 0.005 | -0.099 | 0.921 | -0.011 | 0.010 |

ar.L1.y | 0.7742 | 0.066 | 11.709 | 0.000 | 0.645 | 0.904 |

ar.L2.y | 0.0389 | 0.066 | 0.584 | 0.560 | -0.091 | 0.169 |

The AIC value is the worst among the models tried.

Examining the values of the coefficients and their standard errors, we note that the standard error of the second coefficient is actually bigger than the coefficient. Clearly, the AR(2) model is over parameterized.

Next, we try an AR(1) model.

modelAR1 = ARIMA(decomposed.r.values, order=(1,0,0)) resultsAR1 = modelAR1.fit(disp=0) resultsAR1.summary()

Dep. Variable: | y | No. Observations: | 228 |
---|---|---|---|

Model: | ARMA(1, 0) | Log Likelihood | 632.163 |

Method: | css-mle | S.D. of innovations | 0.015 |

Date: | Sun, 17 Jun 2018 | AIC | -1258.327 |

Time: | 14:26:22 | BIC | -1248.039 |

Sample: | 0 | HQIC | -1254.176 |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

const | -0.0004 | 0.005 | -0.088 | 0.930 | -0.010 | 0.009 |

ar.L1.y | 0.8055 | 0.039 | 20.777 | 0.000 | 0.729 | 0.881 |

The standard error of the AR(1) model is an order of magnitude less than the value of the coefficient, which is promising.

sm_pl.plot_acf(resultsAR1.resid, lags=24);

sm_pl.plot_pacf(resultsAR1.resid, lags=24);

Note that only the 0 lag of the ACF is significant and that there are no significant lags for the PACF. These observations indicate that the AR(1) model is a good fit. Compare these results to those of the MA(3) model, noting that they are nearly identical. Either the MA(3) or AR(1) model could be a good choice for these data.

## Autoregressive Moving Average (ARMA) Models

We have found that both MA(3) and AR(1) models are possibly good fits to the remainder series.

We will now investigate the use of autoregressive Integrated moving average (ARIMA) models on the remainder series, a combination of both MA and AR models.

As a starting point we try an ARIMA(1,3) model.

modelARMA = ARIMA(decomposed.r.values, order=(1,0,3)) resultsARMA = modelARMA.fit(disp=0) resultsARMA.summary()

Dep. Variable: | y | No. Observations: | 228 |
---|---|---|---|

Model: | ARMA(1, 3) | Log Likelihood | 632.544 |

Method: | css-mle | S.D. of innovations | 0.015 |

Date: | Sun, 17 Jun 2018 | AIC | -1253.087 |

Time: | 14:42:32 | BIC | -1232.511 |

Sample: | 0 | HQIC | -1244.785 |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

const | -0.0005 | 0.005 | -0.092 | 0.927 | -0.010 | 0.009 |

ar.L1.y | 0.8033 | 0.069 | 11.705 | 0.000 | 0.669 | 0.938 |

ma.L1.y | -0.0279 | 0.097 | -0.287 | 0.774 | -0.218 | 0.163 |

ma.L2.y | 0.0548 | 0.084 | 0.649 | 0.517 | -0.111 | 0.220 |

ma.L3.y | -0.0116 | 0.081 | -0.144 | 0.886 | -0.170 | 0.147 |

In each case, the standard error is of the same order of magnitude as the value of the coefficient, indicating this model is a poor fit to the data.

### Automatic search for the best ARIMA parameters

There are many guidelines and best practices to get the most suitable parameters for an ARIMA model, yet the correct parametrisation of ARIMA models can be a painstaking manual process that requires domain expertise and time. Other statistical programming languages such as R provide automated ways to solve this issue, but not yet in Python.

We can solve this issue by writing Python code to programmatically select the optimal parameter values for our ARIMA(p,d,q) time series model.

We would perform a grid search, a sort of hyper-parameter tuning. We can loop through non-trivial values of the ARIMA orders and see which model has the lowest *Akaike Information Criterion* (AIC).

import itertools # Define the p and q parameters to take any value between 0 and 3 p = q = range(0, 4) # Generate all different combinations of p, d=0 and q triplets pdq = [(a,0,b) for a,b in itertools.product(p, q)]

best_score = np.inf best_pdq = None temp_model = None for params in pdq: try: temp_model = ARIMA(decomposed.r.values, order=params) resultsARMA = temp_model.fit(disp=0) if resultsARMA.aic NEW best AIC: ", best_score) except: #print("Unexpected error:", sys.exc_info()[0]) continue

(0, 0, 0) -> NEW best AIC: -1020.86399991 (0, 0, 1) -> NEW best AIC: -1157.4900345419903 (0, 0, 2) -> NEW best AIC: -1207.5618740712084 (0, 0, 3) -> NEW best AIC: -1236.036064300405 (1, 0, 0) -> NEW best AIC: -1258.326773880575

The lowest AIC is from an ARIMA(1,0) model.

bestAICmodel = ARIMA(decomposed.r.values, order=(1,0,0)) resultsARMA = bestAICmodel.fit(disp=0) resultsARMA.summary()

Dep. Variable: | y | No. Observations: | 228 |
---|---|---|---|

Model: | ARMA(1, 0) | Log Likelihood | 632.163 |

Method: | css-mle | S.D. of innovations | 0.015 |

Date: | Sun, 17 Jun 2018 | AIC | -1258.327 |

Time: | 14:44:00 | BIC | -1248.039 |

Sample: | 0 | HQIC | -1254.176 |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

const | -0.0004 | 0.005 | -0.088 | 0.930 | -0.010 | 0.009 |

ar.L1.y | 0.8055 | 0.039 | 20.777 | 0.000 | 0.729 | 0.881 |

The standard error of the ar1 coefficient is good.

Next, we plot the ACF and PACF of the model.

sm_pl.plot_acf(resultsARMA.resid, lags=24);

sm_pl.plot_pacf(resultsARMA.resid, lags=24);

fig,ax = plt.subplots() ax.hist(resultsARMA.resid, bins=70) ax.grid(True) ax.set_xlabel("Residuals") ax.set_ylabel("Frequency") fig.suptitle("Distribution of Residuals");

np.mean(resultsARMA.resid)

8.5725149050511039e-05

The mean in the residuals is almost zero showing that the bias has been almost totally removed.

## How does the model behave?

Now we could split the initial dataset into the usual train and test sets and check how the trained model behaves on the test set.

# split into train and test (last two years) sets testSize = 24 milkData = X.LogMilk.values train, test = milkData[0:-testSize], milkData[-testSize:]

We use a one-step-ahead training. For each new observation in the test dataset, a prediction is claculated, then it is added back to the training test and used to improve the model.

Here is an example on the last model that we checked, the ARIMA(1,0).

history = [x for x in train] predictions = list() for t in range(len(test)): modelARMA = ARIMA(history, order=(1,0,0)) resultsARMA = modelARMA.fit(disp=0) yhat = resultsARMA.forecast()[0][0] # one step forecast predictions.append(yhat) history.append(test[t])

rmse = np.sqrt(sum((test - predictions)**2)/testSize) rmse

0.050296022575550173

The Rooted Mean Square Error (a classic metric to measure the model accuracy) is quite low.

Let’s see the predictions plotted:

predictions.pop(0) # remove the first element fig,ax = plt.subplots() ax.plot(test, label="Observed") ax.plot(predictions, label='Predictions', color='#ff0066'); ax.set_xlabel("Months") ax.legend(loc='upper right') fig.suptitle("Last months milk production");

The ARMA(1,0) is not bad: the predictions are quite good and the RMSE is quite low.

## Seasonal ARIMA model

The *statsmodels* package (from version 0.8) includes a function to model a seasonal arima.

We see how it works first using as parameters AR=1 and MA=0, plus the difference set to 1 to indicate that we want a data transformation to remove the trend and seasonality.

import statsmodels.api as sm_sa firstModel = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values, order=(1,1,0), seasonal_order=(1,1,0,12)) results = firstModel.fit() results.summary()

Dep. Variable: | y | No. Observations: | 228 |
---|---|---|---|

Model: | SARIMAX(1, 1, 0)x(1, 1, 0, 12) | Log Likelihood | 551.410 |

Date: | Sun, 17 Jun 2018 | AIC | -1096.821 |

Time: | 18:30:54 | BIC | -1086.533 |

Sample: | 0 | HQIC | -1092.670 |

– 228 | |||

Covariance Type: | opg |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

ar.L1 | -0.1913 | 0.068 | -2.832 | 0.005 | -0.324 | -0.059 |

ar.S.L12 | -0.5278 | 0.059 | -8.995 | 0.000 | -0.643 | -0.413 |

sigma2 | 0.0003 | 2.75e-05 | 12.378 | 0.000 | 0.000 | 0.000 |

Ljung-Box (Q): | 105.30 | Jarque-Bera (JB): | 8.26 |
---|---|---|---|

Prob(Q): | 0.00 | Prob(JB): | 0.02 |

Heteroskedasticity (H): | 0.81 | Skew: | 0.08 |

Prob(H) (two-sided): | 0.39 | Kurtosis: | 3.95 |

Again we use a “grid search” to iteratively explore different combinations of parameters. For each combination of parameters, we fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality.

This time is a double loop so it will take much longer time.

# Define the p and q parameters to take any value between 0 and 3 p = q = range(0, 4) sp = sq = range(0, 3) # Generate all different combinations of p, d=1 and q triplets pdq = [(a,1,b) for a,b in itertools.product(p, q)] # Generate all different combinations of seasonal p, q and q triplets seasonal_pdq = [(a, 1, b, 12) for a,b in list(itertools.product(sp, sq))]

best_score = np.inf best_pdq = None best_season_pdq = None temp_model = None for params in pdq: for params_seasonal in seasonal_pdq: try: temp_model = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values, order = params, seasonal_order = params_seasonal) results = temp_model.fit() if results.aic NEW best AIC: ", best_score) except: continue

(0, 1, 0) (0, 1, 1, 12) -> NEW best AIC: -1145.67087571 (0, 1, 0) (2, 1, 1, 12) -> NEW best AIC: -1146.24623229 (0, 1, 1) (0, 1, 1, 12) -> NEW best AIC: -1147.77581717 (0, 1, 1) (2, 1, 1, 12) -> NEW best AIC: -1147.84213857 (1, 1, 1) (0, 1, 1, 12) -> NEW best AIC: -1147.96663351 (1, 1, 1) (2, 1, 1, 12) -> NEW best AIC: -1148.79899832

The model with lowest AIC is the (1,1,1)(2,1,1) but they have all very close AIC values so it could be worth to check them more in detail one by one.

Here we just take the lowest AIC:

bestModel = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values, order=(1,1,1), seasonal_order=(2,1,1,12)) results = bestModel.fit() results.summary()

Dep. Variable: | y | No. Observations: | 228 |
---|---|---|---|

Model: | SARIMAX(1, 1, 1)x(2, 1, 1, 12) | Log Likelihood | 580.399 |

Date: | Sun, 17 Jun 2018 | AIC | -1148.799 |

Time: | 18:47:26 | BIC | -1128.223 |

Sample: | 0 | HQIC | -1140.497 |

– 228 | |||

Covariance Type: | opg |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

ar.L1 | 0.5334 | 0.265 | 2.013 | 0.044 | 0.014 | 1.053 |

ma.L1 | -0.6859 | 0.230 | -2.984 | 0.003 | -1.136 | -0.235 |

ar.S.L12 | -0.1900 | 0.133 | -1.429 | 0.153 | -0.450 | 0.071 |

ar.S.L24 | -0.1710 | 0.095 | -1.791 | 0.073 | -0.358 | 0.016 |

ma.S.L12 | -0.7035 | 0.097 | -7.241 | 0.000 | -0.894 | -0.513 |

sigma2 | 0.0002 | 1.77e-05 | 14.007 | 0.000 | 0.000 | 0.000 |

Ljung-Box (Q): | 45.23 | Jarque-Bera (JB): | 27.85 |
---|---|---|---|

Prob(Q): | 0.26 | Prob(JB): | 0.00 |

Heteroskedasticity (H): | 0.84 | Skew: | 0.02 |

Prob(H) (two-sided): | 0.45 | Kurtosis: | 4.76 |

The *coef* column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The *P>|z|* column informs us of the significance of each feature weight. Here, a couple of weights have a p-value higher than 0.05, so it could be reasonable to remove from the model.

The *results* object has a convenient diagnostic method named *plot_diagnostics()* to get a rundown of the fitted model.

results.plot_diagnostics(figsize=(16, 10));

The residual plot of the fitted model in the upper left corner appears to be white noise as it does not display obvious seasonality or trend behaviour.

The histogram plot in the upper right corner pair with the kernel density estimation (red line) indicates that the time series is almost normally distributed. This is compared to the density of the standard normal distribution (green line).

Not a perfect normal distribution tough, as the Q-Q chart in the bottom left shows.

The correlogram (autocorrelation plot) confirms this results, since the time series residuals show low correlations with lagged residuals.

The model produces a satisfactory fit that describes the underlying time series data well enough to forecast future values. Although the fit so far appears to be fine, a better fit could be achieved with a more complex model, i.e. a better model might have been von when increasing the parameter space for the grid search.

# Forecast

It is easy to forecast values using the previously fitted model.

The *get_prediction* and *conf_int* methods calculate predictions for future points in time for the previously fitted model and the confidence intervals associated with a prediction, respectively.

## the last 12 months

First, we get the predictions for the last 12 months, to see how the model behaves:

HISTORIC = X.shape[0] PRED = 12

last12m = results.get_prediction(start=HISTORIC-PRED+1, end=HISTORIC, dynamic=False, full_results=True)

The *dynamic=False* argument ensures that we produce one-step ahead forecasts, meaning that forecasts at each point are generated using the full history up to that point.

fig,ax = plt.subplots() ax.plot(X.YearMonth[-PRED:], X.LogMilk[-PRED:], label="Observed") ax.plot(X.YearMonth[-PRED:], last12m.predicted_mean, label='Modelled', color='#ff0066'); # avoid overlapping of ticks on x ax plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right') ax.set_xlabel("Time") ax.legend(loc='upper right') fig.suptitle("Last year milk production");

rmse = ((last12m.predicted_mean - X.LogMilk[-PRED:]) ** 2).mean() rmse

0.0021989614592391187

Both the plot and the RMSE are quite good.

## the next 24 months

To get the forecast we can call the *get_forecast* function:

FORECAST = 24 next24m = results.get_forecast(steps=FORECAST) next24m_ci = next24m.conf_int()

There is a point forecast (the expected value) along with 95 percent confidence intervals. Notice, that the confidence intervals generally get wider for forecasts further out in time. It is hardly suprising that the forecast has more uncertaintly as time increases from the present.

Finally, we can plot the forecast but need first to create new dataframe with the related dates, for convenience.

futureDates = pd.date_range(start = '2014-01-01', periods=FORECAST, freq='MS') next24m_ci.index = futureDates futureDF = pd.DataFrame(next24m.predicted_mean, index=futureDates) futureDF.head()

0 | |
---|---|

2014-01-01 | 1.258927 |

2014-02-01 | 1.193670 |

2014-03-01 | 1.304142 |

2014-04-01 | 1.279353 |

2014-05-01 | 1.308496 |

fig,ax = plt.subplots() ax.plot(X.YearMonth, X.LogMilk, label="Observed") ax.plot(futureDF, label='Forecast', color='#ff0066'); # draw confidence bound ax.fill_between(next24m_ci.index, next24m_ci.iloc[:, 0], next24m_ci.iloc[:, 1], color='#ff0066', alpha=.25); ax.set_xlabel("Year") ax.set_ylabel("Milk") ax.legend(loc='upper left') fig.suptitle("Milk production with forecast");

The original time series of milk production is show in blue in the plot above. The forecast is show in red. The 95 percent confidence intervals are shown in lighter shades of red.

It is obvious that, the further we try to extrapolate the time series evolution into the future, the less confident our prediction becomes. Hence, the confidence bound widens in the course of time.

—

You can try yourself to decompose the Icecream.Prod column of the diary data frame and answer the following questions:

-Does icecream production have a noticable seasonal component or are the values all close to the average over time? Is there a strong seasonal component for icecream production?

-Does the acf plot indicate that the remainder series is stationary?

-Do the values in the histogram have an approximately normal distribution?

-Does the interquartile range for each month overlap, indicating that the decomposition has produced a reasonably good model of the seasonal variation?

Note: this post is part of a series about Machine Learning with Python.

Pingback: Introduction to time series – Part I: the basics – Look back in respect