Introduction to time series – Part I: the basics

A Time series is a data set collected through time.

What makes it different from other datasets that we used for regular regression problems are two things:

  1. It is time dependent. So the basic assumption of a linear regression model that the observations are independent doesn’t hold in this case.
  2. Most time series have some form of trend – either an increasing or decreasing trend – or some kind of seasonality pattern, i.e. variations specific to a particular time frame.

Basically, this means that the present is correlated with the past.
A value at time T is correlated with the value at T minus 1 but it may also correlated with the value at time T minus 2, maybe not quite as much as T minus 1.
And even at 20 times steps behind, we could still know something about the value of T because they’re still correlated, depending on which kind of time series it is.
And this obviously is not true with normal random data.

Time series are everywhere, for example in:

  • Financial data (stocks, currency exchange rates, interest rates)
  • Marketing (click-through rates for web advertising)
  • Economics (sales and demand forecasts)
  • Natural phenomenon (water flow, temperature, precipitation, wind speed, animal species abundance, heart rate)
  • Demographic and population and so on.

What might you want to do with time series?

  • Smoothing – extract an underlying signal (a trend) from a noise.
  • Modelling – explain how the time series arose, for intervention.
  • Forecasting – predict the values of the time series in the future.

We first see here which specific characteristics the Time Series (TS) have, and will then see in a second part a concrete example of TS analysis (smoothing + modelling + forecasting).

You can follow along with the associated notebook in GitHub.

Stochastic vs. deterministic model

In a deterministic process you know exactly where the next point is, in the stochastic process there is some randomness at every point.

A deterministic model is one whose behaviour is entirely predictable. Every set of variable states is uniquely determined by parameters in the model and by sets of previous states of these variables. Therefore, these models perform the same way for a given set of initial conditions, and it is possible to predict precisely what will happen.

A stochastic model is one in which randomness is present and variable states are not described by unique values but rather by probability distributions. The behaviour of this model cannot be entirely predicted.

For example, a deterministic process can be the sine wave function y = sin(x)

import pandas as pd
import numpy as np


n = 100
x = np.linspace(0, 2 * np.pi, n)
y = np.sin(x)

import matplotlib.pyplot as plt

fig,ax = plt.subplots()
fig.suptitle("A deterministic process")


While a purely random process, e.g. the white noise, where every step is randomly taken from a normal distribution is an example of stochastic process:

whiteNoise = np.random.standard_normal(100) # normal distribution mean=0, std=1

fig,ax = plt.subplots()
fig.suptitle("A discrete white noise")


As you see, it is basically impossible to predict the value of the next step.

Assuming that our time series are a realisation of a stochastic  process (a sequence of random variables X1, X2, … ) will allow us – if we know something about the stochastic process and at least the initial data point – to predict the next data points of the time series. That is, we are going to assume that there is some underlying generating process for our time series based on one or more statistical distributions from which these variables are drawn.

Let’s see another example: we inject some randomness into the above sine wave:

n = 100
x = np.linspace(0, 2 * np.pi, n)
noisyWave = np.sin(x) + 0.3 * np.random.randn(n)

fig,ax = plt.subplots()
fig.suptitle("A stochastic process: a noisy sine wave")


There is randomness and we cannot say exactly where the next point will be but we can also see that there is an underlying generating process: the sine wave.
We can now see some tools and methods to model this generating process.


In a (weak) stationary time series, there is

  • No systematic change in mean
  • No systematic change in variance
  • No periodic variations, fluctuations

This means that the properties of one section of a data are much like the properties of the other sections of the data.

But why is this important? Most of the TS models work on the assumption that the TS is stationary. Intuitively, we can sat that if a TS has a particular behaviour over time, there is a very high probability that it will follow the same in the future.

Signal or noise?

One of the challenge is then to extract signal from noise. The autocorrelation will help us.

We have already seen concepts like mean, variance and correlation.
A short summary:

  • The expected value E(x) of a random variable x is its average value (mean) in the population.
  • The variance of a random variable is its “spread“: the expectation of the squared deviations of the variable from the mean.
  • The covariance of two random variables x and y,  tells us how the two variables move together. It measures the linear dependence between two random variables.
  • Correlation is a dimensionless measure of how two variables vary together, or “co-vary”. In essence, it is the covariance of two random variables normalised by their respective spreads.
def getCovariance(x, y):
    # returns the covariance of two data sets

        # Subtract the subseries means
    return, (y-np.mean(y))) / len(x)

One of the most important aspects of time series is the autocorrelation, how sequential observations in a time series affect each other.
Let’s see it in details.

Autocovariance and autocorrelation

There are many situations where consecutive elements of time series will possess correlation. That is, the behaviour of sequential points in the remaining series affect each other in a dependent manner.
The idea is to identify the structure of these correlations, as they will allow us to markedly improve the models and forecasts.

Auto means self and the autocovariance is the covariance of X (a stochastic process) with a lagged version of itself, and here h is the lag:

autocovariance_{h} (X_{t}) =Cov(X_{t}, X_{t-h})

def getAutoCovariance(x, lag):
    # return the covariance of one data set by
    # slicing it based on the given lag

        # slice into two sub sets
    x1 = x[:(len(x)-lag)] # from begin till (end-lag)
    x2 = x[lag:]  # from lag till the end

        # get the covariance of the two sub-sets
    covariance = getCovariance(x1, x2)

        # return the covariance but also the
        # standard deviations of the two subsets
    return (covariance, np.std(x1), np.std(x2))

Now autocorrelation is, by definition, the normalised version of the autocovariance:

autocorrelation_{h} (X_{t}) =\frac{autocovariance_{h} (X_{t})}{Std(X_{t})\cdot Std(X_{t-h})}

As with the correlation, the largest value is one and the lowest value is minus one.

We know how to compute the covariance but we don’t know the values of the variables X, they are random variables.
But if the signal is weakly stationary, then the variance and therefore the standard deviation are constant over time.
Which means that the standard deviations at the denominator in the formula above are exactly the same, because the standard deviation doesn’t change over time.

Therefore on the bottom you get Std(X_{t}) \cdot Std(X_{t-h}) = Std(X_{t})^{2}

If the signal is weakly stationary, it means that this autocorrelation is constant over time.
And because of that, it means that we can use the data to create a sample version of the autocorrelation. We just take subsequent slices or lags (the samples) of the data series.

def autocorrelation_var(x, lag):
      # get the covariance by slicing
    (covariance, _, _) = getAutoCovariance(x, lag)

      # Normalize with **variance** of whole series
    return covariance /  np.var(x)

def autocorrelation_std(x, lag):
         # get the covariance by slicing
    (covariance, std_x1, std_x2) = getAutoCovariance(x, lag)

        # Normalize with the subseries **standard deviations**
    return covariance / (std_x1 * std_x2)

The statsmodels package has also an autocorrelation function (acf) which calculates for each lag the autocorrelation. And also Pandas has one, the autocorr.

Let’s see them all together, applied to the white noise process:

import statsmodels.tsa.stattools as smt

results = {}
results["Slices_variance"] = [autocorrelation_var(whiteNoise, lag) for lag in range(nlags)]
results["Slices_Std"] = [autocorrelation_std(whiteNoise, lag) for lag in range(nlags)]
results["Pandas"] = [pd.Series(whiteNoise).autocorr(lag) for lag in range(nlags)]
results["Statmodels"] = smt.acf(whiteNoise, unbiased=True, nlags=nlags-1)

pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
plt.ylim([-0.3, 1.2])
plt.suptitle("Different ways to compute autocorrelation")

output_25_0Pandas is using the variance, while statsmodels is using the std.

The autocorrelation coefficient

An important concept for modelling is the autocorrelation coefficient.
Let’s call \gamma _{k} the autocovariance between x at time t and x at time t plus k. And since we assume the weak stationarity, it doesn’t matter what t is.
We would have same \gamma _{k} as long as the distance between these two random variables is k.

And we define c_{k}   its estimation (because we do not have a statistic process, we have a time series).

The autocorrelation coefficient rho(k) between Xt and Xt+k is :

\rho_{k} = \frac{\gamma _{k}}{\gamma _{0}}

basically the ratio between the autocovariance coefficient of lag k and the auto covariance coefficient at lag 0, which is the first auto covariance coefficient.
And this ratio is always between -1 and 1.

Again, we could just estimate it: rk = ck / c0

Remember, ck, was our estimation for the auto covariance coefficient at lag k, and c0 is our estimation for the auto covariance coefficient at lag 0.
The most important part here is the time difference between these two random variables. Which is k, the lag.

The correlogram

If I calculate these coefficients for each lag starting from 0 to n and plot them, I get a correlogram (statsmodels has a handy function for it):

import as smpl

smpl.plot_acf(whiteNoise, lags=20);


The blue shadow shows the confidence interval (coefficients are an estimation), so we should pay attention to the coefficients sticking out.

It always starts at 1 because for lag 0 is basically c0/c0, which is 1.
Then later on, there is not much correlation between all the different lags because we generated this data as a purely random process.  So this plot tells us that there are not much significant lags in the previous steps.

How is a correlogram going to help us? It’s a visual immediate representation of the time series auto correlation.
Let’s see an example where every point is correlated with the previous one: a sequence.

sequence1 = np.arange(0, 1, 1/100)  # a sequence from 0 to 1 by 1/100
smpl.plot_acf(sequence1, lags=20);


As expected, each lag is correlated with the previous one.
This is clearly visible from the plot.

sequence2 = np.array([1,2,3])
sequence2 = np.resize(sequence2, 21) # repeat it 7 times
smpl.plot_acf(sequence2, lags=20);
There is a pattern

From this plot, is also immediately visible that there is a repeating pattern.

The partial autocorrelation

The partial autocorrelation tries to find the effect of X of t on X of t + h, all the way to the right, after we control for or partial out the intervening random variables.

smpl.plot_pacf(whiteNoise, lags=20);


The partial autocorrelation function also does not show any correlation for the white noise.
These correlograms are an immediate way to see if the process is a random one, such as a white noise.

Now that we have examined the white noise, we are going to move on to a famous model, namely the Random Walk.

Random Walk

This is one of the most basic concept of the time series and can serve as model for many time series, for example financial time series can be modelled by a random walk.

A random walk is a stochastic process that relates the current value of the time series to the previous value by adding some random deviation to the previous value:

X_{t} = X_{t-1} + Z_{t}

where Z_t is the white noise.

It can be shown that the random walk is simply the sum of the elements from a discrete white noise series. In particular, the covariance is equal to the variance multiplied by the time; as time increases, so does the variance.
Hence, the random walk is not a stationary process. Also, if we check the covariance, we see that too is dependent on time.

def plotTS(y, lags=None, figsize=(10, 8), style='bmh'):
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))

        ts_ax.set_title('Time Series Analysis Plots')
        smpl.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.5)
        smpl.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.5)


The function above will plot the time series, followed by the correlograms.
Let’s define a random walk and then we apply to it the function above:

SAMPLES = 1000
Z = np.random.normal(size = SAMPLES)
for t in range(SAMPLES):
    rw[t] = rw[t-1] + Z[t]

plotTS(rw, lags=20);


Clearly our random walk is not stationary.
Recall that a random walk is X_{t} = X_{t-1} + Z_{t} . Using algebra we can say that X_{t} - X_{t-1} = Z_{t} . Thus the first differences of our random walk series should equal a white noise process!
We can use the “np.diff()” function on our TS and see if this holds.

plotTS(np.diff(rw), lags=20);


Our definition holds as this looks exactly like a white noise process!

Applying the difference to a Time Series is one way to remove the trend; this will get a stationary time series from a random walk.
Let’s see more ways to extract the trend from a Time Series.

Extract the trend

One of the basic information we want to see in a time series is if it has a trend and how it looks like.
Let’s get back to our noisy sine wave.
We know that there is a trend (because of how it has been built) but how to extract it?

One way to get the trend is to use regression, basically a local regression.

You define a window of width m and then you just average the previous m values to get a smooth estimate for the current time. And then you continue to move over, use the next m values to get the estimate for the next point and so on.

This is called moving average and is used to get a fine trend (it’s basically a low-pass filter).

To get a higher level trend, a method called Loess (or Lowess) can be used.
It still defines a window and uses the nearest neighbours that are within that window to do the local regression. But the Loess regression uses polynomial terms: (1-distance(t,t_{0})^{3})^{^{3}}  .
Higher weights for the nearer points.

Trend from moving average

def movingAverage(values, order):
    # perform local regression on "values" using moving windows of width "order"
    end = len(values)
    out = np.zeros(len(values))

    out[0] = values[1]
    for i in range(1,end):
        if (i - order <= 1):
            j = 1
            j += 1
        out[i] = sum(values[j-1:i]) / (i-j+1)

    return out

Let's try with a window of width 10 (moving average of order 10) on our noisy sine wave:

trendMA = movingAverage(noisyWave,10)

fig,ax = plt.subplots()
fig.suptitle("Trend extracted from noisy sine wave")


You can adjust how granular the trend is by increasing or decreasing the moving average window (above it uses a 10 months window).

Trend from Loess regression

The Loess regression is more complicated but there is a function for it in the statsmodels package:

import statsmodels.nonparametric.smoothers_lowess as smL

trendL = smL.lowess(noisyWave, x, frac=0.25, return_sorted=False)

fig,ax = plt.subplots()
ax.plot(x,trendMA, label='Trend from Moving Average')
ax.plot(x,trendL, label="Trend from Loess")
fig.suptitle("Trends extracted from noisy sine wave")


You can see that the trend is smoother with Loess.

Model a time series

The simplest model that we could use to make predictions would be to persist the last observation and it provides a super-simple baseline.

The mean model is also a starting point for constructing forecasting models for time series data. It assumes that the best predictor of what will happen tomorrow is the average of everything that has happened up until now.

The random walk model assumes that the best predictor of what will happen tomorrow is what happened today, and all previous history can be ignored.

Intuitively there is a spectrum of possibilities in between these two extremes.

We are now going to see a couple of flexible models that can be used to model a time series and therefore to make predictions on future values.
There are many models available, also quite sophisticated, so these are just two basic examples: the Moving Average (MA) model and the Autoregressive (AR) model.
They can account for season and trend but also capture the effects of autocorrelation embedded in the data.

Moving Average (MA) model

We are going to introduce the Moving Average of order q model, known as MA(q).
Its concept is to take an average of what has happened in some window of the recent past.

The model is a linear combination of white noise terms.

Mathematically, the MA(q) is a linear regression model for a time series X_t:

x_{t} = w_{t} + \beta_{1} w_{t-1} + ... + \beta_{q} w_{t-q}

Where w_{t} is white noise with mean=0 and variance=σ2.

The motivation for the MA model is that we can observe “shocks” (the last q shocks) in the error process directly by fitting a model to the error terms.

The mean of a MA(q) process is zero. This is easy to see as the mean is simply a sum of means of white noise terms, which are all themselves zero.

Let’s start with a MA(1) process. If we set β1=0.6 we obtain the following model:


SIZE = 100
ma1 = np.zeros(SIZE)
    # Generate noise
noise = np.random.standard_normal((SIZE,)) # normal distribution
    # Loop for generating MA(1) process

for i in range(0, SIZE):
    ma1[i] = noise[i] + 0.6*noise[i-1]

fig,ax = plt.subplots()
fig.suptitle("A moving average process of order 1")


The correlogram for an MA(1) will be interesting:

smpl.plot_acf(ma1, lags=20);


As we saw above in the formula for ρk, for k>q, all autocorrelations should be zero. Since q=1, we should see a significant peak at k=1 and then insignificant peaks subsequent to that.
And that is exactly what we see.

In fact, this is a useful way of seeing whether an MA(q) model is appropriate. By taking a look at the correlogram of a particular series we can see how many sequential non-zero lags exist. If q such lags exist then we can legitimately attempt to fit a MA(q) model to a particular series.

In the associated notebook, I added more example, for MA(2) and MA(4).

Autoregressive model

The autoregressive model is essentially a linear regression model where the previous terms are the predictors (this is where the “regressive” comes from in “autoregressive”):

x_{t} = w_{t} + \beta_{1} x_{t-1} + ... + \beta_{p} x_{t-p}

Where w_{t} is white noise, beta are the coefficients and p is the model order.

Note that the AR(1) model is a random walk with beta1 equal to unity. Generally,  AR(p) models are an extension of the random walk that includes terms further back in time.

The mean of an AR(p) process is zero. However, the autocovariances and autocorrelations are given by recursive functions, known as the Yule-Walker equations.

SIZE = 100
beta = 1
X = np.zeros(SIZE)
    # Generate noise
W = np.random.standard_normal((SIZE,)) # noise as a normal distribution
    # Loop for generating AR1 process
#X[1] = W[1]
for t in range(ORDER, SIZE):
    X[t] = beta * X[t-1] + W[t]

fig,ax = plt.subplots()
fig.suptitle("AR(1) Time Series == random walk");


smpl.plot_acf(X, lags=20);


It looks like a random walk.
An AR(1) process is only stationary when -1 < beta < 1.
Let’s try a different AR(1) process with beta different than 1:

SAMPLES = 1000
beta = 0.6
X = W = np.random.normal(size = SAMPLES)
for t in range(SAMPLES):
    X[t] = beta * X[t-1] + W[t]

plotTS(X, lags=20);


There is significant serial correlation between lagged values but interestingly the PACF plot shows only one significant value.
Let’s try an AR(2) process.

SAMPLES = 1000
X = W = np.random.normal(size = SAMPLES)
for t in range(SAMPLES):
    X[t] = 0.666 * X[t-1] - 0.333 * X[t-2] + W[t]

plotTS(X, lags=20);

As before we can see that the ACF correlogram differs significantly from that of white noise, as we’d expect.
The PACF plot shows two statistically significant peaks at k=1, k=2.

This is not by accident, an AR model of order p will always have a PACF with p significant values.

Given an unknown time series,  you can look at PACF and if you see that PACF cuts off after some lag p, that gives a reason to model the data using an autoregressive process with order p.

ARIMA – AutoRegressive Integrated Moving Average model

AR and MA models, that we have just seen, are special cases of a more general class of time series models known as ARIMA models.

ARIMA stands for Auto-Regressive Integrated Moving Averages.
The ARIMA forecasting for a stationary time series is nothing but a linear equation. The predictors depend on the parameters (p,d,q) of the ARIMA model:

  • Number of AR (Auto-Regressive) terms (p):
    AR terms are just lags of dependent variable.
    For instance if p is 3, the predictors for x(t) will be x(t-1)….x(t-3).
  • Number of MA (Moving Average) terms (q):
    MA terms are lagged forecast errors in prediction equation.
    For instance if q is 3, the predictors for x(t) will be e(t-1)….e(t-3) where e(i) is the difference between the moving average at i-th instant and actual value.
  • Number of Differences (d): These are the number of nonseasonal differences.
    If no difference has been applied, d will be 0. If a first-oder difference has been applied (e.g. np.diff(time-series) ) then d will be 1, and so on.

Here are some of the nonseasonal ARIMA models that you most often encounter:

  • ARIMA(0,0,0) = mean (constant) model
  • ARIMA(0,1,0) = random walk model
  • ARIMA(1,0,0) = 1st order autoregressive model
  • ARIMA(2,0,0) = 2nd order autoregressive model
  • ARIMA(1,1,0) = 1st order AR model applied to first difference of Y
    and so on …

Objective: adjust the knobs (p,d,q) until the residuals are “white noise” (uncorrelated).

Let’s see some examples.
The statsmodels package offer a function to generate ARIMA models:

import statsmodels.tsa.arima_process as ap

  # AR and MA coeffs need always 1 at first
simulated = ap.arma_generate_sample(ar=[1], ma=[1], nsample=100) 

fig,ax = plt.subplots()
fig.suptitle("Generated ARIMA(0,0,0)")


ar_coef = [1, -0.75, 0.25]
ma_coef = [1, .65, .35]
n = 100

arma_process = ap.ArmaProcess(ar_coef, ma_coef)
simulated = ap.arma_generate_sample(ar_coef, ma_coef, n)

fig,ax = plt.subplots()
fig.suptitle("Generated ARIMA(2,0,2)");


I will just use the above generated (2,0,2) time series:

from statsmodels.tsa.arima_model import ARIMA

modelFitted = ARIMA(simulated, order=(2,0,2))
results = # disp: do not display covergence info
ARMA Model Results
Dep. Variable: y No. Observations: 100
Model: ARMA(2, 2) Log Likelihood -134.832
Method: css-mle S.D. of innovations 0.918
Date: Sat, 16 Jun 2018 AIC 281.664
Time: 14:56:50 BIC 297.295
Sample: 0 HQIC 287.990
coef std err z P>|z| [0.025 0.975]
const 0.1044 0.294 0.355 0.723 -0.472 0.681
ar.L1.y 1.0570 0.152 6.959 0.000 0.759 1.355
ar.L2.y -0.5977 0.111 -5.398 0.000 -0.815 -0.381
ma.L1.y 0.5753 0.174 3.308 0.001 0.234 0.916
ma.L2.y 0.1602 0.204 0.786 0.434 -0.239 0.560


fig,ax = plt.subplots()
ax.plot(results.fittedvalues, color='red')
fig.suptitle("A contrived example of ARIMA modeling");


Rough rules of thumb:

  • If the stationarised series has positive autocorrelation at lag 1, AR terms often work best. If it has negative autocorrelation at lag 1, MA terms often work best.
  • ACF that dies out gradually and PACF that cuts off sharply after a few lags: AR signature
  • ACF that cuts off sharply after a few lags and PACF that dies out more gradually: MA signature

The process

Let’s summarise the general process when analysing a time series:

  1. Outline a hypothesis about a particular time series and its behaviour
  2. Obtain the correlogram of the time series and assess its autocorrelation
  3. Fit an appropriate model to reduce the autocorrelation in the residuals of the model and its time series
  4. Refine the fit until no correlation is present and use mathematical criteria to assess the model fit
  5. Use the model and its second-order properties to make forecasts about future values
  6. Assess the accuracy of these forecasts using statistical techniques (such as MSE for regressive metrics)
  7. Iterate through this process until the accuracy is optimal.

Let’s finally see an example of ARIMA model fitted to an existing time series in Part 2.

Note: this post is part of a series about Machine Learning with Python.

One thought on “Introduction to time series – Part I: the basics

  1. Pingback: Introduction to time series – Part II: an example – Look back in respect

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s