Time Series: Air revenue passenger prediction

7 minute read

Overview

In this project we forecast monthly air revenue passenger miles using time series techniques.
Time series analysis is a useful machine learning method when we need to forecast a value that has a time component, i.e. information about the periods at which previous values were recorded. The Jupyter Notebook for this project can be found here.

Data: the series is sourced from Federal Reserve Economic Data (FRED) via the Quandl API:
https://www.quandl.com/data/FRED/AIRRPMTSI-Air-Revenue-Passenger-Miles.
Observations are monthly air revenue passenger miles from 2000 to the present.

Import libraries

For this project we will need the quandl library to retrieve the data and statsmodels.api to work with statistical models. Throughout the analysis, we will visualize results with the matplotlib and plotly libraries.

Python code

import quandl # use the quandl library to retrieve the data

"""
- will use pandas and numpy for data manipulation and calculations
- import datetime to work with the datetime type
- use the Series class to work with time series

"""
import pandas as pd          
import numpy as np           
from datetime import datetime    
from pandas import Series  

import matplotlib.pyplot as plt 
%pylab inline
plt.style.use('seaborn-whitegrid')

# will ignore the warnings
import warnings                 
warnings.filterwarnings("ignore")

"""
- to work with statistical models we use statsmodels.api and the stats module from scipy
- use itertools and seasonal_decompose for data preparation and model building
- to plot the ACF/PACF charts use the tsaplots graphics module

"""
from scipy import stats
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from statsmodels.tsa.api import Holt

import itertools
from itertools import product
from statsmodels.tsa.seasonal import seasonal_decompose # perform seasonal decomposition

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

"""
- visualize results using the plotly library, connected to the current notebook

"""
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly
import plotly.graph_objs as go

init_notebook_mode(connected=True)

Data processing and EDA

To retrieve data from the source, we call the .get() function with an API token and the dataset reference.

Python code

authtoken = "XXXXXX"
df = quandl.get("FRED/AIRRPMTSI", authtoken=authtoken)
df.head() # show what the data looks like

[Output: first rows of the dataset]

In practice it is not necessary to use the entire time series. For this analysis we will use the date range from 2010-01-01 to 2018-09-01.

Python code

df=df['2010-01-01':'2018-09-01']

# plot values from range
fig, ax = plt.subplots(figsize=(15,7)) # setting up size
df.Value.plot() # graph plot
plt.ylabel('Value')
plt.title('Monthly air revenue')
plt.show()

[Figure: monthly air revenue, 2010-01 to 2018-09]

Stationarity test

We can see that the series is characterized by systematically repeating cycles. To obtain accurate predictions we need to make it stationary before building and applying models. We start with a seasonal decomposition of the series and the Dickey-Fuller stationarity test.

Python code

"""
- plot decomposed series, display Dickey-Fuller criteria applying  .adfuller() function 

"""

plt.figure(figsize(15,10))
sm.tsa.seasonal_decompose(df.Value).plot()
print "Dickey-Fuller criteria: p=%f" % sm.tsa.stattools.adfuller(df.Value)[1]

Dickey-Fuller criterion: p=0.999088

[Figures: seasonal decomposition of the series (observed, trend, seasonal, residual)]

The high p-value of the Dickey-Fuller test means we cannot reject the hypothesis of non-stationarity at the 5% significance level, so the series is treated as non-stationary.
The residual and seasonal components of the decomposed series have a clearly systematic character.
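
If needed, the full output of the .adfuller() function can be inspected as well, since the test statistic and critical values are more informative than the p-value alone. A minimal sketch, using the same df.Value series as above:

Python code

# sketch: unpack the full Dickey-Fuller output instead of only the p-value
# adfuller returns (statistic, p-value, used lags, nobs, critical values, icbest)
adf_stat, p_value, used_lag, nobs, crit_values, _ = sm.tsa.stattools.adfuller(df.Value)
print("ADF statistic: %.3f, p-value: %.4f" % (adf_stat, p_value))
for level, crit in crit_values.items():
    print("critical value (%s): %.3f" % (level, crit))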

Variance stabilization

Additionally, we can apply the Box-Cox power transformation to stabilize the variance of the series.
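
For reference, the Box-Cox transform computed by stats.boxcox is $y^{(\lambda)} = (y^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $y^{(\lambda)} = \ln y$ for $\lambda = 0$, with the optimal $\lambda$ estimated from the data by maximum likelihood.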

Python code

"""
- create an additional column of transformed values using the Box-Cox method
- apply the Dickey-Fuller test to the transformed values

"""
df['Value_t'], lam = stats.boxcox(df['Value'])
plt.figure(figsize=(13,5))
df['Value_t'].plot()
plt.ylabel('Value (Box-Cox transformed)')
print("lambda-optimum: %f" % lam)
print("Dickey-Fuller criterion: p=%f" % sm.tsa.stattools.adfuller(df['Value_t'])[1])

lambda-optimum: -0.039101
Dickey-Fuller criterion: p=0.999030

[Figure: Box-Cox transformed series]

Differencing

Although the Box-Cox transformation was supposed to yield a stationary process, the Dickey-Fuller criterion still indicates non-stationarity, and we can still observe non-random cycles in the series.
To remove these systematic fluctuations, we can try differencing.

Python code

"""
- applying seasonal differencing of order 12 shortens the series slightly
- however, this operation improves stationarity

"""

df['diff'] = df.Value_t.diff(12)
plt.figure(figsize=(15,10))
sm.tsa.seasonal_decompose(df['diff'][12:]).plot()
print("Dickey-Fuller criterion: p=%f" % sm.tsa.stattools.adfuller(df['diff'][12:])[1])

Dickey-Fuller criterion: p=0.599792

[Figure: seasonal decomposition of the seasonally differenced series]

Based on the Dickey-Fuller criterion we still cannot confirm stationarity, so we apply first-order differencing to the new series.

Python code

df['diff2'] = df['diff'].diff(1)
plt.plot(df['diff2'])
plt.show()
print("Dickey-Fuller criterion: p=%f" % sm.tsa.stattools.adfuller(df['diff2'][13:])[1])

[Figure: series after seasonal and first-order differencing]

Dickey-Fuller criterion: p=0.000324

The resulting series can be categorized as stationary and resembles a noise process, which allows us to move on to modelling.
Before fitting a model, let's look at the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots.

AR model for lagged values estimation

Let's fit an autoregressive model to establish the optimal number of lagged values.
We will use the differenced series, since only a stationary series should be passed to the model.

Python code

"""
- skip the first 13 observations, since seasonal and first-order differencing were applied
- an autoregression model is implemented in statsmodels as AR()
- order selection will be based on the AIC (Akaike information criterion)

"""
data = df['diff2'][13:]
model = smt.AR(data)
order = smt.AR(data).select_order(ic='aic', maxlag=25)

print('Best lag order = {}'.format(order))

Best lag order = 8

Given the best lag order, we are ready to visualize the autocorrelation and partial autocorrelation with ACF and PACF plots.

Python code

plt.figure()
plt.subplot(211)

plot_acf(df['diff2'][13:], ax=plt.gca(), lags=8)
plt.subplot(212)
plot_pacf(df['diff2'][13:], ax=plt.gca(), lags=8)
plt.show()

[Figures: ACF and PACF plots of the differenced series]

ACF plots display the correlation between a series and its lags and help determine the order of the MA(q) component.
Partial autocorrelation plots (PACF), as the name suggests, display the correlation between a variable and its lags that is not explained by earlier lags. PACF plots are useful for setting the order of the AR(p) component.
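
As a quick cross-check (a sketch, not part of the original analysis), the confidence intervals returned by the acf and pacf functions in statsmodels can be used to count the significant lags programmatically:

Python code

# sketch: list the lags whose ACF/PACF estimates fall outside the 95% confidence band
series = df['diff2'][13:]
acf_vals, acf_ci = sm.tsa.stattools.acf(series, nlags=8, alpha=0.05)
pacf_vals, pacf_ci = sm.tsa.stattools.pacf(series, nlags=8, alpha=0.05)

# the returned band is centered on each estimate, so re-center it at zero before comparing
acf_sig = [k for k in range(1, 9) if abs(acf_vals[k]) > acf_ci[k, 1] - acf_vals[k]]
pacf_sig = [k for k in range(1, 9) if abs(pacf_vals[k]) > pacf_ci[k, 1] - pacf_vals[k]]
print("significant ACF lags:", acf_sig)
print("significant PACF lags:", pacf_sig)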

Model selection

Before fitting the models we split the data into train and test samples.
The last 15 observations (15 months, about 14% of the data) will serve as the test sample.

Python code

train=df[:'2017-06-01']
test=df['2017-07-01':]

Holt's linear trend model

We can start by building Holt's linear trend model: exponential smoothing with an added trend component.

Python code

holt = Holt(np.asarray(train['Value'])).fit(smoothing_level = 0.3,smoothing_slope = 0.1)
test['Holt_linear'] = holt.forecast(len(test))

plt.figure(figsize=(14,6))
plt.plot(train['Value'], label='Train')
plt.plot(test['Value'], label='Test')
plt.plot(test['Holt_linear'], label='Holt_linear')
plt.title('Holt linear trend')
plt.legend(loc='best')
plt.show()

[Figure: Holt linear trend forecast vs train and test values]

ARIMA: parameter selection

The SARIMAX implementation allows us to test a range of parameters, such as the trend, seasonal and noise components, and pick the best-performing model, much like a grid search.

On the PACF plot the last 3 lags are significantly different from zero, so the value of $p$ can be chosen as $3$.
For the $q$ value the ACF shows only the first lag as a considerably non-zero value, but in order to select the best from a range of models it is possible to fit SARIMAX with larger values, for instance $q$ from range(0, 3).

The seasonal difference is $D=1$ and the ordinary difference parameter is $d=1$.
Parameters $P$ and $Q$ will be selected from a range based on the best model, i.e. we search over SARIMA$(p,1,q)\times(P,1,Q)_{12}$ candidates and keep the one with the lowest AIC.

Python code

"""
- since we need to fit a model for every possible parameter combination,
we use the product of the parameter lists

"""

P = range(0, 2)
Q = range(0, 2)
p = range(0, 3)
q = range(0, 3)

a = list(itertools.product(P, Q, p, q))

"""
- keep the best model based on the AIC criterion

"""
best_aic = float("inf")
warnings.filterwarnings('ignore')

for param in a:
    try:
        model = sm.tsa.statespace.SARIMAX(train.Value_t, order=(param[2], 1, param[3]),
                                          seasonal_order=(param[0], 1, param[1], 12)).fit(disp=-1)
    except ValueError:
        print('wrong parameters:', param)
        continue
    aic = model.aic
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
warnings.filterwarnings('default')

best_model.summary() 

[Output: SARIMAX best model summary]

The Ljung-Box test indicates that the residuals are not autocorrelated at a high significance level.
The residual autocorrelation plot shows only one outlier that is not captured by the model. In general, the residuals are centered around zero, and no considerable differences are observed at the seasonal lags.
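
The Ljung-Box p-values reported in the summary can also be computed explicitly from the residuals. A minimal sketch, assuming an older statsmodels where acorr_ljungbox returns arrays of statistics and p-values (newer versions return a DataFrame):

Python code

# sketch: Ljung-Box test on the model residuals; large p-values mean no significant autocorrelation
from statsmodels.stats.diagnostic import acorr_ljungbox

resid = best_model.resid[13:]
lb_stat, lb_pvalue = acorr_ljungbox(resid, lags=12)
for lag, p in enumerate(lb_pvalue, start=1):
    print("lag %d: p=%.3f" % (lag, p))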

Python code

"""
- use .plot_acf() to visualize the autocorrelation of the residuals of the best model

"""

plt.figure(figsize=(13,6))
plt.subplot(211)
best_model.resid[1:].plot()
plt.ylabel(u'Residuals')

ax = plt.subplot(212)
sm.graphics.tsa.plot_acf(best_model.resid[13:].values.squeeze(), lags=8, ax=ax)

[Figures: model residuals and their autocorrelation]

The Student's t-test does not reject the hypothesis that the residuals are unbiased (zero mean), while the Dickey-Fuller test rejects the hypothesis of non-stationarity.

Python code

# use ttest_1samp for the Student's t-test
print("Student criterion: p=%f" % stats.ttest_1samp(best_model.resid[13:], 0)[1])
print("Dickey-Fuller criterion: p=%f" % sm.tsa.stattools.adfuller(best_model.resid[13:])[1])

Student criterion: p=0.953890
Dickey-Fuller criterion: p=0.000000

Below we define the inverse procedure for the Box-Cox transformation and apply it to the train sample. Then we plot the train predictions overlaid with the true values to compare the modelling result with the expected series.
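
For $\lambda \neq 0$ the inverse Box-Cox transform of a value $x$ is $(\lambda x + 1)^{1/\lambda}$ (implemented below via exp and log), and for $\lambda = 0$ it is simply $e^{x}$.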

Python code

"""
- the function returns the inverse transformation for a given lambda:
- exp transformation for lambda = 0;
- inverse power transformation for lambda != 0

"""
def invboxcox(y, lam):
    if lam == 0:
        return np.exp(y)
    else:
        return np.exp(np.log(lam * y + 1) / lam)

train['Value_new'] = invboxcox(best_model.fittedvalues, lam)
plt.figure(figsize=(15,7))
train.Value.plot()
train.Value_new[13:].plot(color='r')
plt.ylabel('Value')
plt.show()

[Figure: fitted values (red) over the training series]

Evaluating predictions

In the next step we evaluate predictions on the test sample and plot the result combined with the train data set.

Python code

"""
- make predictions, apply the inverse transformation and add them to a data frame
- combine historical values with the forecast to display on a graph

"""

l=invboxcox(best_model.predict(start=89, end=120), lam)
l=l.to_frame()
l=l.rename(columns={0: 'Forecast'})
test=test.rename(columns={"Value": 'Test_value'})

df2=train[['Value']]
# concat train and forecast frames for plotting
df2 = pd.concat([df2, l])

"""
- combine the true data and forecast values in one figure
- plot the figure inside the notebook with connected mode on

"""

data1 = go.Scatter(
    x=df2.index,
    y=df2['Value'],
    name='True Value')

data2 = go.Scatter(
    x=l.index,
    y=l['Forecast'],
    name='Forecast Value')

data= [data1, data2]
layout = {'title': 'Air Revenue (train) with Forecasted'}
fig=go.Figure(data=data, layout=layout)

iplot(fig, show_link=False)

[Figure: interactive plot of train values and forecast]

Forecast Accuracy Evaluation

To evaluate model performance, we estimate the mean squared error and the mean (signed) forecast error, and plot the forecasted values over the test sample.

Python code

"""
- display forecasted and test values in red and green, respectively

"""

plt.figure(figsize=(10, 7))
plt.plot(df2['Value'], 'b-')
plt.plot(l['Forecast'][:'2018-11-05'], 'r-')
plt.plot(test['Test_value'], 'g-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Air Revenue')
plt.title('Air Revenue: entire sample with Forecasted')

[Figure: forecast (red) and test values (green) over the full sample]

Estimate the MSE and the mean forecast error on the test sample:

Python code

# mean squared error on test
mse = ((l['2017-07-01':'2018-09-01']['Forecast'] - test['2017-07-01':'2018-09-01']['Test_value']) ** 2).mean()
print('The Mean Squared Error of forecasts is {}'.format(round(mse, 2)))

# mean (signed) forecast error on test
me = (l['2017-07-01':'2018-09-01']['Forecast'] - test['2017-07-01':'2018-09-01']['Test_value']).mean()
print('The Mean Forecast Error is {}'.format(round(me, 2)))

The Mean Squared Error of forecasts is 1.96499098925e+12
The Mean Forecast Error is -578170.27
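
Note that the second metric above is a signed mean error, which shows the direction of the bias rather than its absolute size. A conventional mean absolute error (and a scale-free MAPE) would take absolute values of the differences; a minimal sketch using the same frames:

Python code

# sketch: MAE and MAPE on the test window (absolute errors, unlike the signed mean error above)
errors = (l['2017-07-01':'2018-09-01']['Forecast']
          - test['2017-07-01':'2018-09-01']['Test_value'])
mae = errors.abs().mean()
mape = (errors.abs() / test['2017-07-01':'2018-09-01']['Test_value']).mean() * 100
print('The Mean Absolute Error of forecasts is {}'.format(round(mae, 2)))
print('The Mean Absolute Percentage Error is {}%'.format(round(mape, 2)))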
