Machine Learning approach#

  • Time series prediction, or forecasting, can be very similar to modelling and prediction with tabular data.

    • A set of input variables, usually a single output variable.

    • Ordinary machine learning methods can be applied.

  • The inclusion of time can be done by adding one or more delayed variables, possibly including the response.
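As a minimal sketch of how delayed variables turn a series into an ordinary tabular problem, the snippet below builds a predictor matrix from the previous values of a toy series. The helper name `make_lagged_table` and the toy data are assumptions made for illustration; they are not part of the course material.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_lagged_table(series, n_lags):
    """Return (X, y) where row i of X holds the n_lags values preceding y[i]."""
    series = np.asarray(series, dtype=float)
    # Each window holds n_lags past values followed by the value to predict
    windows = sliding_window_view(series, n_lags + 1)
    X = windows[:, :n_lags]   # columns ordered from oldest to most recent lag
    y = windows[:, n_lags]    # the target at the end of each window
    return X, y

toy_series = np.arange(10.0)          # hypothetical data: 0, 1, ..., 9
X_lagged, y_target = make_lagged_table(toy_series, n_lags=3)
print(X_lagged[:2])
print(y_target[:2])
```

The first `n_lags` observations have no complete history and therefore do not appear as targets, which is the price of adding delayed variables.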

Validation#

  • As soon as time is part of a model, extra care needs to be taken in validation.

  • Cross-validation and training-validation-test splits are still relevant.

  • However, training, validation and test sets need to follow time chronologically and cannot overlap.

  • Instead of traditional cross-validation, one can perform backtesting with a sliding or expanding window:

https://github.com/khliland/IND320/blob/main/D2Dbook/images/Backtesting_sliding.png?raw=TRUE https://github.com/khliland/IND320/blob/main/D2Dbook/images/Backtesting_expanding.png?raw=TRUE

Figures from Roy Yang’s blog post on uber.com

Note

The first samples will never be in the test set!
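The two backtesting schemes can be sketched in plain Python. The generator below is a simplified illustration with hypothetical names (`backtest_splits`, `max_train_size`), not a library API: the training window either grows from the start of the series (expanding) or keeps a fixed maximum length (sliding), and the test block always follows the training block in time.

```python
def backtest_splits(n_samples, n_folds, test_size, max_train_size=None):
    """Yield (train_indices, test_indices) in chronological order.

    max_train_size=None gives an expanding window,
    an integer gives a sliding window of at most that many samples.
    """
    first_test = n_samples - n_folds * test_size
    if first_test <= 0:
        raise ValueError("Not enough samples for the requested folds")
    for k in range(n_folds):
        test_start = first_test + k * test_size
        train_start = 0
        if max_train_size is not None:
            train_start = max(0, test_start - max_train_size)
        train = list(range(train_start, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test

expanding = list(backtest_splits(10, n_folds=3, test_size=2))
sliding = list(backtest_splits(10, n_folds=3, test_size=2, max_train_size=4))
for train, test in sliding:
    print("Train:", train, "Test:", test)

# Samples that are used for testing at least once
tested = sorted({i for _, test in expanding for i in test})
print("Tested samples:", tested)
```

In both schemes the earliest samples only ever serve as training data, which is what the note above points out.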

Shipping, oil, interest rates and exchange rates#

  • These are public data from Norges Bank (the Norwegian central bank), SSB (Statistics Norway), Eurostat and the U.S. Energy Information Administration for the period 2000-2014 (monthly).

  • The data are available at ResearchGate and were part of a Master’s thesis by Raju Rimal.

# Read the FinalData sheet of the OilExchange.xlsx file using Pandas
import pandas as pd
# You may get a warning here, because the file contains pasted graphics
OilExchange = pd.read_excel('../../data/OilExchange.xlsx', sheet_name='FinalData') 
OilExchange.head()
/Users/kristian/miniforge3/envs/IND320_2024/lib/python3.12/site-packages/openpyxl/worksheet/_reader.py:329: UserWarning: Unknown extension is not supported and will be removed
  warn(msg)
Date PerEURO PerUSD KeyIntRate LoanIntRate EuroIntRate CPI OilSpotPrice ImpOldShip ImpNewShip ... ExpExShipOilPlat TrBal TrBalExShipOilPlat TrBalMland ly.var l2y.var l.CPI ExcChange Testrain season
0 2000-01-01 8.1215 8.0129 5.500000 7.500000 3.04 104.1 25.855741 114 915 ... 38619 18575 19238 -3257 8.0968 8.1907 103.6 Increase True winter
1 2000-02-01 8.0991 8.2361 5.500000 7.500000 3.28 104.6 27.317905 527 359 ... 38730 14217 17200 -4529 8.1215 8.0968 104.1 Decrease True winter
2 2000-03-01 8.1110 8.4111 5.500000 7.500000 3.51 104.7 26.509183 1385 929 ... 42642 13697 18380 -5562 8.0991 8.1215 104.6 Increase True Spring
3 2000-04-01 8.1538 8.6081 5.632353 7.632353 3.69 105.1 21.558821 450 2194 ... 36860 13142 15499 -5147 8.1110 8.0991 104.7 Increase True Spring
4 2000-05-01 8.2015 9.0471 5.750000 7.750000 3.92 105.1 25.147242 239 608 ... 42932 17733 18505 -5732 8.1538 8.1110 105.1 Increase True Spring

5 rows × 28 columns

OilExchange.columns
Index(['Date', 'PerEURO', 'PerUSD', 'KeyIntRate', 'LoanIntRate', 'EuroIntRate',
       'CPI', 'OilSpotPrice', 'ImpOldShip', 'ImpNewShip', 'ImpOilPlat',
       'ImpExShipOilPlat', 'ExpCrdOil', 'ExpNatGas', 'ExpCond', 'ExpOldShip',
       'ExpNewShip', 'ExpOilPlat', 'ExpExShipOilPlat', 'TrBal',
       'TrBalExShipOilPlat', 'TrBalMland', 'ly.var', 'l2y.var', 'l.CPI',
       'ExcChange', 'Testrain', 'season'],
      dtype='object')
# Read the FinalCodeBook sheet of the OilExchange.xlsx file using Pandas
Explanations = pd.read_excel('../../data/OilExchange.xlsx', sheet_name='FinalCodeBook')
Explanations[['Variables','Label']]
Variables Label
0 Date Date
1 PerEURO Exchange Rate of NOK per Euro
2 PerUSD Exchange Rate of NOK per USD
3 KeyIntRate Key policy rate (Percent)
4 LoanIntRate Overnight Lending Rate (Nominal)
5 EuroIntRate Money market interest rates of Euro area (EA11...
6 CPI Consumer Price Index (1998=100)
7 OilSpotPrice Europe Brent Spot Price FOB (NOK per Barrel)
8 ImpOldShip Imports of elderly ships (NOK million)
9 ImpNewShip Imports of new ships (NOK million)
10 ImpOilPlat Imports of oil platforms (NOK million)
11 ImpExShipOilPlat Imports excl. ships and oil platforms (NOK mil...
12 ExpCrdOil Exports of crude oil (NOK million)
13 ExpNatGas Exports of natural gas (NOK million)
14 ExpCond Exports of condensates (NOK million)
15 ExpOldShip Exports of elderly ships (NOK million)
16 ExpNewShip Exports of new ships (NOK million)
17 ExpOilPlat Exports of oil platforms (NOK million)
18 ExpExShipOilPlat Exports excl. ships and oil platforms (NOK mil...
19 TrBal Trade balance (Total exports - total imports) ...
20 TrBalExShipOilPlat Trade balance (Exports - imports, both excl. s...
21 TrBalMland Trade balance (Mainland exports - imports excl...
22 ly.var First Lag Exchange Rate of NOK per Euro
23 l2y.var Second Lag Exchange Rate of NOK per Euro
24 l.CPI First Lag of Consumer Price Index
25 ExcChange Change status of Exchange Rate (Increase, Decr...
26 Testrain Test and Train seperation of data
27 season Seasons

Modelling without time#

  • For starters, let us ignore time and build a simple prediction model for the exchange rate.

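  • We will use scikit-learn’s Pipeline to combine standardisation (scaling) and linear regression, and cross_val_predict to perform K-fold cross-validation. With cv=10 and a regressor, scikit-learn uses KFold without shuffling, so the folds are contiguous blocks, but each block is still predicted by a model trained on both earlier and later samples.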

# Import Pipeline, StandardScaler, and LinearRegression from their respective modules in sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Create a pipeline that scales the data and performs linear regression
pipe = Pipeline([('scaler', StandardScaler()), ('reg', LinearRegression())])

# Fit the pipeline with PerEURO as response and variables 3:-6 as predictors for the samples having True in the Testrain column
OilExchange_train = OilExchange.loc[OilExchange.Testrain==True,:].copy()
OilExchange_test = OilExchange.loc[OilExchange.Testrain==False,:].copy()
pipe.fit(OilExchange_train.loc[:, OilExchange_train.columns[3:-6]], \
         OilExchange_train.loc[:, 'PerEURO'])
Pipeline(steps=[('scaler', StandardScaler()), ('reg', LinearRegression())])
# Predict the corresponding data for Testrain = False
PerEURO_pred = pipe.predict(OilExchange_test.loc[:, OilExchange.columns[3:-6]])
# Plot the predicted values against the actual values
import matplotlib.pyplot as plt
plt.scatter(OilExchange_test.loc[:, 'PerEURO'], PerEURO_pred)
plt.xlabel('Actual PerEURO')
plt.ylabel('Predicted PerEURO')
plt.title('Test data predictions')
plt.show()
../../_images/ef318a1591e6d2b3988e45b94d8f639b9f6452e5347929afea2ba385f99ce48f.png
# R2 for the test data
from sklearn.metrics import r2_score
r2_score(OilExchange_test.loc[:, 'PerEURO'], PerEURO_pred)
-0.9790095192618866
# Perform k-fold cross-validation with k=10
from sklearn.model_selection import cross_val_predict # NOTE: Not for time series!
PerEURO_cv = cross_val_predict(pipe, OilExchange_train.loc[:, OilExchange.columns[3:-6]], \
                               OilExchange_train.loc[:, 'PerEURO'], cv=10)

# Compute R^2 for PerEURO_cv
r2_cv = r2_score(OilExchange_train.loc[:, 'PerEURO'], PerEURO_cv)
print("Cross-validated R2: {:.3f}".format(r2_cv))
Cross-validated R2: -0.317
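The negative R² values above may look surprising. R² is computed as 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the observed values, so any prediction that is worse than simply predicting that mean gives a negative score. The sketch below uses made-up numbers and a hand-written helper (`r2_manual`, a name chosen here) to illustrate this.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_obs = np.array([8.0, 8.2, 8.4, 8.6])                  # made-up exchange rates
r2_mean = r2_manual(y_obs, np.full(4, y_obs.mean()))    # always predict the mean
r2_shifted = r2_manual(y_obs, y_obs + 1.0)              # right shape, wrong level

print("Predicting the mean:", r2_mean)       # zero, up to rounding
print("Shifted predictions:", r2_shifted)    # strongly negative (about -19)
```

A model can therefore follow the pattern of the series and still score far below zero if its level is off, which is common when the test period lies outside the range seen during training.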

Backtesting#

  • scikit-learn has a TimeSeriesSplit which creates segments for backtesting.

    • Expanding window is the default.

    • Sliding window can be applied by limiting the training set with max_train_size (combined with n_splits and, if needed, test_size).

  • We will use scikit-learn’s cross_validate to perform the cross-validation based on the backtesting segments (cross_val_predict assumes that all observations will be test data at some point).
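Since cross_val_predict cannot be used with these splits, out-of-sample predictions can instead be collected in a manual loop. The following is a sketch on synthetic data (the arrays `X_demo` and `y_demo` are invented for the example); samples that never appear in a test segment keep the value NaN.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data, only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = X_demo @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=60)

base_model = LinearRegression()
tscv_demo = TimeSeriesSplit(n_splits=5)

# NaN marks samples that are never part of a test segment
y_oos = np.full(len(y_demo), np.nan)
for train_idx, test_idx in tscv_demo.split(X_demo):
    model = clone(base_model).fit(X_demo[train_idx], y_demo[train_idx])
    y_oos[test_idx] = model.predict(X_demo[test_idx])

n_untested = int(np.isnan(y_oos).sum())
print("Samples without an out-of-sample prediction:", n_untested)
```

With 60 samples and 5 splits, the first block of samples is only ever used for training, so it has no out-of-sample prediction.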

# Backtesting using scikit-learn
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Some data
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

# Create time series cross-validation object with expanding window
tscv_expand = TimeSeriesSplit()
print(tscv_expand)
for i, (train_index, test_index) in enumerate(tscv_expand.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)
Fold 0:
  Train: index=[0]
  Test:  index=[1]
Fold 1:
  Train: index=[0 1]
  Test:  index=[2]
Fold 2:
  Train: index=[0 1 2]
  Test:  index=[3]
Fold 3:
  Train: index=[0 1 2 3]
  Test:  index=[4]
Fold 4:
  Train: index=[0 1 2 3 4]
  Test:  index=[5]
# Backtesting with sliding window
tscv_slide = TimeSeriesSplit(max_train_size=3, n_splits=3)
print(tscv_slide)
for i, (train_index, test_index) in enumerate(tscv_slide.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
TimeSeriesSplit(gap=0, max_train_size=3, n_splits=3, test_size=None)
Fold 0:
  Train: index=[0 1 2]
  Test:  index=[3]
Fold 1:
  Train: index=[1 2 3]
  Test:  index=[4]
Fold 2:
  Train: index=[2 3 4]
  Test:  index=[5]
# Backtesting with expanding window in the OilExchange data
tscv_expand = TimeSeriesSplit(n_splits=10)

# The segments
max_train = []
for i, (train_index, test_index) in enumerate(tscv_expand.split(OilExchange_train.loc[:, 'PerEURO'])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    max_train.append(max(train_index))
    print(f"  Test:  index={test_index}")
Fold 0:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
  Test:  index=[16 17 18 19 20 21 22 23 24 25 26 27 28 29]
Fold 1:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]
  Test:  index=[30 31 32 33 34 35 36 37 38 39 40 41 42 43]
Fold 2:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43]
  Test:  index=[44 45 46 47 48 49 50 51 52 53 54 55 56 57]
Fold 3:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57]
  Test:  index=[58 59 60 61 62 63 64 65 66 67 68 69 70 71]
Fold 4:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71]
  Test:  index=[72 73 74 75 76 77 78 79 80 81 82 83 84 85]
Fold 5:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85]
  Test:  index=[86 87 88 89 90 91 92 93 94 95 96 97 98 99]
Fold 6:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]
  Test:  index=[100 101 102 103 104 105 106 107 108 109 110 111 112 113]
Fold 7:
  Train: index=[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113]
  Test:  index=[114 115 116 117 118 119 120 121 122 123 124 125 126 127]
Fold 8:
  Train: index=[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127]
  Test:  index=[128 129 130 131 132 133 134 135 136 137 138 139 140 141]
Fold 9:
  Train: index=[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141]
  Test:  index=[142 143 144 145 146 147 148 149 150 151 152 153 154 155]
# Backtesting using expanding window with data
from sklearn.model_selection import cross_validate
scores = cross_validate(pipe, OilExchange_train.loc[:, OilExchange_train.columns[3:-6]], \
                                 OilExchange_train.loc[:, 'PerEURO'], cv=tscv_expand, \
                                    scoring='r2', return_train_score=True)
scores
{'fit_time': array([0.00195408, 0.00175095, 0.00210977, 0.00187683, 0.00237203,
        0.00539994, 0.00127292, 0.0011971 , 0.00227928, 0.00190496]),
 'score_time': array([0.00100303, 0.00081325, 0.00069714, 0.00062108, 0.00072694,
        0.00061488, 0.00054789, 0.00053978, 0.00095177, 0.00106192]),
 'test_score': array([-2.58120320e+03, -4.60283547e+00, -7.14984703e+00,  1.63052775e-01,
        -5.45069739e+00, -1.00734416e+11, -1.62104880e+00, -2.82911176e-02,
        -3.42287951e+00, -1.33079128e+01]),
 'train_score': array([1.        , 0.95679732, 0.92352087, 0.85969706, 0.84590771,
        0.78920227, 0.69881999, 0.68809706, 0.65576638, 0.64771763])}
# Plot the backtesting results for train and test data, and under it add the original data (PerEURO) as a subplot
plt.subplot(2,1,1)
plt.plot(scores['train_score'], label='Train')
plt.plot(scores['test_score'], label='Test')
plt.xlabel('Fold')
plt.ylabel('R$^2$')
plt.title('Backtesting results')
plt.axhline(0, color='gray', linestyle='--')
plt.ylim(-1.6,1.1)
plt.legend()
plt.subplot(2,1,2)
plt.plot(OilExchange_train.loc[:, 'PerEURO'])
for i in range(10):
    plt.axvline(x=max_train[i], color='gray', linestyle='--')
plt.xlabel('Time')
plt.ylabel('PerEURO')
plt.show()
../../_images/cdee03cec2e24e0145d9b873168ef71efcccf5e83963adac003bc45bbc6f0431.png

Question: Does the behaviour make sense with regard to what is included in and predicted from the model?

# Backtesting with sliding window in the OilExchange data
tscv_slide = TimeSeriesSplit(max_train_size=45, n_splits=10)

# The segments
max_train = []
for i, (train_index, test_index) in enumerate(tscv_slide.split(OilExchange_train.loc[:, 'PerEURO'])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    max_train.append(max(train_index))
    print(f"  Test:  index={test_index}")
Fold 0:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
  Test:  index=[16 17 18 19 20 21 22 23 24 25 26 27 28 29]
Fold 1:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]
  Test:  index=[30 31 32 33 34 35 36 37 38 39 40 41 42 43]
Fold 2:
  Train: index=[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43]
  Test:  index=[44 45 46 47 48 49 50 51 52 53 54 55 56 57]
Fold 3:
  Train: index=[13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57]
  Test:  index=[58 59 60 61 62 63 64 65 66 67 68 69 70 71]
Fold 4:
  Train: index=[27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71]
  Test:  index=[72 73 74 75 76 77 78 79 80 81 82 83 84 85]
Fold 5:
  Train: index=[41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85]
  Test:  index=[86 87 88 89 90 91 92 93 94 95 96 97 98 99]
Fold 6:
  Train: index=[55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
  Test:  index=[100 101 102 103 104 105 106 107 108 109 110 111 112 113]
Fold 7:
  Train: index=[ 69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104
 105 106 107 108 109 110 111 112 113]
  Test:  index=[114 115 116 117 118 119 120 121 122 123 124 125 126 127]
Fold 8:
  Train: index=[ 83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
 119 120 121 122 123 124 125 126 127]
  Test:  index=[128 129 130 131 132 133 134 135 136 137 138 139 140 141]
Fold 9:
  Train: index=[ 97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
 133 134 135 136 137 138 139 140 141]
  Test:  index=[142 143 144 145 146 147 148 149 150 151 152 153 154 155]
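# Backtesting using sliding window with data
# NOTE: columns[3:-4] also includes ly.var and l2y.var (lagged PerEURO), unlike the 3:-6 slice used for the expanding window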
from sklearn.model_selection import cross_validate
scores = cross_validate(pipe, OilExchange_train.loc[:, OilExchange_train.columns[3:-4]], \
                                 OilExchange_train.loc[:, 'PerEURO'], cv=tscv_slide, \
                                    scoring='r2', return_train_score=True)
scores
{'fit_time': array([0.00230217, 0.001899  , 0.00173306, 0.00146484, 0.00147605,
        0.00135684, 0.00143504, 0.00137067, 0.00139475, 0.00138879]),
 'score_time': array([0.00081801, 0.00082326, 0.00068307, 0.00069213, 0.00061488,
        0.00069833, 0.00060296, 0.00066924, 0.00062108, 0.00066304]),
 'test_score': array([-2.69048314e+03, -1.41149796e+00, -1.48578129e+00,  2.15110053e-01,
        -7.83627964e+00, -1.40606556e+12,  5.20015261e-01, -2.78711930e+00,
        -7.94540728e+04,  3.87646039e-01]),
 'train_score': array([1.        , 0.99453091, 0.95716141, 0.94470747, 0.96005874,
        0.90744159, 0.75351647, 0.96129276, 0.93446863, 0.94984339])}
# Plot the backtesting results for train and test data, and under it add the original data (PerEURO) as a subplot
plt.subplot(2,1,1)
plt.plot(scores['train_score'], label='Train')
plt.plot(scores['test_score'], label='Test')
plt.xlabel('Fold')
plt.ylabel('R$^2$')
plt.title('Backtesting results')
plt.axhline(0, color='gray', linestyle='--')
plt.ylim(-1.6,1.1)
plt.legend()
plt.subplot(2,1,2)
plt.plot(OilExchange_train.loc[:, 'PerEURO'])
for i in range(10):
    plt.axvline(x=max_train[i], color='gray', linestyle='--')
plt.xlabel('Time')
plt.ylabel('PerEURO')
plt.show()
../../_images/2f43dacf0a71e4b718d2953248f87ad21a254f4cd998ce4311575ea6b6e207e4.png

Question: Again, does the behaviour make sense with regard to what is included in and predicted from the model?

Exercise#

  • Repeat the PerEURO predictions, but exchange LinearRegression with scikit-learn’s PLSRegression.

  • Check if the number of components in the PLS model has an effect on the explained variance (\(\text{R}^2\)), either manually or using a GridSearchCV.

Including the response variable in the predictors#

  • As long as the training and test sets do not overlap, we can include lagged values of the response as predictors.

  • The lagged response can be added as a single variable or as several variables (i.e., several different lags).

  • We will later look at ARIMA-type models where time lag is the main mechanism for modelling.

# Add the PerEURO column to the OilExchange data, lagged by 1 timepoint (row t holds the value from t-1; the missing first value is backfilled)
OilExchange_train['PerEURO_lag1'] = OilExchange_train.PerEURO.shift(1).bfill()
OilExchange_train.head()
Date PerEURO PerUSD KeyIntRate LoanIntRate EuroIntRate CPI OilSpotPrice ImpOldShip ImpNewShip ... TrBal TrBalExShipOilPlat TrBalMland ly.var l2y.var l.CPI ExcChange Testrain season PerEURO_lag1
0 2000-01-01 8.1215 8.0129 5.500000 7.500000 3.04 104.1 25.855741 114 915 ... 18575 19238 -3257 8.0968 8.1907 103.6 Increase True winter 8.1215
1 2000-02-01 8.0991 8.2361 5.500000 7.500000 3.28 104.6 27.317905 527 359 ... 14217 17200 -4529 8.1215 8.0968 104.1 Decrease True winter 8.1215
2 2000-03-01 8.1110 8.4111 5.500000 7.500000 3.51 104.7 26.509183 1385 929 ... 13697 18380 -5562 8.0991 8.1215 104.6 Increase True Spring 8.0991
3 2000-04-01 8.1538 8.6081 5.632353 7.632353 3.69 105.1 21.558821 450 2194 ... 13142 15499 -5147 8.1110 8.0991 104.7 Increase True Spring 8.1110
4 2000-05-01 8.2015 9.0471 5.750000 7.750000 3.92 105.1 25.147242 239 608 ... 17733 18505 -5732 8.1538 8.1110 105.1 Increase True Spring 8.1538

5 rows × 29 columns

# Backtesting using sliding window with data
from sklearn.model_selection import cross_validate #       Negative indexing is scary!      -->
scores = cross_validate(pipe, pd.concat([OilExchange_train.loc[:, OilExchange_train.columns[3:-7]], OilExchange_train["PerEURO_lag1"]], axis=1), \
                                 OilExchange_train.loc[:, 'PerEURO'], cv=tscv_slide, \
                                    scoring='r2', return_train_score=True)
scores
{'fit_time': array([0.00173879, 0.0019269 , 0.00178409, 0.00138211, 0.00150299,
        0.0016129 , 0.00126004, 0.00336099, 0.00129914, 0.00119996]),
 'score_time': array([0.0009172 , 0.00081825, 0.00063276, 0.00068784, 0.0007329 ,
        0.00077915, 0.00056481, 0.00095487, 0.00057602, 0.00055313]),
 'test_score': array([-3.23917569e+00, -1.51173590e+00, -2.91230071e+00,  1.34622553e-01,
        -6.81986837e+00, -1.02963323e+12,  5.94148027e-01, -2.01189464e+00,
        -1.67957480e+04,  3.97449531e-01]),
 'train_score': array([1.        , 0.9921763 , 0.95255996, 0.94041573, 0.95467944,
        0.88697053, 0.75049267, 0.95403986, 0.93250471, 0.94798768])}
# Plot the backtesting results for train and test data, and under it add the original data (PerEURO) as a subplot
plt.subplot(2,1,1)
plt.plot(scores['train_score'], label='Train')
plt.plot(scores['test_score'], label='Test')
plt.xlabel('Fold')
plt.ylabel('R$^2$')
plt.title('Backtesting results')
plt.axhline(0, color='gray', linestyle='--')
plt.ylim(-1.6,1.1)
plt.legend()
plt.subplot(2,1,2)
plt.plot(OilExchange_train.loc[:, 'PerEURO'])
for i in range(10):
    plt.axvline(x=max_train[i], color='gray', linestyle='--')
plt.xlabel('Time')
plt.ylabel('PerEURO')
plt.show()
../../_images/2de7b0e9f89e4ac2fafc5391fcae1b017c76a8820986c3b07ecc723d35c66053.png

Five lags#

OilExchange_train['PerEURO_lag2'] = OilExchange_train.PerEURO.shift(2).bfill()
OilExchange_train['PerEURO_lag3'] = OilExchange_train.PerEURO.shift(3).bfill()
OilExchange_train['PerEURO_lag4'] = OilExchange_train.PerEURO.shift(4).bfill()
OilExchange_train['PerEURO_lag5'] = OilExchange_train.PerEURO.shift(5).bfill()
OilExchange_train.head()
Date PerEURO PerUSD KeyIntRate LoanIntRate EuroIntRate CPI OilSpotPrice ImpOldShip ImpNewShip ... l2y.var l.CPI ExcChange Testrain season PerEURO_lag1 PerEURO_lag2 PerEURO_lag3 PerEURO_lag4 PerEURO_lag5
0 2000-01-01 8.1215 8.0129 5.500000 7.500000 3.04 104.1 25.855741 114 915 ... 8.1907 103.6 Increase True winter 8.1215 8.1215 8.1215 8.1215 8.1215
1 2000-02-01 8.0991 8.2361 5.500000 7.500000 3.28 104.6 27.317905 527 359 ... 8.0968 104.1 Decrease True winter 8.1215 8.1215 8.1215 8.1215 8.1215
2 2000-03-01 8.1110 8.4111 5.500000 7.500000 3.51 104.7 26.509183 1385 929 ... 8.1215 104.6 Increase True Spring 8.0991 8.1215 8.1215 8.1215 8.1215
3 2000-04-01 8.1538 8.6081 5.632353 7.632353 3.69 105.1 21.558821 450 2194 ... 8.0991 104.7 Increase True Spring 8.1110 8.0991 8.1215 8.1215 8.1215
4 2000-05-01 8.2015 9.0471 5.750000 7.750000 3.92 105.1 25.147242 239 608 ... 8.1110 105.1 Increase True Spring 8.1538 8.1110 8.0991 8.1215 8.1215

5 rows × 33 columns

# Backtesting using sliding window with data
from sklearn.model_selection import cross_validate #       Negative indexing is scary!      -->
scores = cross_validate(pipe, pd.concat([OilExchange_train.loc[:, OilExchange_train.columns[3:-11]], 
                                         OilExchange_train[["PerEURO_lag1","PerEURO_lag2","PerEURO_lag3","PerEURO_lag4","PerEURO_lag5"]]], axis=1),
                                         OilExchange_train.loc[:, 'PerEURO'], cv=tscv_slide,
                                         scoring='r2', return_train_score=True)
scores
{'fit_time': array([0.0021739 , 0.00212979, 0.00186086, 0.00131178, 0.00125098,
        0.00133705, 0.00155401, 0.0014317 , 0.00167513, 0.00162411]),
 'score_time': array([0.00083423, 0.00070715, 0.00065827, 0.0005579 , 0.00057292,
        0.00090098, 0.00063491, 0.00063896, 0.00060487, 0.00072289]),
 'test_score': array([-5.69820756e-01, -1.31098583e+00, -4.22244308e+00, -1.10809063e+00,
        -5.93617580e+00, -1.55149966e+12,  5.51939962e-01, -2.81878260e+00,
        -2.51925736e+00,  7.09248896e-01]),
 'train_score': array([1.        , 0.99617589, 0.97075986, 0.95156849, 0.96162574,
        0.92555208, 0.76327123, 0.97259431, 0.95439249, 0.96566249])}
# Plot the backtesting results for train and test data, and under it add the original data (PerEURO) as a subplot
plt.subplot(2,1,1)
plt.plot(scores['train_score'], label='Train')
plt.plot(scores['test_score'], label='Test')
plt.xlabel('Fold')
plt.ylabel('R$^2$')
plt.title('Backtesting results')
plt.axhline(0, color='gray', linestyle='--')
plt.ylim(-1.6,1.1)
plt.legend()
plt.subplot(2,1,2)
plt.plot(OilExchange_train.loc[:, 'PerEURO'])
for i in range(10):
    plt.axvline(x=max_train[i], color='gray', linestyle='--')
plt.xlabel('Time')
plt.ylabel('PerEURO')
plt.show()
../../_images/7289de878e61450bce4821b89e361279e123c80596826043a32ccf0e283b4942.png
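The negative column slices used above have to be adjusted every time a column is appended. One alternative, sketched here on a toy DataFrame, is to create the lags in a loop and select predictors by name. The helper `add_lags` and the toy data are assumptions for illustration. Note also that this sketch drops the first rows with incomplete history instead of backfilling them, which is a different choice from the code above.

```python
import numpy as np
import pandas as pd

def add_lags(df, column, lags):
    """Return a copy of df with columns '<column>_lag<k>' for each k in lags."""
    out = df.copy()
    for k in lags:
        out[f"{column}_lag{k}"] = out[column].shift(k)   # row t holds the value from t-k
    return out

# Toy data with the same kind of columns as OilExchange (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "PerEURO": 8 + rng.normal(scale=0.1, size=12).cumsum(),
    "KeyIntRate": rng.normal(size=12),
    "season": ["winter"] * 12,        # non-numeric column that must not be a predictor
})

# Rows without a complete lag history are dropped rather than backfilled
toy_lagged = add_lags(toy, "PerEURO", lags=range(1, 6)).dropna()

base_predictors = ["KeyIntRate"]
lag_columns = [c for c in toy_lagged.columns if c.startswith("PerEURO_lag")]
X_toy = toy_lagged[base_predictors + lag_columns]
y_toy = toy_lagged["PerEURO"]
print(X_toy.columns.tolist())
```

Selecting by name keeps the predictor set stable when more lag columns are added later.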

Correlation between time series#

  • To get an impression of the connection between different variables, one can compute correlations, e.g., in the form of a correlation matrix.

  • If one expects one variable to affect another variable at a later time, correlation with a lag can be computed.

  • The degree of connection between two time series may also be dependent on time.

    • A Sliding Window Correlation (SWC) shows local correlation in time windows.

    • The window size (and possible lag) can be tuned for series of quick or slow changes.

  • Note: Correlation does not equal causation.

    • There may not be a cause and effect, even though two phenomena show similar patterns. Beautifully illustrated by Tyler Vigen.

  • The concept of Autocorrelation will be covered later.
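A lagged correlation can be sketched with Pandas’ shift(), which pairs y at time t with x at time t - lag and leaves the missing pairs to be dropped by corr(). The data below are simulated so that y follows x with a delay of 5 time points; the names `x_sim`, `y_sim` and `lagged_corr` are chosen for this example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x_sim = pd.Series(rng.normal(size=200), name="x")
# y follows x with a delay of 5 time points, plus noise (simulated data)
y_sim = (x_sim.shift(5) + rng.normal(scale=0.5, size=200)).rename("y")

def lagged_corr(y, x, lag):
    """Pearson correlation between y[t] and x[t - lag]; NaN pairs are dropped by pandas."""
    return y.corr(x.shift(lag))

corr_by_lag = {lag: lagged_corr(y_sim, x_sim, lag) for lag in range(0, 11)}
best_lag = max(corr_by_lag, key=lambda k: abs(corr_by_lag[k]))
print("Lag with the strongest correlation:", best_lag)

# The same number computed from positional slices with NumPy
corr_numpy = np.corrcoef(y_sim.iloc[5:], x_sim.iloc[:len(x_sim) - 5])[0, 1]
print(corr_by_lag[5], corr_numpy)
```

Scanning a range of lags like this only indicates where the series line up best; it says nothing about cause and effect.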

PerEURO_ExpNatGas_corr = np.corrcoef(OilExchange['PerEURO'], OilExchange['ExpNatGas'])
PerEURO_ExpNatGas_corr_lagged = np.corrcoef(OilExchange['PerEURO'][10:], OilExchange['ExpNatGas'][0:len(OilExchange['ExpNatGas'])-10])
print("Correlation between PerEURO and ExpNatGas: {:.3f}".format(PerEURO_ExpNatGas_corr[0,1]))
print("Correlation between PerEURO and ExpNatGas lagged 10 timepoints: {:.3f}".format(PerEURO_ExpNatGas_corr_lagged[0,1]))
Correlation between PerEURO and ExpNatGas: -0.000
Correlation between PerEURO and ExpNatGas lagged 10 timepoints: 0.093
# Use ipywidgets to create a slider for the lag
from ipywidgets import interact

x = OilExchange['ExpNatGas']
y = OilExchange['PerEURO']

def lagged_correlation(lag=0):
    # Pair y at time t with x at time t-lag using positional slices;
    # the index is left untouched, so the original data are not modified
    corr = np.corrcoef(y.iloc[lag:], x.iloc[:len(x)-lag])
    print("Correlation between {} and {} lagged {} timepoints: {:.3f}".format(x.name, y.name, lag, corr[0,1]))

interact(lagged_correlation, lag=(0,100,1)); # Semi-colon to suppress output
# Sliding window correlation with window size 45
PerEURO_ExpNatGas_SWC = OilExchange['PerEURO'].rolling(45, center=True).corr(OilExchange['ExpNatGas'])

# Plot PerEURO, ExpNatGas and PerEURO_ExpNatGas_SWC as subplots
def plot_SWC(center=22):
    plt.subplot(3,1,1)
    plt.plot(OilExchange['PerEURO'])
    # The centred 45-point window covers center-22 .. center+22 (inclusive)
    plt.plot(range(center-22,center+23), OilExchange['PerEURO'][center-22:center+23], color="red")
    plt.ylabel('PerEURO')
    plt.xlim(0, len(OilExchange['PerEURO']))
    plt.subplot(3,1,2)
    plt.plot(OilExchange['ExpNatGas'])
    plt.plot(range(center-22,center+23), OilExchange['ExpNatGas'][center-22:center+23], color="red")
    plt.ylabel('ExpNatGas')
    plt.xlim(0, len(OilExchange['PerEURO']))
    plt.subplot(3,1,3)
    plt.plot(PerEURO_ExpNatGas_SWC)
    plt.plot(center, PerEURO_ExpNatGas_SWC[center], 'r.')
    plt.axhline(y=0, color='gray', linestyle=':')
    plt.ylim(-1,1)
    plt.xlim(0, len(OilExchange['PerEURO']))
    plt.xlabel('Time')
    plt.ylabel('SWC')
    plt.show()

interact(plot_SWC, center=(22,len(OilExchange['PerEURO'])-23,1)); # Semi-colon to suppress output

Note

If the lag approaches the length of the series, few points are included in the calculations.

OilExchange['ExpNatGas']
0       4054
1       3803
2       4017
3       3331
4       2642
       ...  
174    13766
175    13069
176    13946
177    21988
178    19070
Name: ExpNatGas, Length: 179, dtype: int64

Pandas’ rolling()#

  • When computing a rolling correlation with Pandas’ rolling().corr(), the two series are matched on their index.

  • Therefore, we need to shift the index of ExpNatGas to achieve a lag.

  • Because of the sliding window, the two series do not need to match in length.

OE = OilExchange['ExpNatGas'].copy() # <- Remember to copy, to avoid changing the original data!
OE.index += 10
plt.plot(OilExchange['PerEURO'].rolling(45, center=True).corr(OE))
plt.xlim(0, len(OilExchange['PerEURO']))
plt.show()
OE.index
../../_images/170a5ff53c035d0901cb7a366c5f73ccd3416930d07a3957718e9ebef7cd3f96.png
RangeIndex(start=10, stop=189, step=1)
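The index matching can be checked on a small example. The sketch below uses two random toy series (`y_a`, `x_b`, invented here), shifts the index of one of them by a lag of 3, and compares one value of the rolling correlation with a direct NumPy computation on the corresponding positions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y_a = pd.Series(rng.normal(size=30), name="y")
x_b = pd.Series(rng.normal(size=30), name="x")

lag, window = 3, 5
x_lagged = x_b.copy()                    # copy, so x_b keeps its original index
x_lagged.index = x_lagged.index + lag    # label t now holds x_b at position t - lag

# Rolling correlation; the series are matched on index labels
swc = y_a.rolling(window, center=True).corr(x_lagged)

# Manual check for the window centred at label t = 10 (labels 8..12)
t = 10
half = window // 2
manual = np.corrcoef(y_a.iloc[t - half:t + half + 1],
                     x_b.iloc[t - half - lag:t + half + 1 - lag])[0, 1]
print(swc.loc[t], manual)
```

The two numbers agree, which confirms that shifting the index by the lag pairs each value of y with the value of x from `lag` time points earlier.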

Exercise#

  • Combine lag and sliding window correlation.

  • Use ipywidgets to control:

    • window width

    • lag

    • selected variable to compare to PerEURO

    • bonus: visualize the sliding window like above