SIG Data Exercise

The goal of this exercise is to use per-minute pricing data from six stocks in order to estimate the volatility over the upcoming month, as well as to provide a measure of our confidence in that guess.

Results:

The results, obtained below, are summarized as:

----------------------------------------
Annualized Volatility
----------------------------------------
Stock "a": 14.0% +/- 0.8%
Stock "b": 23.7% +/- 2.3%
Stock "c": 15.3% +/- 2.4%
Stock "d": 13.0% +/- 0.8%
Stock "e": 16.8% +/- 0.9%
Stock "f": 17.3% +/- 1.1%

Contents:

In [1]:
import numpy as np
import pandas as pd
from scipy import ndimage, stats, optimize

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.metrics import mean_squared_error

%matplotlib inline
In [2]:
plt.rc('figure', facecolor='w', dpi=110)

Raw Data

We begin by reading the provided data and extracting time-of-day information:

In [3]:
data = pd.read_csv('stockdata3.zip')
In [4]:
data['hour'] = [int(t[:2]) for t in data['timestr']]
data['minute'] = [int(t[3:5]) for t in data['timestr']]
data['dayfrac'] = data['hour'] / 24 + data['minute'] / 60 / 24
data['time'] = data['day'] + data['dayfrac']

We check the basic structure of the data:

In [5]:
data.head()
Out[5]:
day timestr a b c d e f hour minute dayfrac time
0 1 09:30:00 325.450 13.795 94.500 49.985 49.93 17.025 9 30 0.395833 1.395833
1 1 09:31:00 325.245 13.890 94.515 49.990 49.96 17.025 9 31 0.396528 1.396528
2 1 09:32:00 325.580 13.905 94.565 49.995 49.96 17.025 9 32 0.397222 1.397222
3 1 09:33:00 325.470 13.955 94.645 50.065 49.92 17.025 9 33 0.397917 1.397917
4 1 09:34:00 325.295 13.975 94.580 50.030 49.90 17.025 9 34 0.398611 1.398611
In [6]:
data.tail()
Out[6]:
day timestr a b c d e f hour minute dayfrac time
98347 362 15:56:00 404.825 5.515 45.352 45.305 43.57 10.745 15 56 0.663889 362.663889
98348 362 15:57:00 404.815 5.615 45.362 45.795 43.61 10.745 15 57 0.664583 362.664583
98349 362 15:58:00 404.915 5.625 45.352 45.805 43.63 10.755 15 58 0.665278 362.665278
98350 362 15:59:00 404.975 5.615 45.333 45.795 43.62 10.755 15 59 0.665972 362.665972
98351 362 16:00:00 404.995 5.605 45.303 45.755 43.65 10.755 16 0 0.666667 362.666667
In [7]:
data.describe()
Out[7]:
day a b c d e f hour minute dayfrac time
count 98352.00000 98281.000000 98352.000000 98321.000000 98334.000000 98352.000000 96981.000000 98352.000000 98352.000000 98352.000000 98352.000000
mean 181.29732 363.767737 10.048836 69.445570 49.803747 47.727738 13.204838 12.237158 30.577416 0.531116 181.828436
std 104.61705 28.281333 3.332643 28.675762 4.539923 7.053875 2.404705 1.894421 17.326864 0.078377 104.616893
min 1.00000 0.000000 4.885000 33.807000 1.000000 38.550000 9.365000 9.000000 0.000000 0.395833 1.395833
25% 89.00000 341.785000 6.605000 45.330000 46.175000 42.040000 10.945000 11.000000 16.000000 0.463194 89.635937
50% 180.00000 357.915000 10.595000 50.960000 50.105000 45.410000 13.325000 12.000000 32.000000 0.531250 180.604514
75% 271.00000 387.890000 13.465000 101.905000 53.735000 51.350000 15.035000 14.000000 46.000000 0.599306 271.573090
max 362.00000 426.560000 15.335000 116.385000 62.137000 69.210000 18.815000 16.000000 59.000000 0.666667 362.666667

The number of days and number of data points per day are:

In [8]:
np.unique(data['day']).size
Out[8]:
252
In [9]:
np.unique(data['dayfrac']).size
Out[9]:
391

We can see that data is only given on weekdays, and some weekends are long weekends:

In [10]:
np.unique(data['day'])[:30]
Out[10]:
array([ 1,  2,  3,  4,  5,  8,  9, 10, 11, 12, 16, 17, 18, 19, 22, 23, 24,
       25, 26, 29, 30, 31, 32, 33, 36, 37, 38, 39, 40, 43])

Initial Exploration

The raw data looks like so:

In [11]:
STOCKS = 'a b c d e f'.split()
In [12]:
def plot_timeseries(data, floor=None, axs=None, **kw):
    if axs is None:
        fig, axs = plt.subplots(3, 2, figsize=(8,5))
    else:
        fig = axs[0,0].figure
    for (stock, ax) in zip(STOCKS, axs.ravel()):
        ax.plot(data['time'], data[stock], lw=1, **kw)
        ax.set_title(stock)
        if floor is not None:
            ax.set_ylim(floor)
    plt.tight_layout()
    return fig, axs

plot_timeseries(data);
In [13]:
def plot_hist(data):
    fig, axs = plt.subplots(2, 3, figsize=(9,5))
    for (stock, ax) in zip(STOCKS, axs.ravel()):
        ax.hist(data[stock], bins=20)
        ax.set_title(stock)
    plt.tight_layout()
    
plot_hist(data)
/home/mike/.local/lib/python3.7/site-packages/numpy/lib/histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
/home/mike/.local/lib/python3.7/site-packages/numpy/lib/histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)

We can see that some samples are missing:

In [14]:
data.isna().sum()
Out[14]:
day           0
timestr       0
a            71
b             0
c            31
d            18
e             0
f          1371
hour          0
minute        0
dayfrac       0
time          0
dtype: int64

We also see that for some stocks, a number of data points drop to zero or near-zero in an unrealistic way:

In [15]:
data.min()
Out[15]:
day               1
timestr    09:30:00
a                 0
b             4.885
c            33.807
d                 1
e             38.55
f             9.365
hour              9
minute            0
dayfrac    0.395833
time        1.39583
dtype: object

Data Cleaning

We'll apply a simple cleaning where the unrealistic small values are treated as missing, and then missing data is forward-filled from the previous available data:

In [16]:
cleandata = data.copy()
for stock in STOCKS:
    col = cleandata[stock]
    cleandata[stock] = np.where(col > 1, col, np.nan)
cleandata.fillna(method='pad', inplace=True)
In [17]:
plot_timeseries(cleandata);

It may be instructive to take a closer look at the per-day data. Here's the first month:

In [18]:
dotkw = dict(ls='none', marker='.', ms=.5)
In [19]:
plot_timeseries(cleandata[cleandata['day'] <= 28], **dotkw);

And here's a zoom to the first couple of days:

In [20]:
plot_timeseries(cleandata[cleandata['day'] <= 4], **dotkw);

For some stocks, there is rapid variation, presumably from algorithmic high frequency trading, that does not reflect the overall trends. We will assume this could be separately modeled if needed, and so we remove this effect by smoothing with a Gaussian kernel with a width of one minute:

In [21]:
smoothdata = cleandata.copy()
for stock in STOCKS:
    smoothdata[stock] = ndimage.gaussian_filter(smoothdata[stock].values, 60) # 60 seconds
In [22]:
plot_timeseries(smoothdata, floor=0);

Again, we look at some closer zooms

In [23]:
for n in (28, 4):
    plot_timeseries(smoothdata[smoothdata['day'] <= n], **dotkw)

As a final sanity check, we compare the smoothed and unsmoothed data directly:

In [24]:
for n in (366, 28, 4):
    fig, axs = plot_timeseries(cleandata[cleandata['day'] <= n], alpha=.5, **dotkw)
    plot_timeseries(smoothdata[smoothdata['day'] <= n], axs=axs, color='C3', **dotkw);

Returns and Volatility

DISCLAIMER

I personally have very little prior knowledge of financial concepts such as volatility, so I will be making some potentially questionable assumptions in the following.

Approach

According to e.g. [1], volatility predictions are not greatly enhanced by using high frequency sampling. Therefore, we will simplify the problem to one regarding daily opening or closing prices.

We will not, however, follow that paper exactly. The authors work with log-returns and squared variance, but in my brief review these choices vary. I will be working with absolute fractional returns and RMS variance (i.e. standard deviation). The authors also recommend a number of choices for predictive features, whereas I will only use two choices I readily understand: RMS variance and range.

[1] Eric Ghyselsa, Pedro Santa-Clarab, Rossen Valkanov. 2006. "Predicting volatility: getting the most out of return data sampled at different frequencies", Journal of Econometrics. https://doi.org/10.1016/j.jeconom.2005.01.004

It's not obvious to me that either opening or closing prices are more appropriate. We can obtain and compare each series:

In [25]:
opening = smoothdata[(smoothdata['hour'].values == 9) & (smoothdata['minute'].values == 30)]
closing = smoothdata[(smoothdata['hour'].values == 16) & (smoothdata['minute'].values == 0)]
len(opening), len(closing)
Out[25]:
(252, 251)
In [26]:
fig, axs = plot_timeseries(opening, **dotkw)
plot_timeseries(opening, axs=axs, **dotkw);
In [27]:
fig, ax = plt.subplots(figsize=(3,3))
ax.plot(opening[STOCKS].values[:-1].ravel(), closing[STOCKS].values.ravel(), '.', ms=.5)
lim = 0, opening[STOCKS].values.max()
ax.plot(lim, lim, 'k--', lw=.5, zorder=-10);

Opening and closing prices are generally well-correlated, so we will stick with just opening price.

Next we compute month-over-month fractional returns. For the sake of simplicity in this exercise, we approximate that a month is the same as 20 trading days. With knowledge of real calendar dates, it may be more appropriate to use that information explicitly.

In [28]:
N = 20  # assuming 5 days * 4 weeks
returns = pd.DataFrame({
    stock: opening[stock].values[N:] / opening[stock].values[:-N] - 1
    for stock in STOCKS
})
returns['starttime'] = opening['time'].values[:-N]
returns['endtime'] = returns['time'] = opening['time'].values[N:]
In [29]:
returns.head()
Out[29]:
a b c d e f starttime endtime time
0 0.038173 -0.007060 0.023880 0.013695 0.052411 -0.065324 1.395833 30.395833 30.395833
1 0.044529 0.007673 0.041766 0.033276 0.050579 -0.075496 2.395833 31.395833 31.395833
2 0.036435 -0.000143 0.056578 0.038648 -0.025144 -0.109901 3.395833 32.395833 32.395833
3 0.030337 -0.015647 0.049419 0.034389 -0.027135 -0.134236 4.395833 33.395833 33.395833
4 0.043979 -0.002551 0.061552 0.058458 0.002142 -0.130663 5.395833 36.395833 36.395833
In [30]:
def plot_months(axs):
    """Helper function to add a monthly grid to timeseries plots."""
    for ax in np.ravel(axs):
        for d in range(0, 366, 28):
            ax.axvline(d, color='k', ls=':', lw=.5, alpha=.5)
In [31]:
fig, axs = plot_timeseries(returns, **dotkw)
plot_months(axs)

In this approach, returns are assigned to the end date of each month-long period, and thus we have no return information for the first month.

For our measure of historical volatility, we will use a rolling time window standard deviation. Here, too, we will use a month-long window.

In [32]:
monthly_stddev_returns = pd.DataFrame({
    stock: returns[stock].rolling(N).std()
    for stock in STOCKS
})
monthly_stddev_returns['time'] = returns['time'].values
In [33]:
fig, axs = plot_timeseries(monthly_stddev_returns, floor=0, **dotkw)
plot_months(axs)

From the above plots, we see that many stocks have some monthly oscillations in volatility -- depending on the reference point for start and end of the month -- that we can hope to model.

In [1], it was suggested that the readily-available range of returns, as opposed to the above standard deviation, is surprisingly effective for prediction. We compute that here:

In [34]:
monthly_range_returns = pd.DataFrame({
    stock: returns[stock].rolling(N).max() - returns[stock].rolling(N).min()
    for stock in STOCKS
})
monthly_range_returns['time'] = returns['time'].values
In [35]:
fig, axs = plot_timeseries(monthly_range_returns, floor=0, **dotkw)
plot_months(axs)

Supervised Learning Setup

We will now frame this as a supervised forecasting problem:

  • Input features: historical volatility over the last two months (approximated as 40 trading days)
  • Output prediction: implied volatility estimate over the next month

For supervised learning, we want an array X with shape (n_samples, n_features). First, we quickly validate the relevant list comprehension:

In [36]:
np.array([
    (i, stock)
    for i in range(4*N, len(monthly_stddev_returns))
    for stock in STOCKS
])
Out[36]:
array([['80', 'a'],
       ['80', 'b'],
       ['80', 'c'],
       ...,
       ['231', 'd'],
       ['231', 'e'],
       ['231', 'f']], dtype='<U21')

Next we construct predictive inputs, as well as target variable Y, from the historical data.

In [37]:
X_stddev = np.array([
    monthly_stddev_returns[stock].values[i-2*N:i]
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])
X_range = np.array([
    monthly_range_returns[stock].values[i-2*N:i]
    for i in range(4*N, len(monthly_range_returns)) for stock in STOCKS
])
X_both = np.array([
    np.r_[
        monthly_stddev_returns[stock].values[i-2*N:i],
        monthly_range_returns[stock].values[i-2*N:i]]
    for i in range(4*N, len(monthly_range_returns)) for stock in STOCKS
])
Y = np.array([
    monthly_stddev_returns[stock].values[i]
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])

We will also consider modeling each stock separately:

In [38]:
stocks_X = np.array([
    stock
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])

Model Selection

Our primary model will be a straightforward linear regression. We will also consider a trivial persistence model, for which the prediction is simply the most recent input value:

In [39]:
class PersistenceRegression:
    def fit(self, X, y):
        return
    def get_params(self, *a, **kw):
        return {}
    def predict(self, X):
        return X[:,-1]

We will test combinations of model types and feature types:

In [40]:
models = [
    ('Persistence', PersistenceRegression(), 'STDDEV', X_stddev),
    ('LinearRegression', LinearRegression(), 'STDDEV', X_stddev),
    ('LinearRegression', LinearRegression(), 'RANGE', X_range),
    ('LinearRegression', LinearRegression(), 'BOTH', X_both),
]

Our model evaluation will be based on RMSE under cross validation with many overlapping folds and a small test set. We take this approach because our final model will be trained on the entire available time series, so we'd like our evaluation to reflect the performance closely. We will also display MSE to understand the impact of metric choice.

In [41]:
cv = ShuffleSplit(n_splits=30, train_size=0.9, random_state=37)
for (kind, model, features, X) in models:
    scores = -cross_val_score(model, X, Y, cv=cv, scoring='neg_mean_squared_error', )
    mse = np.mean(scores)
    rmse = np.mean(np.sqrt(scores))
    print('{:>30s} :  MSE = {:.6f} (all-stock model)'.format(f'{kind}/{features}', mse))
    print('{:>30s} : RMSE = {:.6f} (all-stock model)'.format(f'{kind}/{features}', rmse))
    scores = np.array([
        -cross_val_score(model, X[stocks_X==stock], Y[stocks_X==stock],
                         cv=cv, scoring='neg_mean_squared_error')
        for stock in STOCKS])
    mse = np.mean(scores)
    rmse = np.mean(np.sqrt(scores))
    print('{:>30s} :  MSE = {:.6f} (per-stock model)'.format(f'{kind}/{features}', mse))
    print('{:>30s} : RMSE = {:.6f} (per-stock model)'.format(f'{kind}/{features}', rmse))
    print()
            Persistence/STDDEV :  MSE = 0.000065 (all-stock model)
            Persistence/STDDEV : RMSE = 0.007793 (all-stock model)
            Persistence/STDDEV :  MSE = 0.000063 (per-stock model)
            Persistence/STDDEV : RMSE = 0.006233 (per-stock model)

       LinearRegression/STDDEV :  MSE = 0.000025 (all-stock model)
       LinearRegression/STDDEV : RMSE = 0.004800 (all-stock model)
       LinearRegression/STDDEV :  MSE = 0.000029 (per-stock model)
       LinearRegression/STDDEV : RMSE = 0.004160 (per-stock model)

        LinearRegression/RANGE :  MSE = 0.000261 (all-stock model)
        LinearRegression/RANGE : RMSE = 0.015983 (all-stock model)
        LinearRegression/RANGE :  MSE = 0.000328 (per-stock model)
        LinearRegression/RANGE : RMSE = 0.014146 (per-stock model)

         LinearRegression/BOTH :  MSE = 0.000025 (all-stock model)
         LinearRegression/BOTH : RMSE = 0.004773 (all-stock model)
         LinearRegression/BOTH :  MSE = 0.000037 (per-stock model)
         LinearRegression/BOTH : RMSE = 0.004857 (per-stock model)

Some interpretation of the above:

  • Unlike in [1], next-period volatility is better predicted by recent standard deviation data than by recent range data or the combination of both feature sets. It is possible I may have misunderstood proper construction of that feature.
  • Choice of metric is important to model selection. For the per-stock models, we average our metric over both folds and stocks. LinearRergression/STDDEV is the best model type, but whether all-stock or per-stock is "better" depends on whether we look at MSE or RMSE.

We will prodeed based on RMSE because this value is in the same units as the predicted volatility itself, thus forming the basis of the error estimate.

Final Model

Based on the above, we find that per-stock modeling using LinearRegression/STDDEV is the most effective. Thus we train one model per stock:

In [42]:
final_models = {
    stock: LinearRegression()
    for stock in STOCKS
}
for (stock, model) in final_models.items():
    model.fit(X_stddev[stocks_X==stock], Y[stocks_X==stock])

We will make our final predictions based on the final two months (40 trading days) of data for each stock:

In [43]:
final_X = {
    stock: np.array([monthly_stddev_returns[stock].values[-2*N:]])
    for stock in STOCKS
}

We briefly inspect the model coefficients to understand how time dependence has been extracted from the lookback window:

In [44]:
fig, ax = plt.subplots()
for stock in STOCKS:
    coef = final_models[stock].coef_
    coef = coef / np.max(np.abs(coef))
    ax.plot(coef, '.-', ls=':', label=stock)
ax.legend();

Error Estimation

We investigate how the RMSE scales with training size in order to estimate its value for the complete dataset:

In [45]:
train_sizes = np.r_[.1:.91:.1]
stocks_rmse = {stock: [] for stock in STOCKS}
for stock in STOCKS:
    for train_size in train_sizes:
        cv = ShuffleSplit(n_splits=10, train_size=train_size, random_state=37)
        mask = stocks_X == stock
        scores = -cross_val_score(model, X[mask], Y[mask], cv=cv, scoring='neg_mean_squared_error', )
        rmse = np.mean(np.sqrt(scores))
        stocks_rmse[stock].append(rmse)
In [46]:
fig, ax = plt.subplots()
for stock in STOCKS:
    ax.plot(train_sizes, stocks_rmse[stock], '.-', label=stock, alpha=.5)
ax.legend();

From this we see that the per-stock error estimates are well-behaved for train sizes >= 50% of the available time series. We'll use a simple fit to extrapolate the error when fitting on the entire time series:

In [47]:
train_sizes
Out[47]:
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
In [48]:
def expfit(x, A, x0, C):
    return A * np.exp(-x/x0) + C

stocks_params = {}
stocks_errs = {}

for stock in STOCKS:
    p0 = (.5, .5,.01)
    params, cov = optimize.curve_fit(expfit, train_sizes[-5:], stocks_rmse[stock][-5:], p0=p0)
    stocks_params[stock] = params
    stocks_errs[stock] = expfit(1, *params)

We check that this is well-behaved:

In [49]:
fig, ax = plt.subplots()
for (i,stock) in enumerate(STOCKS):
    params = stocks_params[stock]
    color = f'C{i}'
    ax.plot(train_sizes, stocks_rmse[stock], '.:', label=stock, alpha=.5, color=color)
    x = np.linspace(.5, 1, 100)
    ax.plot(x, expfit(x, *params), color=color)
    ax.plot(1, stocks_errs[stock], 'o', color=color)
ax.legend();

With the possible exception of being overly optimistic about stock "c", whose data is disrupted by an apparent crash of some sort around day ~150, these seem like reasonable estimates.

Final Predictions

We finally are ready to give volatility predictions and confidence intervals for each stock:

In [50]:
ypreds = {stock: final_models[stock].predict(final_X[stock])[0] for stock in STOCKS}

The monthly volatility is:

In [51]:
print(40*'-')
print('Monthly Volatility')
print(40*'-')
for stock in STOCKS:
    print('Stock "{}": {:.1%} +/- {:.1%}'.format(stock, ypreds[stock], stocks_errs[stock]))
----------------------------------------
Monthly Volatility
----------------------------------------
Stock "a": 4.1% +/- 0.2%
Stock "b": 6.8% +/- 0.7%
Stock "c": 4.4% +/- 0.7%
Stock "d": 3.7% +/- 0.2%
Stock "e": 4.8% +/- 0.3%
Stock "f": 5.0% +/- 0.3%

Finally, for the annualized volatility, we multiply by $\sqrt(12)$ (TODO: the internet says to use the sqrt, but why?)

In [52]:
root12 = np.sqrt(12)
print(40*'-')
print('Annualized Volatility')
print(40*'-')
for stock in STOCKS:
    print('Stock "{}": {:.1%} +/- {:.1%}'.format(
        stock, root12 * ypreds[stock], root12 * stocks_errs[stock]))
----------------------------------------
Annualized Volatility
----------------------------------------
Stock "a": 14.0% +/- 0.8%
Stock "b": 23.7% +/- 2.3%
Stock "c": 15.3% +/- 2.4%
Stock "d": 13.0% +/- 0.8%
Stock "e": 16.8% +/- 0.9%
Stock "f": 17.3% +/- 1.1%

Conclusion

Here we have made predictions of future annualized monthly stock volatility based on per-minute pricing data. To this end, we performed several key steps:

  • Cleaning: We removed seemingly bad data, and then used a simple forward-fill method to ensure values at each time step.
  • Smoothing: We remove artifacts that look like high frequency trading, as that is probably better modeled separately. Thus we focus on overall trends, not rapid fluctuations.
  • Down-sampling: After smoothing, we reduce the data to 252 daily opening prices.
  • Feature and target engineering: We impose definitions of monthly return and monthly volatility, and consider both recent volatility and recent price ranges as predictors of future volatility.
  • Model selection: We test all-stock and per-stock models based on recent volatility and recent price ranges, and we compare with a naive persistence model.
  • Error estimation: We use cross-validation on a range of train/test split sizes to estimate the uncertainties of the final models trained with the entire time series.

This is my first serious look at time series stock data, so I may very well have important blind spots. Where necessary, I've substituted intuition for experience.

For serious work, we would want to consider more sophisticated models, such as higher order regressions or perhaps even RNNs. We would also potentially need to weight compute resource requirements against metric performance.

In a production environment, the exploratory analysis laid out above would need to be encapsulated into a clean, likely object-oriented codebase that could easily be retrained reapplied to previously unseen data.