SIG Data Exercise¶

The goal of this exercise is to use per-minute pricing data from six stocks in order to estimate the volatility over the upcoming month, as well as to provide a measure of our confidence in that guess.

Results:

The results, obtained below, are summarized as:

----------------------------------------
Annualized Volatility
----------------------------------------
Stock "a": 14.0% +/- 0.8%
Stock "b": 23.7% +/- 2.3%
Stock "c": 15.3% +/- 2.4%
Stock "d": 13.0% +/- 0.8%
Stock "e": 16.8% +/- 0.9%
Stock "f": 17.3% +/- 1.1%

Contents:

Raw Data
Initial-Exploration
Data Cleaning
Returns and Volatility
Supervised Learning Setup
Model Selection
Final Model
Error Estimation
Final Predictions
Conclusions

import numpy as np
import pandas as pd
from scipy import ndimage, stats, optimize

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.metrics import mean_squared_error

%matplotlib inline

plt.rc('figure', facecolor='w', dpi=110)

Raw Data¶

We begin by reading the provided data and extracting time-of-day information:

data = pd.read_csv('stockdata3.zip')

data['hour'] = [int(t[:2]) for t in data['timestr']]
data['minute'] = [int(t[3:5]) for t in data['timestr']]
data['dayfrac'] = data['hour'] / 24 + data['minute'] / 60 / 24
data['time'] = data['day'] + data['dayfrac']

We check the basic structure of the data:

data.head()

data.tail()

data.describe()

The number of days and number of data points per day are:

np.unique(data['day']).size

252

np.unique(data['dayfrac']).size

391

We can see that data is only given on weekdays, and some weekends are long weekends:

np.unique(data['day'])[:30]

array([ 1,  2,  3,  4,  5,  8,  9, 10, 11, 12, 16, 17, 18, 19, 22, 23, 24,
       25, 26, 29, 30, 31, 32, 33, 36, 37, 38, 39, 40, 43])

Initial Exploration¶

The raw data looks like so:

STOCKS = 'a b c d e f'.split()

def plot_timeseries(data, floor=None, axs=None, **kw):
    if axs is None:
        fig, axs = plt.subplots(3, 2, figsize=(8,5))
    else:
        fig = axs[0,0].figure
    for (stock, ax) in zip(STOCKS, axs.ravel()):
        ax.plot(data['time'], data[stock], lw=1, **kw)
        ax.set_title(stock)
        if floor is not None:
            ax.set_ylim(floor)
    plt.tight_layout()
    return fig, axs

plot_timeseries(data);

def plot_hist(data):
    fig, axs = plt.subplots(2, 3, figsize=(9,5))
    for (stock, ax) in zip(STOCKS, axs.ravel()):
        ax.hist(data[stock], bins=20)
        ax.set_title(stock)
    plt.tight_layout()
    
plot_hist(data)

/home/mike/.local/lib/python3.7/site-packages/numpy/lib/histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
/home/mike/.local/lib/python3.7/site-packages/numpy/lib/histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)

We can see that some samples are missing:

data.isna().sum()

day           0
timestr       0
a            71
b             0
c            31
d            18
e             0
f          1371
hour          0
minute        0
dayfrac       0
time          0
dtype: int64

We also see that for some stocks, a number of data points drop to zero or near-zero in an unrealistic way:

data.min()

day               1
timestr    09:30:00
a                 0
b             4.885
c            33.807
d                 1
e             38.55
f             9.365
hour              9
minute            0
dayfrac    0.395833
time        1.39583
dtype: object

Data Cleaning¶

We'll apply a simple cleaning where the unrealistic small values are treated as missing, and then missing data is forward-filled from the previous available data:

cleandata = data.copy()
for stock in STOCKS:
    col = cleandata[stock]
    cleandata[stock] = np.where(col > 1, col, np.nan)
cleandata.fillna(method='pad', inplace=True)

plot_timeseries(cleandata);

It may be instructive to take a closer look at the per-day data. Here's the first month:

dotkw = dict(ls='none', marker='.', ms=.5)

plot_timeseries(cleandata[cleandata['day'] <= 28], **dotkw);

And here's a zoom to the first couple of days:

plot_timeseries(cleandata[cleandata['day'] <= 4], **dotkw);

For some stocks, there is rapid variation, presumably from algorithmic high frequency trading, that does not reflect the overall trends. We will assume this could be separately modeled if needed, and so we remove this effect by smoothing with a Gaussian kernel with a width of one minute:

smoothdata = cleandata.copy()
for stock in STOCKS:
    smoothdata[stock] = ndimage.gaussian_filter(smoothdata[stock].values, 60) # 60 seconds

plot_timeseries(smoothdata, floor=0);

Again, we look at some closer zooms

for n in (28, 4):
    plot_timeseries(smoothdata[smoothdata['day'] <= n], **dotkw)

As a final sanity check, we compare the smoothed and unsmoothed data directly:

for n in (366, 28, 4):
    fig, axs = plot_timeseries(cleandata[cleandata['day'] <= n], alpha=.5, **dotkw)
    plot_timeseries(smoothdata[smoothdata['day'] <= n], axs=axs, color='C3', **dotkw);

Returns and Volatility¶

DISCLAIMER

I personally have very little prior knowledge of financial concepts such as volatility, so I will be making some potentially questionable assumptions in the following.

Approach

According to e.g. [1], volatility predictions are not greatly enhanced by using high frequency sampling. Therefore, we will simplify the problem to one regarding daily opening or closing prices.

We will not, however, follow that paper exactly. The authors work with log-returns and squared variance, but in my brief review these choices vary. I will be working with absolute fractional returns and RMS variance (i.e. standard deviation). The authors also recommend a number of choices for predictive features, whereas I will only use two choices I readily understand: RMS variance and range.

[1] Eric Ghyselsa, Pedro Santa-Clarab, Rossen Valkanov. 2006. "Predicting volatility: getting the most out of return data sampled at different frequencies", Journal of Econometrics. https://doi.org/10.1016/j.jeconom.2005.01.004

It's not obvious to me that either opening or closing prices are more appropriate. We can obtain and compare each series:

opening = smoothdata[(smoothdata['hour'].values == 9) & (smoothdata['minute'].values == 30)]
closing = smoothdata[(smoothdata['hour'].values == 16) & (smoothdata['minute'].values == 0)]
len(opening), len(closing)

(252, 251)

fig, axs = plot_timeseries(opening, **dotkw)
plot_timeseries(opening, axs=axs, **dotkw);

fig, ax = plt.subplots(figsize=(3,3))
ax.plot(opening[STOCKS].values[:-1].ravel(), closing[STOCKS].values.ravel(), '.', ms=.5)
lim = 0, opening[STOCKS].values.max()
ax.plot(lim, lim, 'k--', lw=.5, zorder=-10);

Opening and closing prices are generally well-correlated, so we will stick with just opening price.

Next we compute month-over-month fractional returns. For the sake of simplicity in this exercise, we approximate that a month is the same as 20 trading days. With knowledge of real calendar dates, it may be more appropriate to use that information explicitly.

N = 20  # assuming 5 days * 4 weeks
returns = pd.DataFrame({
    stock: opening[stock].values[N:] / opening[stock].values[:-N] - 1
    for stock in STOCKS
})
returns['starttime'] = opening['time'].values[:-N]
returns['endtime'] = returns['time'] = opening['time'].values[N:]

returns.head()

def plot_months(axs):
    """Helper function to add a monthly grid to timeseries plots."""
    for ax in np.ravel(axs):
        for d in range(0, 366, 28):
            ax.axvline(d, color='k', ls=':', lw=.5, alpha=.5)

fig, axs = plot_timeseries(returns, **dotkw)
plot_months(axs)

In this approach, returns are assigned to the end date of each month-long period, and thus we have no return information for the first month.

For our measure of historical volatility, we will use a rolling time window standard deviation. Here, too, we will use a month-long window.

monthly_stddev_returns = pd.DataFrame({
    stock: returns[stock].rolling(N).std()
    for stock in STOCKS
})
monthly_stddev_returns['time'] = returns['time'].values

fig, axs = plot_timeseries(monthly_stddev_returns, floor=0, **dotkw)
plot_months(axs)

From the above plots, we see that many stocks have some monthly oscillations in volatility -- depending on the reference point for start and end of the month -- that we can hope to model.

In [1], it was suggested that the readily-available range of returns, as opposed to the above standard deviation, is surprisingly effective for prediction. We compute that here:

monthly_range_returns = pd.DataFrame({
    stock: returns[stock].rolling(N).max() - returns[stock].rolling(N).min()
    for stock in STOCKS
})
monthly_range_returns['time'] = returns['time'].values

fig, axs = plot_timeseries(monthly_range_returns, floor=0, **dotkw)
plot_months(axs)

Supervised Learning Setup¶

We will now frame this as a supervised forecasting problem:

Input features: historical volatility over the last two months (approximated as 40 trading days)
Output prediction: implied volatility estimate over the next month

For supervised learning, we want an array X with shape (n_samples, n_features). First, we quickly validate the relevant list comprehension:

np.array([
    (i, stock)
    for i in range(4*N, len(monthly_stddev_returns))
    for stock in STOCKS
])

array([['80', 'a'],
       ['80', 'b'],
       ['80', 'c'],
       ...,
       ['231', 'd'],
       ['231', 'e'],
       ['231', 'f']], dtype='<U21')

Next we construct predictive inputs, as well as target variable Y, from the historical data.

X_stddev = np.array([
    monthly_stddev_returns[stock].values[i-2*N:i]
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])
X_range = np.array([
    monthly_range_returns[stock].values[i-2*N:i]
    for i in range(4*N, len(monthly_range_returns)) for stock in STOCKS
])
X_both = np.array([
    np.r_[
        monthly_stddev_returns[stock].values[i-2*N:i],
        monthly_range_returns[stock].values[i-2*N:i]]
    for i in range(4*N, len(monthly_range_returns)) for stock in STOCKS
])
Y = np.array([
    monthly_stddev_returns[stock].values[i]
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])

We will also consider modeling each stock separately:

stocks_X = np.array([
    stock
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])

Model Selection¶

Our primary model will be a straightforward linear regression. We will also consider a trivial persistence model, for which the prediction is simply the most recent input value:

class PersistenceRegression:
    def fit(self, X, y):
        return
    def get_params(self, *a, **kw):
        return {}
    def predict(self, X):
        return X[:,-1]

We will test combinations of model types and feature types:

models = [
    ('Persistence', PersistenceRegression(), 'STDDEV', X_stddev),
    ('LinearRegression', LinearRegression(), 'STDDEV', X_stddev),
    ('LinearRegression', LinearRegression(), 'RANGE', X_range),
    ('LinearRegression', LinearRegression(), 'BOTH', X_both),
]

Our model evaluation will be based on RMSE under cross validation with many overlapping folds and a small test set. We take this approach because our final model will be trained on the entire available time series, so we'd like our evaluation to reflect the performance closely. We will also display MSE to understand the impact of metric choice.

cv = ShuffleSplit(n_splits=30, train_size=0.9, random_state=37)
for (kind, model, features, X) in models:
    scores = -cross_val_score(model, X, Y, cv=cv, scoring='neg_mean_squared_error', )
    mse = np.mean(scores)
    rmse = np.mean(np.sqrt(scores))
    print('{:>30s} :  MSE = {:.6f} (all-stock model)'.format(f'{kind}/{features}', mse))
    print('{:>30s} : RMSE = {:.6f} (all-stock model)'.format(f'{kind}/{features}', rmse))
    scores = np.array([
        -cross_val_score(model, X[stocks_X==stock], Y[stocks_X==stock],
                         cv=cv, scoring='neg_mean_squared_error')
        for stock in STOCKS])
    mse = np.mean(scores)
    rmse = np.mean(np.sqrt(scores))
    print('{:>30s} :  MSE = {:.6f} (per-stock model)'.format(f'{kind}/{features}', mse))
    print('{:>30s} : RMSE = {:.6f} (per-stock model)'.format(f'{kind}/{features}', rmse))
    print()

            Persistence/STDDEV :  MSE = 0.000065 (all-stock model)
            Persistence/STDDEV : RMSE = 0.007793 (all-stock model)
            Persistence/STDDEV :  MSE = 0.000063 (per-stock model)
            Persistence/STDDEV : RMSE = 0.006233 (per-stock model)

       LinearRegression/STDDEV :  MSE = 0.000025 (all-stock model)
       LinearRegression/STDDEV : RMSE = 0.004800 (all-stock model)
       LinearRegression/STDDEV :  MSE = 0.000029 (per-stock model)
       LinearRegression/STDDEV : RMSE = 0.004160 (per-stock model)

        LinearRegression/RANGE :  MSE = 0.000261 (all-stock model)
        LinearRegression/RANGE : RMSE = 0.015983 (all-stock model)
        LinearRegression/RANGE :  MSE = 0.000328 (per-stock model)
        LinearRegression/RANGE : RMSE = 0.014146 (per-stock model)

         LinearRegression/BOTH :  MSE = 0.000025 (all-stock model)
         LinearRegression/BOTH : RMSE = 0.004773 (all-stock model)
         LinearRegression/BOTH :  MSE = 0.000037 (per-stock model)
         LinearRegression/BOTH : RMSE = 0.004857 (per-stock model)

Some interpretation of the above:

Unlike in [1], next-period volatility is better predicted by recent standard deviation data than by recent range data or the combination of both feature sets. It is possible I may have misunderstood proper construction of that feature.
Choice of metric is important to model selection. For the per-stock models, we average our metric over both folds and stocks. LinearRergression/STDDEV is the best model type, but whether all-stock or per-stock is "better" depends on whether we look at MSE or RMSE.

We will prodeed based on RMSE because this value is in the same units as the predicted volatility itself, thus forming the basis of the error estimate.

Final Model¶

Based on the above, we find that per-stock modeling using LinearRegression/STDDEV is the most effective. Thus we train one model per stock:

final_models = {
    stock: LinearRegression()
    for stock in STOCKS
}
for (stock, model) in final_models.items():
    model.fit(X_stddev[stocks_X==stock], Y[stocks_X==stock])

We will make our final predictions based on the final two months (40 trading days) of data for each stock:

final_X = {
    stock: np.array([monthly_stddev_returns[stock].values[-2*N:]])
    for stock in STOCKS
}

We briefly inspect the model coefficients to understand how time dependence has been extracted from the lookback window:

fig, ax = plt.subplots()
for stock in STOCKS:
    coef = final_models[stock].coef_
    coef = coef / np.max(np.abs(coef))
    ax.plot(coef, '.-', ls=':', label=stock)
ax.legend();

Error Estimation¶

We investigate how the RMSE scales with training size in order to estimate its value for the complete dataset:

train_sizes = np.r_[.1:.91:.1]
stocks_rmse = {stock: [] for stock in STOCKS}
for stock in STOCKS:
    for train_size in train_sizes:
        cv = ShuffleSplit(n_splits=10, train_size=train_size, random_state=37)
        mask = stocks_X == stock
        scores = -cross_val_score(model, X[mask], Y[mask], cv=cv, scoring='neg_mean_squared_error', )
        rmse = np.mean(np.sqrt(scores))
        stocks_rmse[stock].append(rmse)

fig, ax = plt.subplots()
for stock in STOCKS:
    ax.plot(train_sizes, stocks_rmse[stock], '.-', label=stock, alpha=.5)
ax.legend();

From this we see that the per-stock error estimates are well-behaved for train sizes >= 50% of the available time series. We'll use a simple fit to extrapolate the error when fitting on the entire time series:

train_sizes

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

def expfit(x, A, x0, C):
    return A * np.exp(-x/x0) + C

stocks_params = {}
stocks_errs = {}

for stock in STOCKS:
    p0 = (.5, .5,.01)
    params, cov = optimize.curve_fit(expfit, train_sizes[-5:], stocks_rmse[stock][-5:], p0=p0)
    stocks_params[stock] = params
    stocks_errs[stock] = expfit(1, *params)

We check that this is well-behaved:

fig, ax = plt.subplots()
for (i,stock) in enumerate(STOCKS):
    params = stocks_params[stock]
    color = f'C{i}'
    ax.plot(train_sizes, stocks_rmse[stock], '.:', label=stock, alpha=.5, color=color)
    x = np.linspace(.5, 1, 100)
    ax.plot(x, expfit(x, *params), color=color)
    ax.plot(1, stocks_errs[stock], 'o', color=color)
ax.legend();

With the possible exception of being overly optimistic about stock "c", whose data is disrupted by an apparent crash of some sort around day ~150, these seem like reasonable estimates.

Final Predictions¶

We finally are ready to give volatility predictions and confidence intervals for each stock:

ypreds = {stock: final_models[stock].predict(final_X[stock])[0] for stock in STOCKS}

The monthly volatility is:

print(40*'-')
print('Monthly Volatility')
print(40*'-')
for stock in STOCKS:
    print('Stock "{}": {:.1%} +/- {:.1%}'.format(stock, ypreds[stock], stocks_errs[stock]))

----------------------------------------
Monthly Volatility
----------------------------------------
Stock "a": 4.1% +/- 0.2%
Stock "b": 6.8% +/- 0.7%
Stock "c": 4.4% +/- 0.7%
Stock "d": 3.7% +/- 0.2%
Stock "e": 4.8% +/- 0.3%
Stock "f": 5.0% +/- 0.3%

Finally, for the annualized volatility, we multiply by $\sqrt(12)$ (TODO: the internet says to use the sqrt, but why?)

root12 = np.sqrt(12)
print(40*'-')
print('Annualized Volatility')
print(40*'-')
for stock in STOCKS:
    print('Stock "{}": {:.1%} +/- {:.1%}'.format(
        stock, root12 * ypreds[stock], root12 * stocks_errs[stock]))

----------------------------------------
Annualized Volatility
----------------------------------------
Stock "a": 14.0% +/- 0.8%
Stock "b": 23.7% +/- 2.3%
Stock "c": 15.3% +/- 2.4%
Stock "d": 13.0% +/- 0.8%
Stock "e": 16.8% +/- 0.9%
Stock "f": 17.3% +/- 1.1%

Conclusion¶

Here we have made predictions of future annualized monthly stock volatility based on per-minute pricing data. To this end, we performed several key steps:

Cleaning: We removed seemingly bad data, and then used a simple forward-fill method to ensure values at each time step.
Smoothing: We remove artifacts that look like high frequency trading, as that is probably better modeled separately. Thus we focus on overall trends, not rapid fluctuations.
Down-sampling: After smoothing, we reduce the data to 252 daily opening prices.
Feature and target engineering: We impose definitions of monthly return and monthly volatility, and consider both recent volatility and recent price ranges as predictors of future volatility.
Model selection: We test all-stock and per-stock models based on recent volatility and recent price ranges, and we compare with a naive persistence model.
Error estimation: We use cross-validation on a range of train/test split sizes to estimate the uncertainties of the final models trained with the entire time series.

This is my first serious look at time series stock data, so I may very well have important blind spots. Where necessary, I've substituted intuition for experience.

For serious work, we would want to consider more sophisticated models, such as higher order regressions or perhaps even RNNs. We would also potentially need to weight compute resource requirements against metric performance.

In a production environment, the exploratory analysis laid out above would need to be encapsulated into a clean, likely object-oriented codebase that could easily be retrained reapplied to previously unseen data.

	day	timestr	a	b	c	d	e	f	hour	minute	dayfrac	time
0	1	09:30:00	325.450	13.795	94.500	49.985	49.93	17.025	9	30	0.395833	1.395833
1	1	09:31:00	325.245	13.890	94.515	49.990	49.96	17.025	9	31	0.396528	1.396528
2	1	09:32:00	325.580	13.905	94.565	49.995	49.96	17.025	9	32	0.397222	1.397222
3	1	09:33:00	325.470	13.955	94.645	50.065	49.92	17.025	9	33	0.397917	1.397917
4	1	09:34:00	325.295	13.975	94.580	50.030	49.90	17.025	9	34	0.398611	1.398611

	day	timestr	a	b	c	d	e	f	hour	minute	dayfrac	time
98347	362	15:56:00	404.825	5.515	45.352	45.305	43.57	10.745	15	56	0.663889	362.663889
98348	362	15:57:00	404.815	5.615	45.362	45.795	43.61	10.745	15	57	0.664583	362.664583
98349	362	15:58:00	404.915	5.625	45.352	45.805	43.63	10.755	15	58	0.665278	362.665278
98350	362	15:59:00	404.975	5.615	45.333	45.795	43.62	10.755	15	59	0.665972	362.665972
98351	362	16:00:00	404.995	5.605	45.303	45.755	43.65	10.755	16	0	0.666667	362.666667

	day	a	b	c	d	e	f	hour	minute	dayfrac	time
count	98352.00000	98281.000000	98352.000000	98321.000000	98334.000000	98352.000000	96981.000000	98352.000000	98352.000000	98352.000000	98352.000000
mean	181.29732	363.767737	10.048836	69.445570	49.803747	47.727738	13.204838	12.237158	30.577416	0.531116	181.828436
std	104.61705	28.281333	3.332643	28.675762	4.539923	7.053875	2.404705	1.894421	17.326864	0.078377	104.616893
min	1.00000	0.000000	4.885000	33.807000	1.000000	38.550000	9.365000	9.000000	0.000000	0.395833	1.395833
25%	89.00000	341.785000	6.605000	45.330000	46.175000	42.040000	10.945000	11.000000	16.000000	0.463194	89.635937
50%	180.00000	357.915000	10.595000	50.960000	50.105000	45.410000	13.325000	12.000000	32.000000	0.531250	180.604514
75%	271.00000	387.890000	13.465000	101.905000	53.735000	51.350000	15.035000	14.000000	46.000000	0.599306	271.573090
max	362.00000	426.560000	15.335000	116.385000	62.137000	69.210000	18.815000	16.000000	59.000000	0.666667	362.666667

	a	b	c	d	e	f	starttime	endtime	time
0	0.038173	-0.007060	0.023880	0.013695	0.052411	-0.065324	1.395833	30.395833	30.395833
1	0.044529	0.007673	0.041766	0.033276	0.050579	-0.075496	2.395833	31.395833	31.395833
2	0.036435	-0.000143	0.056578	0.038648	-0.025144	-0.109901	3.395833	32.395833	32.395833
3	0.030337	-0.015647	0.049419	0.034389	-0.027135	-0.134236	4.395833	33.395833	33.395833
4	0.043979	-0.002551	0.061552	0.058458	0.002142	-0.130663	5.395833	36.395833	36.395833