The goal of this exercise is to use per-minute pricing data from six stocks to estimate each stock's volatility over the upcoming month, and to provide a measure of our confidence in that estimate.
Results:
The analysis below yields the following estimates:
----------------------------------------
Annualized Volatility
----------------------------------------
Stock "a": 14.0% +/- 0.8%
Stock "b": 23.7% +/- 2.3%
Stock "c": 15.3% +/- 2.4%
Stock "d": 13.0% +/- 0.8%
Stock "e": 16.8% +/- 0.9%
Stock "f": 17.3% +/- 1.1%
import numpy as np
import pandas as pd
from scipy import ndimage, stats, optimize
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.metrics import mean_squared_error
%matplotlib inline
plt.rc('figure', facecolor='w', dpi=110)
We begin by reading the provided data and extracting time-of-day information:
data = pd.read_csv('stockdata3.zip')
data['hour'] = [int(t[:2]) for t in data['timestr']]
data['minute'] = [int(t[3:5]) for t in data['timestr']]
data['dayfrac'] = data['hour'] / 24 + data['minute'] / 60 / 24
data['time'] = data['day'] + data['dayfrac']
We check the basic structure of the data:
data.head()
data.tail()
data.describe()
The number of days and number of data points per day are:
np.unique(data['day']).size
np.unique(data['dayfrac']).size
We can see that data is only given on weekdays, and some weekends are long weekends:
np.unique(data['day'])[:30]
The raw data looks like so:
STOCKS = 'a b c d e f'.split()
def plot_timeseries(data, floor=None, axs=None, **kw):
    """Plot each stock's series on a 3x2 grid, optionally reusing existing axes."""
    if axs is None:
        fig, axs = plt.subplots(3, 2, figsize=(8, 5))
    else:
        fig = axs[0, 0].figure
    for (stock, ax) in zip(STOCKS, axs.ravel()):
        ax.plot(data['time'], data[stock], lw=1, **kw)
        ax.set_title(stock)
        if floor is not None:
            ax.set_ylim(bottom=floor)  # clamp the lower y-limit (e.g. to zero)
    plt.tight_layout()
    return fig, axs
plot_timeseries(data);
def plot_hist(data):
    """Histogram of price values for each stock."""
    fig, axs = plt.subplots(2, 3, figsize=(9, 5))
    for (stock, ax) in zip(STOCKS, axs.ravel()):
        ax.hist(data[stock], bins=20)
        ax.set_title(stock)
    plt.tight_layout()
plot_hist(data)
We can see that some samples are missing:
data.isna().sum()
We also see that for some stocks, a number of data points drop to zero or near-zero in an unrealistic way:
data.min()
We'll apply a simple cleaning step: the unrealistically small values are treated as missing, and missing data is then forward-filled from the most recent available value:
cleandata = data.copy()
for stock in STOCKS:
    # Treat implausibly small prices (<= 1) as missing.
    col = cleandata[stock]
    cleandata[stock] = np.where(col > 1, col, np.nan)
# Forward-fill missing values from the previous available sample.
cleandata.ffill(inplace=True)
plot_timeseries(cleandata);
It may be instructive to take a closer look at the per-day data. Here's the first month:
dotkw = dict(ls='none', marker='.', ms=.5)
plot_timeseries(cleandata[cleandata['day'] <= 28], **dotkw);
And here's a zoom to the first couple of days:
plot_timeseries(cleandata[cleandata['day'] <= 4], **dotkw);
For some stocks, there is rapid variation, presumably from algorithmic high-frequency trading, that does not reflect the overall trends. We will assume this could be separately modeled if needed, and so we remove this effect by smoothing with a Gaussian kernel with a sigma of 60 samples:
smoothdata = cleandata.copy()
for stock in STOCKS:
    smoothdata[stock] = ndimage.gaussian_filter(smoothdata[stock].values, 60)  # sigma = 60 samples
plot_timeseries(smoothdata, floor=0);
Again, we look at some closer zooms:
for n in (28, 4):
    plot_timeseries(smoothdata[smoothdata['day'] <= n], **dotkw)
As a final sanity check, we compare the smoothed and unsmoothed data directly:
for n in (366, 28, 4):
    fig, axs = plot_timeseries(cleandata[cleandata['day'] <= n], alpha=.5, **dotkw)
    plot_timeseries(smoothdata[smoothdata['day'] <= n], axs=axs, color='C3', **dotkw)
DISCLAIMER
I personally have very little prior knowledge of financial concepts such as volatility, so I will be making some potentially questionable assumptions in the following.
Approach
According to e.g. [1], volatility predictions are not greatly enhanced by using high frequency sampling. Therefore, we will simplify the problem to one regarding daily opening or closing prices.
We will not, however, follow that paper exactly. The authors work with log-returns and squared variance, but in my brief review these conventions vary across the literature. I will be working with absolute fractional returns and the standard deviation (RMS variation) of returns. The authors also recommend a number of choices for predictive features, whereas I will use only two that I readily understand: the standard deviation and the range of returns.
[1] Eric Ghysels, Pedro Santa-Clara, Rossen Valkanov. 2006. "Predicting volatility: getting the most out of return data sampled at different frequencies." Journal of Econometrics. https://doi.org/10.1016/j.jeconom.2005.01.004
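To make the distinction concrete, here is a minimal sketch (with made-up prices; not part of the analysis) contrasting the two return definitions. For small moves they nearly coincide:
p = np.array([100.0, 101.0, 99.5, 102.0])  # hypothetical price series
frac_returns = p[1:] / p[:-1] - 1  # fractional returns, used below
log_returns = np.diff(np.log(p))   # log-returns, used in [1]
print(frac_returns)
print(log_returns)  # nearly identical to frac_returns for small moves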
It's not obvious to me that either opening or closing prices are more appropriate. We can obtain and compare each series:
opening = smoothdata[(smoothdata['hour'].values == 9) & (smoothdata['minute'].values == 30)]   # 9:30 market open
closing = smoothdata[(smoothdata['hour'].values == 16) & (smoothdata['minute'].values == 0)]   # 16:00 market close
len(opening), len(closing)
fig, axs = plot_timeseries(opening, **dotkw)
plot_timeseries(closing, axs=axs, **dotkw);
fig, ax = plt.subplots(figsize=(3, 3))
# Scatter same-day opening vs. closing prices; opening has one extra row, so drop it to align.
ax.plot(opening[STOCKS].values[:-1].ravel(), closing[STOCKS].values.ravel(), '.', ms=.5)
lim = 0, opening[STOCKS].values.max()
ax.plot(lim, lim, 'k--', lw=.5, zorder=-10);  # y = x reference line
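As a quick numeric check (an illustrative sketch, not in the original analysis, using the same alignment as the scatter above), we can compute the pooled correlation coefficient:
o = opening[STOCKS].values[:-1].ravel()
c = closing[STOCKS].values.ravel()
print(np.corrcoef(o, c)[0, 1])  # expect a value close to 1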
Opening and closing prices are generally well-correlated, so we will stick with just opening price.
Next we compute month-over-month fractional returns. For simplicity in this exercise, we approximate a month as 20 trading days. With knowledge of real calendar dates, it may be more appropriate to use that information explicitly.
N = 20  # assuming 5 trading days * 4 weeks per month
returns = pd.DataFrame({
    stock: opening[stock].values[N:] / opening[stock].values[:-N] - 1
    for stock in STOCKS
})
returns['starttime'] = opening['time'].values[:-N]
returns['endtime'] = returns['time'] = opening['time'].values[N:]  # label each return by its end date
returns.head()
def plot_months(axs):
    """Helper function to add a monthly grid to timeseries plots."""
    for ax in np.ravel(axs):
        for d in range(0, 366, 28):
            ax.axvline(d, color='k', ls=':', lw=.5, alpha=.5)
fig, axs = plot_timeseries(returns, **dotkw)
plot_months(axs)
In this approach, returns are assigned to the end date of each month-long period, and thus we have no return information for the first month.
For our measure of historical volatility, we will use a rolling time window standard deviation. Here, too, we will use a month-long window.
monthly_stddev_returns = pd.DataFrame({
    stock: returns[stock].rolling(N).std()
    for stock in STOCKS
})
monthly_stddev_returns['time'] = returns['time'].values
fig, axs = plot_timeseries(monthly_stddev_returns, floor=0, **dotkw)
plot_months(axs)
From the above plots, we see that many stocks have some monthly oscillations in volatility -- depending on the reference point for start and end of the month -- that we can hope to model.
In [1], it was suggested that the readily-available range of returns, as opposed to the above standard deviation, is surprisingly effective for prediction. We compute that here:
monthly_range_returns = pd.DataFrame({
    stock: returns[stock].rolling(N).max() - returns[stock].rolling(N).min()
    for stock in STOCKS
})
monthly_range_returns['time'] = returns['time'].values
fig, axs = plot_timeseries(monthly_range_returns, floor=0, **dotkw)
plot_months(axs)
We will now frame this as a supervised forecasting problem. For supervised learning, we want an array $X$ with shape (n_samples, n_features). First, we quickly validate the relevant list comprehension:
np.array([
    (i, stock)
    for i in range(4*N, len(monthly_stddev_returns))
    for stock in STOCKS
])
Next we construct the predictive inputs, as well as the target variable $Y$, from the historical data.
# Each sample is a two-month (2N-day) lookback window of a rolling-volatility feature.
X_stddev = np.array([
    monthly_stddev_returns[stock].values[i-2*N:i]
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])
X_range = np.array([
    monthly_range_returns[stock].values[i-2*N:i]
    for i in range(4*N, len(monthly_range_returns)) for stock in STOCKS
])
X_both = np.array([
    np.r_[
        monthly_stddev_returns[stock].values[i-2*N:i],
        monthly_range_returns[stock].values[i-2*N:i]]
    for i in range(4*N, len(monthly_range_returns)) for stock in STOCKS
])
# Target: the rolling monthly volatility at time index i.
Y = np.array([
    monthly_stddev_returns[stock].values[i]
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])
We will also consider modeling each stock separately:
# Stock label for each row of X, used to slice out per-stock subsets.
stocks_X = np.array([
    stock
    for i in range(4*N, len(monthly_stddev_returns)) for stock in STOCKS
])
Our primary model will be a straightforward linear regression. We will also consider a trivial persistence model, for which the prediction is simply the most recent input value:
class PersistenceRegression:
    """Trivial baseline: predict the most recent value in the lookback window."""
    def fit(self, X, y):
        return self  # nothing to learn
    def get_params(self, *a, **kw):
        return {}  # no hyperparameters; needed so sklearn's CV machinery can clone this
    def predict(self, X):
        return X[:, -1]
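As a quick sanity check (illustrative values only), the persistence prediction is simply the last column of the input window:
demo_X = np.array([[0.10, 0.20, 0.30],
                   [0.40, 0.50, 0.60]])
print(PersistenceRegression().predict(demo_X))  # -> [0.3 0.6]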
We will test combinations of model types and feature types:
models = [
    ('Persistence', PersistenceRegression(), 'STDDEV', X_stddev),
    ('LinearRegression', LinearRegression(), 'STDDEV', X_stddev),
    ('LinearRegression', LinearRegression(), 'RANGE', X_range),
    ('LinearRegression', LinearRegression(), 'BOTH', X_both),
]
Our model evaluation will be based on RMSE under cross-validation with many overlapping folds and a small test set. We take this approach because our final model will be trained on the entire available time series, so we'd like our evaluation to reflect that setting as closely as possible. We will also display MSE to understand the impact of the metric choice.
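To illustrate the scheme (a toy sketch with 20 dummy samples; the sizes and seed here are arbitrary), each split randomly holds out a small test set, and the folds overlap heavily:
demo_cv = ShuffleSplit(n_splits=3, train_size=0.9, random_state=0)
for train_idx, test_idx in demo_cv.split(np.arange(20)):
    print(len(train_idx), 'train /', len(test_idx), 'test, test indices:', test_idx)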
cv = ShuffleSplit(n_splits=30, train_size=0.9, random_state=37)
for (kind, model, features, X) in models:
    # Pooled model: a single fit across all stocks.
    scores = -cross_val_score(model, X, Y, cv=cv, scoring='neg_mean_squared_error')
    mse = np.mean(scores)
    rmse = np.mean(np.sqrt(scores))
    print('{:>30s} : MSE = {:.6f} (all-stock model)'.format(f'{kind}/{features}', mse))
    print('{:>30s} : RMSE = {:.6f} (all-stock model)'.format(f'{kind}/{features}', rmse))
    # Per-stock models: one fit per stock, scores pooled across stocks.
    scores = np.array([
        -cross_val_score(model, X[stocks_X == stock], Y[stocks_X == stock],
                         cv=cv, scoring='neg_mean_squared_error')
        for stock in STOCKS])
    mse = np.mean(scores)
    rmse = np.mean(np.sqrt(scores))
    print('{:>30s} : MSE = {:.6f} (per-stock model)'.format(f'{kind}/{features}', mse))
    print('{:>30s} : RMSE = {:.6f} (per-stock model)'.format(f'{kind}/{features}', rmse))
    print()
Some interpretation of the above:
We will proceed based on RMSE because this value is in the same units as the predicted volatility itself (an MSE of 0.0001 corresponds to an RMSE of 0.01, i.e. one percentage point of volatility), so it can serve directly as the basis of our error estimate.
Based on the above, we find that per-stock modeling using LinearRegression/STDDEV is the most effective. Thus we train one model per stock:
final_models = {
    stock: LinearRegression()
    for stock in STOCKS
}
for (stock, model) in final_models.items():
    model.fit(X_stddev[stocks_X == stock], Y[stocks_X == stock])
We will make our final predictions based on the final two months (40 trading days) of data for each stock:
final_X = {
    stock: np.array([monthly_stddev_returns[stock].values[-2*N:]])  # shape (1, 2N)
    for stock in STOCKS
}
We briefly inspect the model coefficients to understand how time dependence has been extracted from the lookback window:
fig, ax = plt.subplots()
for stock in STOCKS:
    coef = final_models[stock].coef_
    coef = coef / np.max(np.abs(coef))  # normalize for comparison across stocks
    ax.plot(coef, '.-', ls=':', label=stock)
ax.legend();
We investigate how the RMSE scales with training size in order to estimate its value for the complete dataset:
train_sizes = np.r_[.1:.91:.1]  # train fractions from 10% to 90%
stocks_rmse = {stock: [] for stock in STOCKS}
for stock in STOCKS:
    for train_size in train_sizes:
        cv = ShuffleSplit(n_splits=10, train_size=train_size, random_state=37)
        mask = stocks_X == stock
        # Use the selected configuration (per-stock LinearRegression on the
        # STDDEV features) rather than leftover loop variables from above.
        scores = -cross_val_score(LinearRegression(), X_stddev[mask], Y[mask],
                                  cv=cv, scoring='neg_mean_squared_error')
        rmse = np.mean(np.sqrt(scores))
        stocks_rmse[stock].append(rmse)
fig, ax = plt.subplots()
for stock in STOCKS:
    ax.plot(train_sizes, stocks_rmse[stock], '.-', label=stock, alpha=.5)
ax.legend();
From this we see that the per-stock error estimates are well-behaved for train sizes >= 50% of the available time series. We'll use a simple fit to extrapolate the error when fitting on the entire time series:
train_sizes
def expfit(x, A, x0, C):
    """Decaying exponential with offset, used to extrapolate RMSE to train_size = 1."""
    return A * np.exp(-x/x0) + C
stocks_params = {}
stocks_errs = {}
for stock in STOCKS:
    p0 = (.5, .5, .01)  # initial guess: amplitude, decay scale, asymptote
    params, cov = optimize.curve_fit(expfit, train_sizes[-5:], stocks_rmse[stock][-5:], p0=p0)
    stocks_params[stock] = params
    stocks_errs[stock] = expfit(1, *params)  # extrapolated RMSE at train_size = 1
We check that this is well-behaved:
fig, ax = plt.subplots()
for (i, stock) in enumerate(STOCKS):
    params = stocks_params[stock]
    color = f'C{i}'
    ax.plot(train_sizes, stocks_rmse[stock], '.:', label=stock, alpha=.5, color=color)
    x = np.linspace(.5, 1, 100)
    ax.plot(x, expfit(x, *params), color=color)
    ax.plot(1, stocks_errs[stock], 'o', color=color)
ax.legend();
With the possible exception of being overly optimistic about stock "c", whose data is disrupted by an apparent crash of some sort around day 150, these seem like reasonable estimates.
We are finally ready to give volatility predictions and confidence intervals for each stock:
ypreds = {stock: final_models[stock].predict(final_X[stock])[0] for stock in STOCKS}
The monthly volatility is:
print(40*'-')
print('Monthly Volatility')
print(40*'-')
for stock in STOCKS:
    print('Stock "{}": {:.1%} +/- {:.1%}'.format(stock, ypreds[stock], stocks_errs[stock]))
Finally, for the annualized volatility, we multiply by $\sqrt{12}$. The square root arises because, if monthly returns are roughly independent with equal variance, it is the variances (not the standard deviations) that add over time: the annual variance is 12 times the monthly variance.
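Explicitly, for independent monthly returns $r_i$ with common variance $\sigma_{\mathrm{month}}^2$:
$$\mathrm{Var}\Big(\sum_{i=1}^{12} r_i\Big) = \sum_{i=1}^{12} \mathrm{Var}(r_i) = 12\,\sigma_{\mathrm{month}}^2 \quad\Longrightarrow\quad \sigma_{\mathrm{year}} = \sqrt{12}\,\sigma_{\mathrm{month}}$$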
root12 = np.sqrt(12)
print(40*'-')
print('Annualized Volatility')
print(40*'-')
for stock in STOCKS:
    print('Stock "{}": {:.1%} +/- {:.1%}'.format(
        stock, root12 * ypreds[stock], root12 * stocks_errs[stock]))
Here we have made predictions of annualized volatility over the upcoming month based on per-minute pricing data. To this end, we performed several key steps:
- cleaned the raw series by masking unrealistic values and forward-filling gaps;
- smoothed out rapid high-frequency variation with a Gaussian kernel;
- reduced the series to daily opening prices and computed month-over-month fractional returns;
- derived rolling monthly standard-deviation and range features;
- compared persistence and linear-regression models, pooled and per-stock, under shuffled cross-validation;
- extrapolated the cross-validated RMSE to the full training set to obtain error bars;
- annualized the monthly predictions by a factor of $\sqrt{12}$.
This is my first serious look at time series stock data, so I may very well have important blind spots. Where necessary, I've substituted intuition for experience.
For serious work, we would want to consider more sophisticated models, such as higher-order regressions or perhaps even RNNs. We would also potentially need to weigh compute-resource requirements against metric performance.
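As one hypothetical example of such an extension (a sketch only; the pipeline and hyperparameters below are illustrative, not tuned or part of the original analysis), a quadratic regression could be tried by expanding the same lookback features:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # quadratic cross-terms of the lookback window
    Ridge(alpha=1.0),                                  # regularization for the enlarged feature set
)
mask = stocks_X == 'a'
scores = -cross_val_score(poly_model, X_stddev[mask], Y[mask],
                          cv=ShuffleSplit(n_splits=10, train_size=0.9, random_state=37),
                          scoring='neg_mean_squared_error')
print(np.mean(np.sqrt(scores)))  # compare against the linear model's RMSE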
In a production environment, the exploratory analysis laid out above would need to be encapsulated in a clean, likely object-oriented codebase that could easily be retrained and reapplied to previously unseen data.