A study of deep learning on time series data.
This blog is partly written for myself to remind me of past mistakes and lessons learned, and partly a summary of my capabilities.
My obsession with time series classification started with my master's thesis, where we explored whether sound localization could be performed using the microphone of a modern-day smartphone. To this end a neural network was trained on the cepstral coefficients of a sine sweep recorded with said smartphone. This sparked my fascination with deep learning (DL) on time series (TS), a field that is relatively new with a lot of active research.
Having graduated in lockdown, I decided to explore this new-to-me concept further and, as one does, I did so using financial market data. Financial data is quite (extremely) noisy and probably not the best to learn from, but learn I did, and some of those lessons are detailed below.
Below are some concepts I learned that relate not only to DL on time series, but also to R&D as a whole.
Overfitting happens when a model fits the train set so closely that it memorizes the dataset rather than learning the data-target relation. This results in the model returning gibberish when presented with new data. Use model checkpointing based on the loss of a validation set and check for consistent performance on different subsets of the data.
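As a minimal sketch of what that looks like with PyTorch Lightning (which I use elsewhere in this post), assuming the model logs a `val_loss` metric:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the weights that performed best on the validation set,
# rather than whatever the model looks like after the final epoch.
checkpoint = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)
trainer = Trainer(max_epochs=200, callbacks=[checkpoint])
# trainer.fit(model, datamodule=dm)  # model and datamodule defined elsewhere
```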
Classification imbalance occurs when the train set of a classification problem does not have an even distribution of all target classes. Any model trained on this data will have more exposure to a subset of the target labels and in turn over-predict this subset. Over- or under-sampling can help.
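One way to over-sample in PyTorch is a weighted sampler; a rough sketch (the labels here are made up):

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 0, 1, 0, 2])      # hypothetical, heavily skewed labels
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]           # rare classes get drawn more often

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```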
Fat tails are the regression equivalent of classification imbalance. Many machine learning techniques assume a normally distributed target and fail to adapt well when this is not the case. Quantile regression can help, as can pivoting to a classification problem using quantile classification. Sometimes a log transform can turn a non-normal target distribution into an approximately normal one.
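The quantile regression route mostly comes down to swapping the loss; the pinball loss is simple enough to write by hand (a sketch, not tied to any particular library):

```python
import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor, q: float = 0.9) -> torch.Tensor:
    """Quantile (pinball) loss: for q > 0.5, under-predicting the q-th quantile
    is penalized more heavily than over-predicting it."""
    error = target - pred
    return torch.mean(torch.maximum(q * error, (q - 1) * error))
```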
Time leaks occur any time data from the future directly or indirectly makes its way into the train set. An example could be using the central difference of a feature: its formula, (x_{t+1} - x_{t-1}) / 2, uses the value at the next timestamp, so the feature at time t already contains information from the future.
Last value forecasting is another error that can occur with regression problems. It entails that a model learns to forecast the current (last known) value instead of the future one, because simply repeating the most recent observation often already yields a deceptively low loss.
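A cheap sanity check is to compare the model against the naive persistence forecast; if it barely beats it, it has probably just learned to copy the last value. A sketch, with all arrays hypothetical:

```python
import numpy as np

def beats_persistence(y_true: np.ndarray, y_pred: np.ndarray, last_values: np.ndarray):
    """Compare the model's error against simply repeating the last observed value."""
    model_mae = np.mean(np.abs(y_true - y_pred))
    naive_mae = np.mean(np.abs(y_true - last_values))
    return model_mae, naive_mae   # the model should beat the naive baseline by a clear margin
```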
Not using walk-forward optimization but cross-validation instead. If you simply shuffle and randomly split a time series, you run the risk of creating a validation set that effectively contains duplicates of train-set samples. The train set might, for example, contain the slice covering timestamps t to t+n while the validation set contains the slice from t+1 to t+n+1; the two overlap almost entirely, so validation performance says little about generalization.
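scikit-learn ships a basic walk-forward split out of the box; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)                    # hypothetical, time-ordered features
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    # every validation fold lies strictly after its train fold in time
    assert train_idx.max() < val_idx.min()
```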
Don't spend too much time developing models. Instead, your time is better spent finding some review papers detailing the current state of the research field most similar to yours and comparing the best performing models in that field. At most, adapting existing architectures to your specific use case is fine. Developing your own models carries the risk of spending too much time coding and fixing bugs. Time is better spent on the data: on data exploration, data cleaning and data preprocessing, and on exploring how a data pipeline will behave in production and whether it is viable at all.
Of vital importance is choosing what to predict. Not only should a model be able to learn to predict it, the prediction must also be convertible into an action. At the end of the day the goal is to convert data, through machine learning, into an action, whatever it may be. It could be a buy, sell or do-nothing signal when talking about financial data, or it could be flagging (or not flagging) something for further investigation in the context of anomaly detection. It could be any number of things.
So whatever a model outputs needs to serve this purpose. First working out the prediction-to-action pipeline to check its viability can save a lot of time in the long run. Maybe what you thought to be a useful metric to predict turns out to be inadequate or insufficient.
In applied machine learning, model development should, in my opinion, take a backseat to pipeline development; from data to action. A pipeline could be data ingress to pre-processing to a model to a prediction to post-processing to an action. All steps need to be in place before any reviewing or testing can be done. Prototyping an entire pipeline first and then iteratively improving each step is a more time efficient approach in the long run compared to trying to perfect one aspect first.
In my experience, developing and training a model takes the longest. So if you first get your data input right, then work on getting a decent model, and only afterwards start turning predictions into actions, only to learn that whatever you wanted to do is unfeasible, you have just wasted a lot of time.
Problems you might encounter are speed (the data preprocessing and/or model might be too slow) and usability (already discussed in "Target choice matters").
Transfer learning is a powerful thing that can be used quite liberally when working with kernel-based models. In case your data has more or fewer input channels than the pretrained model, you can simply repeat or drop the weights of the model's input channels, and it will work just fine. You can also transfer a model from one domain to another. I once transferred a model trained on the CIFAR-10 dataset (trucks, dogs, ...) to a spectrogram dataset, and doing so resulted in a better model, faster, compared to training from random initialization.
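A sketch of that channel trick, here with a hypothetical torchvision ResNet (assuming a recent torchvision version); the same idea applies to any convolutional first layer:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

n_channels = 5                                    # channels in my data (hypothetical)
model = resnet18(weights="IMAGENET1K_V1")         # pretrained on 3-channel images

old_conv = model.conv1                            # weight shape: (64, 3, 7, 7)
new_conv = nn.Conv2d(n_channels, old_conv.out_channels, kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride, padding=old_conv.padding, bias=False)
with torch.no_grad():
    # repeat (or drop) the pretrained input-channel weights to match the new channel count
    new_conv.weight.copy_(old_conv.weight.repeat(1, n_channels // 3 + 1, 1, 1)[:, :n_channels])
model.conv1 = new_conv
```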
Analyse the data, the target and the model separately. Data preprocessing should not be fixed to one single model architecture, nor should an architecture be fixed to one target variable. Keep each component modular during development. In PyTorch Lightning this can be achieved by using the LightningDataModule and LightningModule.
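Roughly the split I mean, as a bare-bones sketch with dummy data and a dummy model:

```python
import torch
import pytorch_lightning as pl
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class SliceDataModule(pl.LightningDataModule):
    """Owns everything data related: loading, preprocessing, splitting (dummy data here)."""
    def setup(self, stage=None):
        x, y = torch.randn(512, 24), torch.randn(512, 1)   # hypothetical slices and targets
        self.train_set = TensorDataset(x[:400], y[:400])
        self.val_set = TensorDataset(x[400:], y[400:])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=64)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=64)

class Forecaster(pl.LightningModule):
    """Owns the architecture, loss and optimizer; knows nothing about where the data lives."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(24, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.mse_loss(self.net(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# pl.Trainer(max_epochs=5).fit(Forecaster(), datamodule=SliceDataModule())
```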
I want to get better at deep learning for regression. Specifically, I want to learn how to work with imbalanced target data and how to do extreme value prediction. The issue with the former is that not all models work well with non-normally distributed data. The challenge with the latter is that extreme values occur less often, so a model gravitates towards being good at predicting what occurs more often.
Time series forecasting comes with its own set of techniques that might differ slightly from other areas of DL. Note that this is all from my own experience and might not follow universally agreed upon best practices.
First off, there is the initial dataset. Make sure you understand how the data is aligned, especially when data is summarized per window. Is the timestamp identifying the window that of the start, middle or end of said window? Align all datasets to the same standard.
Data storage also used to be a point of concern, but nowadays I just use parquet files for offline work and QuestDB or in-memory caching when I need an online solution.
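With pandas, for example, the `label` and `closed` arguments of `resample` decide exactly this; a small sketch with made-up data:

```python
import pandas as pd

s = pd.Series(range(10), index=pd.date_range("2024-01-01", periods=10, freq="min"))

# Same aggregation, but the 5-minute window is stamped with its start vs. its end.
start_stamped = s.resample("5min", label="left", closed="left").mean()
end_stamped = s.resample("5min", label="right", closed="left").mean()
```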
Time series data is typically not independently and identically distributed (IID) which makes missing data imputation hard. I identify areas where any of the datasets used have missing values, potentially take a safety margin in both directions (past and future) and mask (exclude) those timestamps during model development.
Note that there are techniques to convert time series into images. If you are more familiar with image classification, then this might be an interesting approach. Multiple algorithms exist, but the general gist is to compare all points in time to each other, creating a matrix/image.
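A bare-bones version of the idea, without any dedicated library, is a recurrence-plot style matrix:

```python
import numpy as np

def recurrence_image(x: np.ndarray, eps=None) -> np.ndarray:
    """Compare every timestamp with every other one: |x_i - x_j|,
    optionally thresholded into a binary recurrence plot."""
    dist = np.abs(x[:, None] - x[None, :])            # (T, T) matrix, i.e. an image
    return dist if eps is None else (dist < eps).astype(float)

image = recurrence_image(np.sin(np.linspace(0, 8 * np.pi, 256)), eps=0.1)
```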
As usual, input scaling is needed when working with deep learning, but additionally, stationarity is something that needs to be taken into consideration.
With regard to scaling, there are many options. Per-feature standard or robust scaling can work, assuming feature magnitudes are relatively constant over time. If not, frequent retraining might be needed, or alternatively some type of online scaling can be used. The easiest form of online scaling is to scale every input sample/slice into the [0, 1] range. Another approach is standard scaling using the mean and standard deviation of past data only: maybe just that of the last input slice, maybe that of, for example, all of last week's data, or of the last 100 input slices, etc. This approach is easy to implement both in development and in production.
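That last variant in code, as a rough sketch (the choice of history is arbitrary here):

```python
import numpy as np

def scale_slice(current_slice: np.ndarray, history: np.ndarray) -> np.ndarray:
    """Standard-scale the current input slice using only statistics of past data,
    e.g. the last 100 slices worth of values, so no future information is used."""
    mu, sigma = history.mean(axis=0), history.std(axis=0) + 1e-8
    return (current_slice - mu) / sigma
```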
When it comes to stationarity there are again multiple options. One: don't bother, chances are that whatever happens within each individual input slice is stationary enough. This is very problem dependent of course, so if in doubt, or if needed, stationarity can be enforced by taking the first (or maybe even a second) order difference. Another interesting approach is fractional differentiation, which strikes a balance between keeping some memory (magnitude information) and making the series stationary. In practice a weighted sum of the original series is taken, using the weighting scheme w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k, with d the fractional differencing order (d = 1 recovers the ordinary first difference) and k the lag; the differenced value at time t is then the sum over k of w_k * x_{t-k}.
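In code, generating the weights is straightforward (a sketch; in practice the sum is truncated once the weights become negligible):

```python
import numpy as np

def fracdiff_weights(d: float, n: int) -> np.ndarray:
    """Weights for fractional differentiation of order d, newest value first."""
    w = [1.0]
    for k in range(1, n):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

# d = 1 recovers the ordinary first difference: weights [1, -1, 0, 0, ...]
print(fracdiff_weights(0.4, 5))   # [ 1.    -0.4   -0.12  -0.064 ...]
```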
Because time series problems can be quite large and I'm limited to my desktop computer, I care a great deal about fast model training. One plug-and-play solution that helps greatly is using AdaBelief as the optimizer. Another is a decreasing learning rate that starts quite large. Personally, I adapted PyTorch's cosine annealing with warm restarts learning rate scheduler to have a decreasing amplitude over time (shared in this GitHub repo), but other schedulers can work great too. A third tool that unfortunately isn't always available is transfer learning, as discussed in concepts. Do use it whenever you can find a pretrained model with an input format similar to yours.
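For reference, the stock versions of both (my decaying-amplitude variant of the scheduler lives in the repo mentioned above; this sketch assumes the adabelief_pytorch package):

```python
import torch
from adabelief_pytorch import AdaBelief   # assumption: the adabelief_pytorch package is installed

model = torch.nn.Linear(10, 1)             # stand-in for an actual model
optimizer = AdaBelief(model.parameters(), lr=1e-3)
# Warm restarts: the learning rate decays from lr towards zero, restarting every
# T_0 epochs, with each cycle twice as long as the previous one.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(70):
    # ... one epoch of training here ...
    scheduler.step()
```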
On the usage of models, I have two things to offer. Scale your action in relation to a model's confidence, and use a weighted ensemble of models. Regarding that first point, I must say that this once again depends on the problem itself, but whenever an action (the thing that happens as a result of a model's prediction) can be scaled, scale it in relation to the confidence a model displays. Secondly, there is value in combining the predictions of multiple models. In fact, InceptionTime, the current state of the art in time series classification, uses the average prediction of 5 models. An equal average of similar models should be taken, but when combining different model architectures it might pay to use an uneven weighting scheme where one model is valued more than another, e.g., if regime changes are expected.
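The ensembling itself is trivial; a sketch with made-up softmax outputs and hand-picked weights:

```python
import numpy as np

# Per-model class probabilities for one sample, e.g. from three different architectures.
preds = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
weights = np.array([0.4, 0.4, 0.2])        # trust the third architecture a bit less

ensemble = weights @ preds                  # weighted average prediction
ensemble /= ensemble.sum()                  # renormalize (already ~1 if weights sum to 1)
```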
The technical skills I have developed can be split into four broad categories.
Over the years, I have written a lot of code: mostly the code needed to process data and to train models, but also code that ran 24/7. Doing so, I got better at both kinds of programming.
I have experience both writing production code that can run 24/7 and writing the code needed for data analysis and model training.
I might have had a desire to apply deep learning to time series, but the fact of the matter was that my Python skills were limited and I did not know how the financial markets actually work. Needless to say, some studying was needed. I learned to learn from the internet: YouTube tutorials, blog posts, dedicated educational websites, academic papers, and plain old trial and error, a slow but effective tutor.
Importantly, I learned to be critical when interpreting the results of studies on market data, as there are many pitfalls that can make models seem better than they are. In short, the old adage rings true: if something seems too good to be true, it probably is. Academic papers and amateur blogs alike tend to inflate their results, knowingly or not.
I am good at familiarizing myself with new domains, researching techniques that might fit that domain and being critical of said techniques.
Good data exploration requires a lot of plotting, preferably with plots that pack in as much information as possible.
I am proficient in extracting meaningful insights from complex datasets.
The best way to analyse a model is to run it live. Unfortunately this is not always feasible nor a good idea.
Offline, the classic cross-validation or walk-forward validation are useful. Training and error metrics are also good tools to get a macro idea of a model's performance. Confusion matrices and scatter plots are great ways to get a visual representation of model performance and are typically also a good starting point for exploring edge cases: think false positives and false negatives, extreme values that are missed or average samples that are overpredicted.
Just make sure not to "mentally overfit" to the data. Be careful not to develop a bias of what a model should do and then finetune said model to that end.
To evaluate online performance, a simulation can be used: either feed offline data incrementally, or use live data in a development environment.
I am proficient in model evaluation, both at the macro and micro level.
Some frameworks and tools I have used and found interesting have already come up throughout this post: PyTorch Lightning, QuestDB, parquet and the AdaBelief optimizer.
One of my character flaws (and another reason to iterate quickly) is that I chain together assumptions. Assumptions in and of themselves are not necessarily bad, and there will always be some assumptions made about how the data behaves or what a model might learn, but the danger is to keep developing, without sufficient testing, on the premise that previous assumptions were right. Any initial misunderstanding will cascade down, making all further development worthless. When I learn a new concept I tend, once I think I have a good grasp of it, to start working out an entire roadmap of what to do when. If I then learn that my initial understanding was flawed or incomplete, that roadmap is wasted. Not only was all that thinking a waste of time, it is just not fun to discover that the entire system you spent time developing, even if just on paper, needs to be binned.
A second flaw of mine is hyperfocus, or perfectionism. I sometimes focus too much on things that are just not that important: I feel I have to fix that one edge case or make the code more modular before I can move on to what actually matters for creating a prototype.
On the flip side, I have come to learn that I am persistent, even after repeated failures; that I am creative, with ideas flowing naturally; and, something I have known for a while now, that I have an insatiable curiosity. Combine this with self-management and self-motivation, and I thrive when learning and progressing.
In summary, we went from LSTMs to TCNs to InceptionTime and then beyond.
LSTMs, or long short-term memory networks, were a first successful attempt at using deep learning on time series (TS) data. Over time they were found to be too complicated to train and a better alternative was needed. Shaojie Bai et al. proposed using temporal convolutional networks (TCN) in their landmark paper "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". Using 1D convolutions on time series data proved to be the way forward.
People also experimented with using 2D image classification techniques on TS data. Some simply plotted their data, as if taking a screenshot of a graph, others used techniques like recurrence plots or Markov transition fields to relate each point in time to all other points in time, making a 2D matrix that could be used with any existing 2D CNN architecture.
1D convolutions had some things going for them. Firstly, by only working in one dimension, the convolutional kernels used fewer weights than their 2D counterparts. A 2D kernel of size 3 (3x3) has 9 total weights that need to be learned (per kernel), one of size 5 results in a total of 25, and a kernel of size 7 leads to 49 total weights. Larger kernels can be used with 1D convolutions, e.g. k={9, 25, 49}, without incurring additional memory requirements, which leads to a greater receptive field. This receptive field proves to be vitally important with TS data.
It is impossible to know beforehand at which resolution relational information is held: is the way neighbouring timestamps interact informative, or is it how timestamps separated by, say, 3 hours interact, or both? To relate far-apart timestamps, a wide receptive field is needed; to discover interactions of varying sizes (varying numbers of timestamps), kernels of different sizes can be helpful.
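A quick sanity check of those counts (biases aside):

```python
import torch.nn as nn

conv2d = nn.Conv2d(1, 1, kernel_size=7, bias=False)   # 7x7 kernel
conv1d = nn.Conv1d(1, 1, kernel_size=49, bias=False)  # same weight count, far wider reach in time
print(conv2d.weight.numel(), conv1d.weight.numel())   # 49 and 49
```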
Other ways of increasing the receptive field, besides larger kernels, are dilation and depth. Increased dilation directly leads to a wider receptive field, while increased depth leads to a wider receptive field in the deeper layers: input timestamps that are originally far apart end up closer together in the derived representations of the deeper layers.
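For a simple stack of dilated convolutions, one convolution per layer with kernel size k and dilations doubling each layer, the receptive field works out to 1 + (k - 1) * (2^L - 1); a quick check:

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of a stack of 1D convolutions, one conv per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(3, [1]))            # 3: a single undilated conv sees 3 timestamps
print(receptive_field(3, [1, 2, 4, 8]))   # 31: four dilated layers already span 31 timestamps
```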
Then came InceptionTime, introduced in the boldly titled paper "InceptionTime: Finding AlexNet for time series classification" by Hassan Ismail Fawaz et al. This architecture uses kernels of differing sizes, combined with depth, to create a model that is still relevant today and tends to be used as a benchmark.
Interestingly, this architecture uses a bottleneck layer, which is a 1D convolutional layer with kernel size 1. This layer can also be thought of as a spatial layer that only looks at how the channels of a multivariate time series interact with each other at each timestamp. This spatial aspect presents a very interesting and still somewhat open question: how do we handle multivariate (or spatio-temporal) time series data? Here we are not only looking at how multiple variables change over time, but also at how they interact with each other at each time step, and how this interaction itself changes over time.
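The bottleneck itself is just this (channel counts picked arbitrarily):

```python
import torch
import torch.nn as nn

# A kernel-size-1 convolution mixes the channels of a multivariate series at each
# timestamp independently, here compressing 32 input channels down to 8.
bottleneck = nn.Conv1d(in_channels=32, out_channels=8, kernel_size=1)
x = torch.randn(4, 32, 128)          # (batch, channels, timestamps)
print(bottleneck(x).shape)           # torch.Size([4, 8, 128])
```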
Currently active research seems to revolve around fixed (i.e. non-learned) convolutional kernels, either randomly generated or handcrafted, as instigated by the paper "ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels" by Angus Dempster et al., and around continuous convolutions, where the kernels themselves are not learned directly but something else is trained to generate them. This last part might not be 100% correct, since it is a new-to-me concept and I haven't properly studied it yet.
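To give a flavour of the fixed-kernel idea, here is a heavily simplified ROCKET-style transform: random, non-learned kernels, with the max activation and the proportion of positive values as features per kernel. The real method is more involved, so treat this purely as a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_kernels(n_kernels: int):
    """Random, non-learned 1D kernels in the spirit of ROCKET."""
    kernels = []
    for _ in range(n_kernels):
        weights = rng.normal(size=int(rng.choice([7, 9, 11])))
        weights -= weights.mean()                                  # zero-centered weights
        kernels.append((weights, float(rng.uniform(-1, 1)), int(2 ** rng.integers(0, 4))))
    return kernels

def transform(x: np.ndarray, kernels) -> np.ndarray:
    """Two features per kernel: max activation and proportion of positive values (PPV)."""
    feats = []
    for weights, bias, dilation in kernels:
        taps = np.arange(len(weights)) * dilation
        n_valid = len(x) - taps[-1]
        conv = np.array([x[i + taps] @ weights + bias for i in range(n_valid)])
        feats.extend([conv.max(), float((conv > 0).mean())])
    return np.array(feats)

features = transform(np.sin(np.linspace(0, 20, 500)), random_kernels(100))
# These features would then go into a simple linear classifier (ridge regression in the paper).
```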
There you go, a brief overview of the history of recent advancements in deep learning on time series data, as I experienced it.