A study of deep learning on time series data.
This blog is partly written for myself to remind me of past mistakes and lessons learned, and partly a summary of my capabilities.
My obsession with time series classification started with my master's thesis, where we explored whether sound localization could be performed using the microphone of a modern-day smartphone. To this end a neural network was trained on the cepstral coefficients of a sine sweep recorded with said smartphone. This sparked my fascination with deep learning (DL) on time series (TS), a field that is relatively new with a lot of active research.
Having graduated in lockdown, I decided to explore this new-to-me concept further and, as one does, I did so using financial market data. Financial data is quite (extremely) noisy and probably not the best to learn from, but learn I did, and some of those lessons are detailed below.
Below are some concepts I learned that relate not only to DL on time series, but also to R&D as a whole.
Overfitting happens when a model fits the train set so closely that it memorizes the dataset rather than learning the data-target relation. This results in the model returning gibberish when presented with new data. Use model checkpointing based on the loss of a validation set and check for consistent performance on different subsets of the data.
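As a minimal sketch of what that looks like with PyTorch Lightning (which I use elsewhere in this post), assuming the model logs a `val_loss` metric:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the weights that performed best on the validation set,
# rather than whatever the model looks like after the final epoch.
checkpoint = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)
trainer = Trainer(max_epochs=200, callbacks=[checkpoint])
# trainer.fit(model, datamodule=dm)  # model and datamodule defined elsewhere
```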
Classification imbalance occurs when the train set of a classification problem does not have an even distribution of all target classes. Any model trained on this data will have more exposure to a subset of the target labels and in turn over-predict this subset. Over- or under-sampling can help.
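One way to over-sample in PyTorch is a weighted sampler; a rough sketch (the labels here are made up):

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 0, 1, 0, 2])      # hypothetical, heavily skewed labels
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]           # rare classes get drawn more often

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```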
Fat tails are the regression equivalent of classification imbalance. Many machine learning techniques assume a normally distributed target and fail to adapt well when this is not the case. Quantile regression can help, as can pivoting to a classification problem using quantile classification. Sometimes a log transform can turn a non-normal target distribution into an approximately normal one.
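The quantile regression route mostly comes down to swapping the loss; the pinball loss is simple enough to write by hand (a sketch, not tied to any particular library):

```python
import torch

def pinball_loss(pred: torch.Tensor, target: torch.Tensor, q: float = 0.9) -> torch.Tensor:
    """Quantile (pinball) loss: for q > 0.5, under-predicting the q-th quantile
    is penalized more heavily than over-predicting it."""
    error = target - pred
    return torch.mean(torch.maximum(q * error, (q - 1) * error))
```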
Time leaks occur any time data from the future directly or indirectly makes its way into the train set. An example could be using the central difference of a feature: its formula, (x_{t+1} - x_{t-1}) / 2, uses the value at the next timestamp, so the feature at time t already contains information from the future.
Last value forecasting is another error that can occur with regression problems. It entails that a model learns to forecast the current (last known) value instead of the future one, because simply repeating the most recent observation often already yields a deceptively low loss.
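A cheap sanity check is to compare the model against the naive persistence forecast; if it barely beats it, it has probably just learned to copy the last value. A sketch, with all arrays hypothetical:

```python
import numpy as np

def beats_persistence(y_true: np.ndarray, y_pred: np.ndarray, last_values: np.ndarray):
    """Compare the model's error against simply repeating the last observed value."""
    model_mae = np.mean(np.abs(y_true - y_pred))
    naive_mae = np.mean(np.abs(y_true - last_values))
    return model_mae, naive_mae   # the model should beat the naive baseline by a clear margin
```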
Not using walk-forward optimization but cross-validation instead. If you simply shuffle and randomly split a time series, you run the risk of creating a validation set that effectively contains duplicates of train-set samples. The train set might, for example, contain the slice covering timestamps t to t+n while the validation set contains the slice from t+1 to t+n+1; the two overlap almost entirely, so validation performance says little about generalization.
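scikit-learn ships a basic walk-forward split out of the box; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)                    # hypothetical, time-ordered features
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    # every validation fold lies strictly after its train fold in time
    assert train_idx.max() < val_idx.min()
```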
Don't spend too much time developing models. Instead, your time is better spent finding some review papers detailing the current state of the research field most similar to yours and comparing the best performing models in that field. At most, adapting existing architectures to your specific use case is fine. Developing your own models carries the risk of spending too much time coding and fixing bugs. Time is better spent on the data: on data exploration, data cleaning and data preprocessing, and on exploring how a data pipeline will behave in production and whether it is viable at all.
Of vital importance is choosing what to predict. Not only should a model be able to learn to predict it, the prediction must also be convertible into an action. At the end of the day the goal is to convert data, through machine learning, into an action, whatever it may be. It could be a buy, sell or do-nothing signal when talking about financial data, or it could be flagging (or not flagging) something for further investigation in the context of anomaly detection. It could be any number of things.
So whatever a model outputs needs to serve this purpose. First working out the prediction-to-action pipeline to check its viability can save a lot of time in the long run. Maybe what you thought to be a useful metric to predict turns out to be inadequate or insufficient.
In applied machine learning, model development should, in my opinion, take a backseat to pipeline development; from data to action. A pipeline could be data ingress to pre-processing to a model to a prediction to post-processing to an action. All steps need to be in place before any reviewing or testing can be done. Prototyping an entire pipeline first and then iteratively improving each step is a more time efficient approach in the long run compared to trying to perfect one aspect first.
In my experience, developing and training a model takes the longest. So if you first get your data input right, then work on getting a decent model, and only afterwards start turning predictions into actions, only to learn that whatever you wanted to do is unfeasible, you have just wasted a lot of time.
Problems you might encounter are speed (the data preprocessing and/or model might be too slow) and usability (already discussed in "Target choice matters").
Transfer learning is a powerful thing that can be used quite liberally when working with kernel-based models. In case your data has more or fewer input channels than the pretrained model, you can simply repeat or drop the weights of the model's input channels, and it will work just fine. You can also transfer a model from one domain to another. I once transferred a model trained on the CIFAR-10 dataset (trucks, dogs, ...) to a spectrogram dataset, and doing so resulted in a better model, faster, compared to training from random initialization.
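A sketch of that channel trick, here with a hypothetical torchvision ResNet (assuming a recent torchvision version); the same idea applies to any convolutional first layer:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

n_channels = 5                                    # channels in my data (hypothetical)
model = resnet18(weights="IMAGENET1K_V1")         # pretrained on 3-channel images

old_conv = model.conv1                            # weight shape: (64, 3, 7, 7)
new_conv = nn.Conv2d(n_channels, old_conv.out_channels, kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride, padding=old_conv.padding, bias=False)
with torch.no_grad():
    # repeat (or drop) the pretrained input-channel weights to match the new channel count
    new_conv.weight.copy_(old_conv.weight.repeat(1, n_channels // 3 + 1, 1, 1)[:, :n_channels])
model.conv1 = new_conv
```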
Analyse the data, the target and the model separately. Data preprocessing should not be fixed to one single model architecture, nor should an architecture be fixed to one target variable. Keep each component modular during development. In PyTorch Lightning this can be achieved by using the LightningDataModule and LightningModule.
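Roughly the split I mean, as a bare-bones sketch with dummy data and a dummy model:

```python
import torch
import pytorch_lightning as pl
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class SliceDataModule(pl.LightningDataModule):
    """Owns everything data related: loading, preprocessing, splitting (dummy data here)."""
    def setup(self, stage=None):
        x, y = torch.randn(512, 24), torch.randn(512, 1)   # hypothetical slices and targets
        self.train_set = TensorDataset(x[:400], y[:400])
        self.val_set = TensorDataset(x[400:], y[400:])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=64)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=64)

class Forecaster(pl.LightningModule):
    """Owns the architecture, loss and optimizer; knows nothing about where the data lives."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(24, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.mse_loss(self.net(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# pl.Trainer(max_epochs=5).fit(Forecaster(), datamodule=SliceDataModule())
```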
I want to get better at deep learning for regression. Specifically, I want to learn how to work with imbalanced target data and how to do extreme value prediction. The issue with the former is that not all models work well with non-normally distributed data. The challenge with the latter is that extreme values occur less often, so a model gravitates towards being good at predicting what occurs more often.
Time series forecasting comes with its own set of techniques that might differ slightly from other areas of DL. Note that this is all from my own experience and might not follow universally agreed upon best practices.
First off, there is the initial dataset. Make sure you understand how the data is aligned, especially when data is summarized per window. Is the timestamp identifying the window that of the start, middle or end of said window? Align all datasets to the same standard.
Data storage also used to be a point of concern, but nowadays I just use parquet files for offline work and QuestDB or in-memory caching when I need an online solution.
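With pandas, for example, the `label` and `closed` arguments of `resample` decide exactly this; a small sketch with made-up data:

```python
import pandas as pd

s = pd.Series(range(10), index=pd.date_range("2024-01-01", periods=10, freq="min"))

# Same aggregation, but the 5-minute window is stamped with its start vs. its end.
start_stamped = s.resample("5min", label="left", closed="left").mean()
end_stamped = s.resample("5min", label="right", closed="left").mean()
```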
Time series data is typically not independently and identically distributed (IID) which makes missing data imputation hard. I identify areas where any of the datasets used have missing values, potentially take a safety margin in both directions (past and future) and mask (exclude) those timestamps during model development.
Note that there are techniques to convert time series into images. If you are more familiar with image classification, then this might be an interesting approach. Multiple algorithms exist, but the general gist is to compare all points in time to each other, creating a matrix/image.
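A bare-bones version of the idea, without any dedicated library, is a recurrence-plot style matrix:

```python
import numpy as np

def recurrence_image(x: np.ndarray, eps=None) -> np.ndarray:
    """Compare every timestamp with every other one: |x_i - x_j|,
    optionally thresholded into a binary recurrence plot."""
    dist = np.abs(x[:, None] - x[None, :])            # (T, T) matrix, i.e. an image
    return dist if eps is None else (dist < eps).astype(float)

image = recurrence_image(np.sin(np.linspace(0, 8 * np.pi, 256)), eps=0.1)
```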
As usual, input scaling is needed when working with deep learning, but additionally, stationarity is something that needs to be taken into consideration.
With regard to scaling, there are many options. Per-feature standard or robust scaling can work, assuming feature magnitudes are relatively constant over time. If not, frequent retraining might be needed, or alternatively some type of online scaling can be used. The easiest form of online scaling is to scale every input sample/slice into the [0, 1] range. Another approach is standard scaling using the mean and standard deviation of past data only: maybe just that of the last input slice, maybe that of, for example, all of last week's data, or of the last 100 input slices, etc. This approach is easy to implement both in development and in production.
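That last variant in code, as a rough sketch (the choice of history is arbitrary here):

```python
import numpy as np

def scale_slice(current_slice: np.ndarray, history: np.ndarray) -> np.ndarray:
    """Standard-scale the current input slice using only statistics of past data,
    e.g. the last 100 slices worth of values, so no future information is used."""
    mu, sigma = history.mean(axis=0), history.std(axis=0) + 1e-8
    return (current_slice - mu) / sigma
```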
When it comes to stationarity there are again multiple options. One: don't bother, chances are that whatever happens within each individual input slice is stationary enough. This is very problem dependent of course, so if in doubt, or if needed, stationarity can be enforced by taking the first (or maybe even a second) order difference. Another interesting approach is fractional differentiation, which strikes a balance between keeping some memory (magnitude information) and making the series stationary. In practice a weighted sum of the original series is taken, using the weighting scheme w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k, with d the fractional differencing order (d = 1 recovers the ordinary first difference) and k the lag; the differenced value at time t is then the sum over k of w_k * x_{t-k}.
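In code, generating the weights is straightforward (a sketch; in practice the sum is truncated once the weights become negligible):

```python
import numpy as np

def fracdiff_weights(d: float, n: int) -> np.ndarray:
    """Weights for fractional differentiation of order d, newest value first."""
    w = [1.0]
    for k in range(1, n):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

# d = 1 recovers the ordinary first difference: weights [1, -1, 0, 0, ...]
print(fracdiff_weights(0.4, 5))   # [ 1.    -0.4   -0.12  -0.064 ...]
```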
Because time series problems can be quite large and I'm limited to my desktop computer, I care a great deal about fast model training. One plug-and-play solution that helps greatly is using AdaBelief as the optimizer. Another is a decreasing learning rate that starts quite large. Personally, I adapted PyTorch's cosine annealing with warm restarts learning rate scheduler to have a decreasing amplitude over time (shared in this GitHub repo), but other schedulers can work great too. A third tool that unfortunately isn't always available is transfer learning, as discussed in concepts. Do use it whenever you can find a pretrained model with an input format similar to yours.
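For reference, the stock versions of both (my decaying-amplitude variant of the scheduler lives in the repo mentioned above; this sketch assumes the adabelief_pytorch package):

```python
import torch
from adabelief_pytorch import AdaBelief   # assumption: the adabelief_pytorch package is installed

model = torch.nn.Linear(10, 1)             # stand-in for an actual model
optimizer = AdaBelief(model.parameters(), lr=1e-3)
# Warm restarts: the learning rate decays from lr towards zero, restarting every
# T_0 epochs, with each cycle twice as long as the previous one.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(70):
    # ... one epoch of training here ...
    scheduler.step()
```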
On the usage of models, I have two things to offer. Scale your action in relation to a model's confidence, and use a weighted ensemble of models. Regarding that first point, I must say that this once again depends on the problem itself, but whenever an action (the thing that happens as a result of a model's prediction) can be scaled, scale it in relation to the confidence a model displays. Secondly, there is value in combining the predictions of multiple models. In fact, InceptionTime, the current state of the art in time series classification, uses the average prediction of 5 models. An equal average of similar models should be taken, but when combining different model architectures it might pay to use an uneven weighting scheme where one model is valued more than another, e.g., if regime changes are expected.
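The ensembling itself is trivial; a sketch with made-up softmax outputs and hand-picked weights:

```python
import numpy as np

# Per-model class probabilities for one sample, e.g. from three different architectures.
preds = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
weights = np.array([0.4, 0.4, 0.2])        # trust the third architecture a bit less

ensemble = weights @ preds                  # weighted average prediction
ensemble /= ensemble.sum()                  # renormalize (already ~1 if weights sum to 1)
```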
The technical skills I have developed can be split into four broad categories.
Over the years, I have written a lot of code: mostly the code needed to process data and to train models, but also code that ran 24/7. Doing so, I got better at both kinds of programming.
I have experience both writing production code that can run 24/7 and writing the code needed for data analysis and model training.
I might have had a desire to apply deep learning to time series, but the fact of the matter was that my Python skills were limited and I did not know how the financial markets actually work. Needless to say, some studying was needed. I learned to learn from the internet: YouTube tutorials, blog posts, dedicated educational websites, academic papers, and plain old trial and error, a slow but effective tutor.
Importantly, I learned to be critical when interpreting the results of studies on market data, as there are many pitfalls that can make models seem better than they are. In short, the old adage rings true: if something seems too good to be true, it probably is. Academic papers and amateur blogs alike tend to inflate their results, knowingly or not.
I am good at familiarizing myself with new domains, researching techniques that might fit that domain and being critical of said techniques.
Good data exploration requires a lot of plotting, preferably with plots that pack in as much information as possible.
I am proficient in extracting meaningful insights from complex datasets.
The best way to analyse a model is to run it live. Unfortunately this is not always feasible nor a good idea.
Offline, the classic cross-validation or walk-forward validation are useful. Training and error metrics are also good tools to get a macro idea of a model's performance. Confusion matrices and scatter plots are great ways to get a visual representation of model performance and are typically also a good starting point for exploring edge cases: think false positives and false negatives, extreme values that are missed or average samples that are overpredicted.
Just make sure not to "mentally overfit" to the data. Be careful not to develop a bias of what a model should do and then finetune said model to that end.
To evaluate online performance, a simulation can be used: either feed offline data incrementally, or use live data in a development environment.
I am proficient in model evaluation, both at the macro and micro level.
Some frameworks and tools I have used and found interesting have already come up throughout this post: PyTorch Lightning, QuestDB, parquet and the AdaBelief optimizer.
One of my character flaws (and another reason to iterate quickly) is that I chain together assumptions. Assumptions in and of themselves are not necessarily bad, and there will always be some assumptions made about how the data behaves or what a model might learn, but the danger is to keep developing, without sufficient testing, on the premise that previous assumptions were right. Any initial misunderstanding will cascade down, making all further development worthless. When I learn a new concept I tend, once I think I have a good grasp of it, to start working out an entire roadmap of what to do when. If I then learn that my initial understanding was flawed or incomplete, that roadmap is wasted. Not only was all that thinking a waste of time, it is just not fun to discover that the entire system you spent time developing, even if just on paper, needs to be binned.
A second flaw of mine is hyperfocus, or perfectionism. I sometimes focus too much on things that are just not that important: I feel I have to fix that one edge case or make the code more modular before I can move on to what actually matters for creating a prototype.
On the flip side, I have come to learn that I am persistent, even after repeated failures; that I am creative, with ideas flowing naturally; and, something I have known for a while now, that I have an insatiable curiosity. Combine this with self-management and self-motivation, and I thrive when learning and progressing.
In summary, we went from LSTMs to TCNs to InceptionTime and then beyond.
LSTMs, or long short-term memory networks, were a first successful attempt at using deep learning on time series (TS) data. Over time they were found to be too complicated to train and a better alternative was needed. Shaojie Bai et al. proposed using temporal convolutional networks (TCN) in their landmark paper "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". Using 1D convolutions on time series data proved to be the way forward.
People also experimented with using 2D image classification techniques on TS data. Some simply plotted their data, as if taking a screenshot of a graph, others used techniques like recurrence plots or Markov transition fields to relate each point in time to all other points in time, making a 2D matrix that could be used with any existing 2D CNN architecture.
1D convolutions had some things going for them. Firstly, by only working in one dimension, the convolutional kernels used fewer weights than their 2D counterparts. A 2D kernel of size 3 (3x3) has 9 total weights that need to be learned (per kernel), one of size 5 results in a total of 25, and a kernel of size 7 leads to 49 total weights. Larger kernels can be used with 1D convolutions, e.g. k={9, 25, 49}, without incurring additional memory requirements, which leads to a greater receptive field. This receptive field proves to be vitally important with TS data.
It is impossible to know beforehand at which resolution relational information is held: is the way neighbouring timestamps interact informative, or is it how timestamps separated by, say, 3 hours interact, or both? To relate far-apart timestamps, a wide receptive field is needed; to discover interactions of varying sizes (varying numbers of timestamps), kernels of different sizes can be helpful.
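A quick sanity check of those counts (biases aside):

```python
import torch.nn as nn

conv2d = nn.Conv2d(1, 1, kernel_size=7, bias=False)   # 7x7 kernel
conv1d = nn.Conv1d(1, 1, kernel_size=49, bias=False)  # same weight count, far wider reach in time
print(conv2d.weight.numel(), conv1d.weight.numel())   # 49 and 49
```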
Other ways of increasing the receptive field, besides larger kernels, are dilation and depth. Increased dilation directly leads to a wider receptive field, while increased depth leads to a wider receptive field in the deeper layers: input timestamps that are originally far apart end up closer together in the derived representations of the deeper layers.
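For a simple stack of dilated convolutions, one convolution per layer with kernel size k and dilations doubling each layer, the receptive field works out to 1 + (k - 1) * (2^L - 1); a quick check:

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of a stack of 1D convolutions, one conv per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(3, [1]))            # 3: a single undilated conv sees 3 timestamps
print(receptive_field(3, [1, 2, 4, 8]))   # 31: four dilated layers already span 31 timestamps
```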
Then came InceptionTime, introduced in the boldly titled paper "InceptionTime: Finding AlexNet for time series classification" by Hassan Ismail Fawaz et al. This architecture uses kernels of differing sizes, combined with depth, to create a model that is still relevant today and tends to be used as a benchmark.
Interestingly, this architecture uses a bottleneck layer, which is a 1D convolutional layer with kernel size 1. This layer can also be thought of as a spatial layer that only looks at how the channels of a multivariate time series interact with each other at each timestamp. This spatial aspect presents a very interesting and still somewhat open question: how do we handle multivariate (or spatio-temporal) time series data? Here we are not only looking at how multiple variables change over time, but also at how they interact with each other at each time step, and how this interaction itself changes over time.
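The bottleneck itself is just this (channel counts picked arbitrarily):

```python
import torch
import torch.nn as nn

# A kernel-size-1 convolution mixes the channels of a multivariate series at each
# timestamp independently, here compressing 32 input channels down to 8.
bottleneck = nn.Conv1d(in_channels=32, out_channels=8, kernel_size=1)
x = torch.randn(4, 32, 128)          # (batch, channels, timestamps)
print(bottleneck(x).shape)           # torch.Size([4, 8, 128])
```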
Currently active research seems to revolve around fixed (i.e. non-learned) convolutional kernels, either randomly generated or handcrafted, as instigated by the paper "ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels" by Angus Dempster et al., and around continuous convolutions, where the kernels themselves are not learned directly but something else is trained to generate them. This last part might not be 100% correct, since it is a new-to-me concept and I haven't properly studied it yet.
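To give a flavour of the fixed-kernel idea, here is a heavily simplified ROCKET-style transform: random, non-learned kernels, with the max activation and the proportion of positive values as features per kernel. The real method is more involved, so treat this purely as a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_kernels(n_kernels: int):
    """Random, non-learned 1D kernels in the spirit of ROCKET."""
    kernels = []
    for _ in range(n_kernels):
        weights = rng.normal(size=int(rng.choice([7, 9, 11])))
        weights -= weights.mean()                                  # zero-centered weights
        kernels.append((weights, float(rng.uniform(-1, 1)), int(2 ** rng.integers(0, 4))))
    return kernels

def transform(x: np.ndarray, kernels) -> np.ndarray:
    """Two features per kernel: max activation and proportion of positive values (PPV)."""
    feats = []
    for weights, bias, dilation in kernels:
        taps = np.arange(len(weights)) * dilation
        n_valid = len(x) - taps[-1]
        conv = np.array([x[i + taps] @ weights + bias for i in range(n_valid)])
        feats.extend([conv.max(), float((conv > 0).mean())])
    return np.array(feats)

features = transform(np.sin(np.linspace(0, 20, 500)), random_kernels(100))
# These features would then go into a simple linear classifier (ridge regression in the paper).
```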
There you go, a brief overview of the history of recent advancements in deep learning on time series data, as I experienced it.