Using transformer on timeseries

vdg · December 1, 2020, 3:28pm

Hi, I am trying to get a transformer to do some simple timeseries forecasting, but I am struggling with finding the right way to present the data to the network.

The input and target should have dimensions {batch, sequence, features}.
If I want to train it on a sequence (1,2,3,…) with the goal to predict the next number, how should I format the data?

My guess would be
[1, 2, 3, 4;
2, 3, 4, 5;
3, 4, 5, 6]
So I have feature size 3, and sequence size 4 in this case. The target would be shifted right one column:
[ 2, 3, 4, 5;
3, 4, 5, 6;
4, 5, 6, 7]

Then, to train the transformer something like:

tf_model = torch.nn.Transformer()
optimizer = torch.optim.Adam(tf_model.parameters())
loss_fn = torch.nn.CrossEntropyLoss()  # or another loss

out = tf_model(input, target)
loss = loss_fn(out, target)
loss.backward()
optimizer.step()

Can anyone point me to some examples where the transformer module is used on something else than NLP?

utkuumetin · December 3, 2020, 11:49am

Hi! You are right. For example, if you want to predict the next token in a sequence, you set it up like this:

Input: [1, 2, 3, 4, 5]
Output: [2, 3, 4, 5, 6]
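
As a small illustration of the input/output pairing above, here is a sketch in plain Python. The helper name `make_pairs` and the window length are assumptions made for this example, not something taken from the thread's code.

```python
def make_pairs(series, window):
    """Return (inputs, targets) where each target window is the input
    window shifted forward by one step."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + 1:start + window + 1])
    return inputs, targets


series = list(range(1, 9))          # 1, 2, ..., 8
inputs, targets = make_pairs(series, window=5)
print(inputs[0])   # [1, 2, 3, 4, 5]
print(targets[0])  # [2, 3, 4, 5, 6]
```

Each target window carries exactly one value the matching input window has not seen, which is the "next number" the model is asked to predict.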

Attention mechanisms are more effective than older models at predicting sequences that follow certain rules, for example in language models or protein sequences.

If you want to understand transformers and the attention mechanism, you can look at the GPT source code; it's not very complex and it's understandable.

vdg · December 3, 2020, 1:47pm

So, as a simple example, I would say this should work, but the output does not match the target even after a lot of epochs. With Adam it barely moves; with SGD, or Adam with a high LR, it converges to a wrong solution where each row is filled with entries in the range 50-65. CrossEntropyLoss seems to be for multiclass classification, so that's why I tried MSELoss.

Any ideas how to make output converge to the target?
And how to sample multiple batches from the series?

import torch
import torch.nn as nn

data = torch.arange(0, 1000000, 1).float()

feature_size = 16
sequence_size = 100

# Sliding windows over the series: row i holds data[i : i + feature_size]
src = torch.as_strided(data, (sequence_size, feature_size), (1, 1)).unsqueeze(1)
# Same windows shifted forward by one step
target = torch.as_strided(data[1:], (sequence_size, feature_size), (1, 1)).unsqueeze(1)
print(target.size())
# size = (Sequence, Batch, Features)

# Arguments spelled out: d_model, nhead, encoder layers,
# decoder layers, dim_feedforward, dropout
tf_model = nn.Transformer(d_model=feature_size, nhead=8, num_encoder_layers=2,
                          num_decoder_layers=2, dim_feedforward=2, dropout=0.2)

src_mask = tf_model.generate_square_subsequent_mask(sequence_size)
optimizer = torch.optim.SGD(tf_model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for epoch in range(1000):
    optimizer.zero_grad()
    out = tf_model(src, target, src_mask)
    loss = loss_fn(out, target)
    loss.backward()
    optimizer.step()

out = tf_model(src, target)
print(target)
print(out)
# Should be the same after some training?
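
One thing that may be worth checking in the code above: the causal mask reaches the model as `src_mask`, so it restricts the encoder while the decoder still sees the whole target it is asked to reproduce. Below is a minimal sketch of passing a causal mask as `tgt_mask` instead. The tensor sizes and random inputs are placeholders chosen for this example, and the mask is built by hand with `torch.triu` rather than with the model's helper method.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, d_model = 10, 1, 16

model = nn.Transformer(d_model=d_model, nhead=8, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32, dropout=0.0)

src = torch.randn(seq_len, batch, d_model)      # encoder input (seq, batch, features)
dec_in = torch.randn(seq_len, batch, d_model)   # decoder input (seq, batch, features)

# Causal mask: -inf above the diagonal, so step t only attends to steps <= t
tgt_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

model.eval()
with torch.no_grad():
    out = model(src, dec_in, tgt_mask=tgt_mask)

    # Change only the last decoder step and run the model again
    dec_in2 = dec_in.clone()
    dec_in2[-1] += 1.0
    out2 = model(src, dec_in2, tgt_mask=tgt_mask)

print(out.shape)
# With the mask in place, earlier output steps should not depend on the last input step
print(torch.allclose(out[:-1], out2[:-1], atol=1e-6))
```

In a typical teacher-forcing setup the decoder input is also the target shifted by one step, so that position t never receives the value it has to predict.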

vdg · December 3, 2020, 2:20pm

Playing around with the params, I find that it quickly converges to outputting
[[50.5000, 51.5000, 52.5000, …, 63.4999, 64.5000, 65.5000]]], for each of the 100 rows of the output.
So the wrong thing is being optimized, it seems.
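
One way to read those numbers: they match the average of the target over its 100 rows, which is what a model produces when it ignores its input and emits a single constant row that minimizes MSELoss. A quick check in plain Python, rebuilding the same target windows without PyTorch (the variable names are chosen for this sketch):

```python
sequence_size, feature_size = 100, 16

# target[i][j] = 1 + i + j, the same values as the shifted sliding windows
target = [[1.0 + i + j for j in range(feature_size)] for i in range(sequence_size)]

# Average each column over the 100 rows
column_mean = [sum(row[j] for row in target) / sequence_size
               for j in range(feature_size)]
print(column_mean[:3], column_mean[-3:])
# [50.5, 51.5, 52.5] [63.5, 64.5, 65.5]
```

If that reading is right, the model has collapsed to a constant prediction. Rescaling the data before training (for example, standardizing it) is a common thing to try in that situation.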

utkuumetin · December 3, 2020, 7:05pm

I quickly read your code and saw a small problem. As I understand it, you are trying to predict a 16-length array, e.g.
Input: [ 1., 2., 3., …, 14., 15., 16.]
Output: [ 2., 3., 4., …, 15., 16., 17.]

The Transformer model takes input of shape (sequence length, batch size, features). For example, if I have a 10-word sequence and need to predict the 11th word, my input shape has to be (10, 1, feature_size).

Please correct me if I misunderstood your input/output shapes.
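
To make the expected layout concrete, here is a small sketch that reshapes a univariate series into `(seq_len, batch_size, feature_size)`, first as a single sequence and then as a batch of sliding windows. The sizes are placeholders for this example only.

```python
import torch

series = torch.arange(1, 11, dtype=torch.float32)   # 10 time steps: 1.0 ... 10.0

# One sequence, one feature per step: (seq_len, batch_size, feature_size)
src = series.view(10, 1, 1)
print(src.shape)            # torch.Size([10, 1, 1])

# Three sequences of length 4 in one batch, built with sliding windows
windows = series.unfold(0, 4, 1)[:3]                 # (batch, seq_len) = (3, 4)
batch = windows.t().unsqueeze(-1)                    # (seq_len, batch, 1)
print(batch.shape)          # torch.Size([4, 3, 1])
print(batch[:, 0, 0])       # tensor([1., 2., 3., 4.])
```

Taking different slices of `series.unfold(...)` is also one simple way to draw several batches from the same series.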

These tutorials can help you:

vdg · December 3, 2020, 9:29pm

Thanks for your help!

I used 16 features, so a “history” on each timestep, but the sequence size is 100. I could also set the feature size to 1, to make it more clear. I hoped having some history in the feature vector would be beneficial.

Now that I look at it, I think the positional encoding is missing. I thought this was incorporated in the nn.Transformer() block.

So I guess I have to define a positional encoding module, and then do out = tf_model.forward(positional_encoder.forward(src),positional_encoder.forward(target))
And then some decoder?
Or don’t I have to do the positional encoding on the target?

Thanks!

utkuumetin · December 4, 2020, 9:10am

You have to code the positional encoding yourself; PyTorch doesn't have it built in.

I work with Transformers in NLP, and I usually use only the Transformer encoder layers for forecasting, so here is a simple example:

import torch
import torch.nn as nn

class Example(nn.Module):
    def __init__(self, vocab_size, feature_size):
        super(Example, self).__init__()
        self.embedding = nn.Embedding(vocab_size, feature_size)
        self.layers = nn.TransformerEncoderLayer(d_model=feature_size, nhead=8)
        self.transformer = nn.TransformerEncoder(self.layers, num_layers=6)
        self.decoder = nn.Linear(feature_size, vocab_size)
        
    def forward(self, x):
        # X shape: [seq_len, batch_size]
        print(x.shape)
        x = self.embedding(x)
        # X shape: [seq_len, batch_size, feature_size]
        print(x.shape)
        x = self.transformer(x)
        # X shape: [seq_len, batch_size, feature_size]
        print(x.shape)
        x = self.decoder(x)
        # X shape: [seq_len, batch_size, vocab_size]
        print(x.shape)
        
        return x
    
model = Example(10, 512)
src = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]).view(10, 1)
model(src)

vdg · December 8, 2020, 8:16am

Thanks for your help!

Using your example, I made it continue counting the Long ints (or finish "helloworl"):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)





class Example(nn.Module):
    def __init__(self, vocab_size, feature_size):
        super(Example, self).__init__()
        self.pos_encoder = PositionalEncoding(feature_size)
        self.embedding = nn.Embedding(vocab_size, feature_size)
        self.layers = nn.TransformerEncoderLayer(d_model=feature_size, nhead=8)
        self.transformer = nn.TransformerEncoder(self.layers, num_layers=6)
        self.decoder = nn.Linear(feature_size, vocab_size)
        
    def forward(self, x):
        # X shape: [seq_len, batch_size]
        print("Input size [seq_len, batch_size]")
        print(x.shape)
        x = self.embedding(x)
        print("Embedding size [seq_len, batch_size, feature_size]")
        print(x.shape)
        x = self.pos_encoder(x)
        # X shape: [seq_len, batch_size, feature_size]
        print(x.shape)
        x = self.transformer(x)
        # X shape: [seq_len, batch_size, feature_size]
        print(x.shape)
        x = self.decoder(x)
        # X shape: [seq_len, batch_size, vocab_size]
        print(x.shape)        
        return x
    

ntokens = 10000
data = torch.arange(0,ntokens,1)

feature_size = 30
sequence_size = 100
src = torch.as_strided(data,(sequence_size,feature_size),(1,1))
tgt = torch.as_strided(data[1:],(sequence_size,feature_size),(1,1))
print(src.shape)
print(tgt.shape)

model = Example(ntokens, 512)

# Hello world: A=0, Z=25
# src = torch.LongTensor([[7,4,11,11,14,22,14,17,11]]).view(9,1) #helloworl
# tgt = torch.LongTensor([[4,11,11,14,22,14,17,11,3]]).view(9,1) #elloworld

# 1234
# src = torch.LongTensor([[1,2,3,4,5,6,7,8,9]]).view(9,1) 
# tgt = torch.LongTensor([[2,3,4,5,6,7,8,9,10]]).view(9,1) 

criterion = nn.CrossEntropyLoss()
lr = 1.0 # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

for epoch in range(100):
    out = model(src)
    outv =out.view(-1,ntokens)

    print(outv)
    print(outv.shape)

    optimizer.zero_grad()
    loss = criterion( out.view(-1,ntokens), tgt.reshape(-1) )
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()
model.eval()  # switch off dropout before checking predictions
out = model(src)
print(out.shape)
outvals = torch.argmax(out,2)
print(outvals)
print(outvals.shape)

test1 = torch.arange(0,30,1).unsqueeze(1)
test2 = torch.arange(500,600,1).unsqueeze(1)
test3 = torch.arange(10,50,1).unsqueeze(1)

print(test1.shape)
print(test2.shape)
print(test3.shape)

testout1 = model(test1)
testout2 = model(test2)
testout3 = model(test3)

print(testout1.shape)
print(testout2.shape)
print(testout3.shape)
testout1=torch.argmax(testout1,2)
testout2=torch.argmax(testout2,2)
testout3=torch.argmax(testout3,2)

print(testout1.shape)
print(testout2.shape)
print(testout3.shape)

print(testout1)
print(testout2)
print(testout3)

Now, for a time series of floats, I would have to get rid of the embedding and replace it with a general Linear layer (I think). CrossEntropyLoss would also have to be replaced by a regression loss (MSELoss, …).
And the positional encoding would probably become more of an interference when working with floats.
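
A rough sketch of what that float variant could look like, under stated assumptions: the class name `FloatForecaster`, the layer names `input_proj` and `output_proj`, the sine-wave toy data, and all hyperparameters are made up for illustration. Positional encoding is left out to keep the sketch short; a module like the `PositionalEncoding` above could be applied right after `input_proj`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class FloatForecaster(nn.Module):
    """Hypothetical regression variant: Linear in, Linear out, no Embedding."""

    def __init__(self, n_features=1, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)   # replaces nn.Embedding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=64, dropout=0.0)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output_proj = nn.Linear(d_model, n_features)  # one float per step

    def forward(self, x):
        # x: [seq_len, batch_size, n_features]
        seq_len = x.size(0)
        # Causal mask so each step only sees itself and earlier steps
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.encoder(self.input_proj(x), mask=mask)
        return self.output_proj(h)


# Toy data: a standardized sine wave, split into input and shifted target
t = torch.linspace(0, 12.0, 65)
series = torch.sin(t)
series = (series - series.mean()) / series.std()
src = series[:-1].view(-1, 1, 1)   # [64, 1, 1]
tgt = series[1:].view(-1, 1, 1)    # [64, 1, 1]

model = FloatForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()             # regression loss instead of CrossEntropyLoss

losses = []
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(src), tgt)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Standardizing the series first keeps the inputs and targets on a scale the LayerNorm-based encoder can reach comfortably, which raw counts in the hundreds or thousands are not.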

utkuumetin · December 8, 2020, 1:42pm

I’m glad I could help. The Embedding layer takes LongTensor indices; if you want float inputs you need another solution. But Transformers are usually used with sequences of repeating tokens such as DNA, RNA, text, or images, so I don't recommend them for use cases other than text, images, or biology.

Quick update:
You can look at this post on Reddit.

superunification · December 8, 2020, 3:08pm

Yeah, to leverage attention, you need to be predicting a finite number of embeddings that hopefully you see lots of in training.
So to use this for time series prediction, you want a transformer to operate on higher level, discrete features than the sample space. Applying it directly to samples is like a classification problem with 2^16 classes (for 16 bit audio, say), which is probably too many and this problem formulation ignores the inherent correlation between classes.

Transformers should be used to predict things like beats, words, high level recurring patterns.

An architecture might be Time series → Conv blocks → quantization → Transformer → Deconv → Fully connected → Time series. Check out Facebook’s Wav2Vec paper for such an example.
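
The "quantization" step can be pictured with a much simpler stand-in than the learned quantizer in wav2vec: uniform binning of float samples into a small vocabulary of token ids, which an Embedding-based model like the one earlier in the thread could then consume. The function names and the bin count below are assumptions for this sketch.

```python
def quantize(samples, lo, hi, n_bins):
    """Map floats in [lo, hi] to integer token ids 0 .. n_bins - 1."""
    width = (hi - lo) / n_bins
    ids = []
    for x in samples:
        idx = int((x - lo) / width)
        ids.append(min(max(idx, 0), n_bins - 1))   # clamp to the valid range
    return ids


def dequantize(ids, lo, hi, n_bins):
    """Map token ids back to the centre of their bin."""
    width = (hi - lo) / n_bins
    return [lo + (i + 0.5) * width for i in ids]


samples = [-1.0, -0.3, 0.0, 0.49, 1.0]
tokens = quantize(samples, lo=-1.0, hi=1.0, n_bins=8)
print(tokens)                                   # [0, 2, 4, 5, 7]
print(dequantize(tokens, -1.0, 1.0, 8))         # [-0.875, -0.375, 0.125, 0.375, 0.875]
```

With 8 bins the vocabulary is tiny compared with the 2^16 classes mentioned above, at the cost of a reconstruction error of up to half a bin width.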

nurkbts · December 25, 2020, 6:09pm

Hello, I’m trying to build a time series classification application with a transformer right now. My data is kinematic data from a video: each row corresponds to a frame, and there are 14 features + 1 label in my dataset. I have a few examples from GitHub at hand, but how should I feed my data to the model as a dataset? The forecasting examples don't use labels, so how should I provide them? I would be glad for a quick reply, thank you.