I am trying to build a GRU RNN for a variable-length sequence prediction task. Each element of every sequence is a 13-d vector whose components are binary (0 or 1). The RNN needs to predict the next element of a sequence from the elements seen so far. So far I have considered these approaches to implement this network:
- Encode the input and output vectors as one-hot vectors, so the output is a softmaxed probability distribution. In this case, the performance of the RNN is excellent.
- Just feed the 13-d vector into the RNN and output a 13-d vector, with the loss calculated via cross entropy. In this case, the performance is not good enough; it clearly needs promotion.
- Use a CNN as an encoder for the RNN. I have not tried this yet.
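For concreteness, the first two approaches differ in what the network's final activation should produce: a single softmax distribution over all possible states (method 1) versus 13 independent sigmoid probabilities (method 2). A minimal NumPy sketch of the two output formats (the state count here is illustrative, not from my dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states = 16                            # illustrative vocabulary size for method 1
logits_m1 = rng.normal(size=num_states)    # method 1: one logit per possible state
logits_m2 = rng.normal(size=13)            # method 2: one logit per binary dimension

# Method 1: softmax -> one categorical distribution over states (sums to 1)
probs_m1 = np.exp(logits_m1 - logits_m1.max())
probs_m1 /= probs_m1.sum()

# Method 2: sigmoid -> 13 independent Bernoulli probabilities (need not sum to 1)
probs_m2 = 1.0 / (1.0 + np.exp(-logits_m2))
```

The key difference is that method 1 treats prediction as one classification over states, while method 2 treats it as 13 separate binary predictions.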
Our group hopes to promote the performance of method 2 through hyper-parameter adjustment. However, it seems that this "encoding" approach has intrinsic disadvantages regardless of how the parameters are tuned.
Does anybody have ideas on how I could choose an encoder, or otherwise promote method 2?
A couple of things I’m not sure I understand:
- Is each element in the sequence a 13-d vector or just a binary value? In the second case, the input size is small and fixed; did you consider using traditional networks like an MLP?
- If each element is a 13-d vector, how could you one-hot it in method #1? If it is a binary value, isn't one-hot encoding basically the same as passing in the value?
- What is the difference between method #2 and method #1? It seems that in both cases the inputs and outputs are binary. And cross entropy is used with softmax as usual for method #1, right?
- What do you mean by promotion?
- For method #3, is this an encoder-decoder model? I thought you only want to predict the next element, but it seems that the decoder is an RNN?
Sorry for the delay.
Yes, each element in the sequence is a 13-d vector, and each component of each element is binary (e.g. one element = [0,0,0,0,0,0,1,1,0,0,0,0,1]).
Actually, the set of possible element states in my dataset is countable (say it has 400 elements). So I chose to encode each element as a 400-d one-hot vector, just like character encoding in a char-RNN.
As I said above, with method #1 I send the 400-d one-hot vector into the RNN and get a 400-d (normalized) vector back, which I view as a probability distribution. Then I take the position of the maximum value of the output vector, decode it to a one-hot vector, and then back to the original 13-d vector. Yes, cross entropy is used with softmax for this method, as usual.
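That encode/decode pipeline can be sketched as follows (the vocabulary below is a hypothetical stand-in for the ~400 distinct states that would be collected from the dataset):

```python
import numpy as np

# Hypothetical vocabulary of distinct 13-d binary states seen in the dataset.
vocab = [
    (0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
    (1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1),
]
state_to_idx = {s: i for i, s in enumerate(vocab)}

def encode(element):
    """13-d binary vector -> one-hot vector over the state vocabulary."""
    onehot = np.zeros(len(vocab))
    onehot[state_to_idx[tuple(element)]] = 1.0
    return onehot

def decode(output_probs):
    """Output distribution -> most likely state -> original 13-d vector."""
    return list(vocab[int(np.argmax(output_probs))])

x = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
roundtrip = decode(encode(x))   # recovers the original element
```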
Performance means the correct rate of prediction in each step.
It is not a seq2seq model; what I want is to supply the RNN with a kind of "sparse representation" of the original 13-d vector as input. So the encoder and decoder may just be a sparse encoder network.
I hope these answers help clarify things.
Cross entropy is only expected to work when the targets form a probability distribution (non-negative, summing to 1). If you don't have that, you need to move to multi-label loss functions to make variant 2 work. Depending on your problem, you could also normalize the targets and use a threshold to predict 1s.
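A NumPy sketch of such a multi-label setup, treating the 13 dimensions as independent binary targets (the 0.5 threshold is a common default, not something specific to your data):

```python
import numpy as np

def multilabel_bce(probs, targets):
    """Mean of 13 independent binary cross entropies (multi-label loss)."""
    eps = 1e-12
    probs = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    return -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))

target = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1], dtype=float)

# Pretend these are the network's 13 sigmoid outputs for one step.
probs = np.full(13, 0.1)
probs[[6, 7, 12]] = 0.9

loss = multilabel_bce(probs, target)
pred = (probs > 0.5).astype(int)   # threshold to predict the 1s
```

In a framework this would be the usual "sigmoid output + per-dimension binary cross entropy" combination rather than softmax + categorical cross entropy.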
Yes, in method #1 cross entropy is valid, but not for the case of 13-d in and 13-d out, so there I chose to use mean squared error as the loss function.
Mean squared error is usually not considered a great loss function for this. Did you try using 13 binary cross entropies instead?
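One reason MSE tends to underperform here: with a sigmoid output unit, the MSE gradient vanishes when the unit saturates on the wrong answer, while the binary cross entropy gradient stays large. A small illustration (the numbers are made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A confidently wrong unit: large negative logit, but the target is 1.
z, t = -5.0, 1.0
p = sigmoid(z)  # close to 0

# Gradients of each loss with respect to the logit z:
grad_bce = p - t                      # d/dz of binary cross entropy
grad_mse = 2 * (p - t) * p * (1 - p)  # d/dz of (p - t)^2, shrunk by p(1-p)
```

Here `abs(grad_bce)` is close to 1 while `abs(grad_mse)` is close to 0, so BCE keeps learning where MSE stalls.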
I see. From my understanding, it seems that a non-recurrent network would be better suited for this job.
For method#2, if the output doesn’t sum to 1, cross entropy isn’t really valid.