[LSTM] Different outputs for identical sequences in a batch

I noticed that after training my LSTM with a given batch size, there seem to be as many versions of the LSTM network as there are elements in the batch. Indeed, after the training stage, when I pass my LSTM a batch in which all sequences are intentionally identical (I simply duplicate one sequence as many times as there are sequences in a batch for this training instance), the outputs are surprisingly different for each (identical) sequence.

I know that to train an LSTM on a batch (and not just one sequence at a time), one copy of the network is maintained for each example in the batch, but normally these parallel networks are merged at the end of training, aren't they? Here, however, it seems that the different versions of the network are not merged. Maybe there is something I do not understand. Could you explain where this behaviour comes from and how to fix it, please?
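(For context, here is how I picture batched evaluation: a single nn.LSTM module, and therefore a single set of weights, processes the whole batch in one forward call. This is just a minimal standalone sketch with made-up sizes, not my actual model.)

import torch
import torch.nn as nn
from torch.autograd import Variable

# One module = one set of weights, whatever the batch size is.
lstm = nn.LSTM(input_size=6, hidden_size=4)
print([p.size() for p in lstm.parameters()])   # parameter shapes do not depend on the batch size

batch = Variable(torch.randn(5, 3, 6))         # (seq_len, batch, features)
out, (h_n, c_n) = lstm(batch)                  # no state given -> zero initial hidden state
print(out.size())                              # (5, 3, 4)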

Here is my dataset:

training_set = \
            [[[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0]],
             [[0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 1, 0]],
             [[0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 0]],
             [[0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0]],
             [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]],
             [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1]]]

So it consists of 3 sequences of length 5, with 6 features per element of a sequence. I then train the LSTM with a batch size of 3, that is to say on the whole dataset, so the input has dimensions (5, 3, 6). Once training is over, I want to use the LSTM on just one sequence at a time (and not on a batch of 3 elements). That is why I wrote the following function:
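(For completeness, the conversion from the nested list to the input Variable is just a cast. This is a sketch that assumes the list is already laid out as (seq_len, batch, features), which is the default layout nn.LSTM expects; if it were stored batch-first instead, a transpose(0, 1) would be needed before feeding it to the LSTM.)

import torch
from torch.autograd import Variable

# Cast the nested Python list to a Variable for the LSTM.
inputs = Variable(torch.FloatTensor(training_set))
print(inputs.size())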

def predict(model, seq: autograd.Variable, batch_size):
    if not isinstance(seq, autograd.Variable):
        raise TypeError("seq must be an autograd.Variable")

    # Add a batch dimension if the sequence is given as (seq_len, features)
    if len(seq.size()) == 2:
        seq = seq.view(len(seq), 1, -1)

    # Replicate the single sequence along the batch dimension
    sizes = seq.size()
    if sizes[1] == 1:
        seq = seq.expand(sizes[0], batch_size, sizes[2])

    return model(seq)

This function simply replicates my sequence seq to obtain a batch of the size required by my model, model being an LSTM stacked with a linear layer and a log-softmax output layer.
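(For reference, the model is along these lines. This is only a sketch of the architecture I described, with made-up layer sizes and a hypothetical class name; how my real model handles and stores its hidden state is not reproduced here.)

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class SequenceClassifier(nn.Module):
    # Sketch: an LSTM stacked with a linear layer and a log-softmax output.
    def __init__(self, n_features=6, hidden_size=16, n_classes=6):
        super(SequenceClassifier, self).__init__()
        self.lstm = nn.LSTM(n_features, hidden_size)
        self.out = nn.Linear(hidden_size, n_classes)

    def forward(self, seq):
        # seq: (seq_len, batch, n_features)
        lstm_out, _ = self.lstm(seq)            # (seq_len, batch, hidden_size)
        scores = self.out(lstm_out)             # (seq_len, batch, n_classes)
        return F.log_softmax(scores, dim=2)     # log-probabilities per time step

# Usage of predict() with a single (seq_len, n_features) sequence from the dataset:
model = SequenceClassifier()
seq = Variable(torch.FloatTensor(training_set[0]))
output = predict(model, seq, batch_size=3)      # shape (seq_len, 3, n_classes)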

Normally, at this stage, the output should be identical for each element along the batch dimension (the second dimension, which indexes the elements of a batch), since the input sequences are identical.

To test this, I duplicated the first sequence with expand, trained for just 10 epochs, and here is the output:

Variable containing:
(0 ,.,.) = 
 -1.8790 -1.7101 -1.8548 -1.7101 -1.7329 -1.8819
 -1.8773 -1.7029 -1.8542 -1.7156 -1.7324 -1.8867
 -1.8914 -1.7042 -1.8518 -1.7058 -1.7318 -1.8860

(1 ,.,.) = 
 -1.8776 -1.6937 -1.8465 -1.7505 -1.7217 -1.8775
 -1.8767 -1.6895 -1.8472 -1.7532 -1.7217 -1.8797
 -1.8821 -1.6903 -1.8464 -1.7500 -1.7195 -1.8803

(2 ,.,.) = 
 -1.8620 -1.7102 -1.8386 -1.7112 -1.7629 -1.8800
 -1.8614 -1.7081 -1.8395 -1.7123 -1.7631 -1.8807
 -1.8638 -1.7086 -1.8385 -1.7114 -1.7622 -1.8810

(3 ,.,.) = 
 -1.8820 -1.7209 -1.8325 -1.7252 -1.7310 -1.8736
 -1.8816 -1.7199 -1.8329 -1.7256 -1.7314 -1.8740
 -1.8827 -1.7201 -1.8327 -1.7256 -1.7302 -1.8742

(4 ,.,.) = 
 -1.8532 -1.7232 -1.8492 -1.7212 -1.7408 -1.8761
 -1.8529 -1.7229 -1.8494 -1.7213 -1.7410 -1.8762
 -1.8536 -1.7226 -1.8494 -1.7216 -1.7404 -1.8763
[torch.FloatTensor of size 5x3x6]

I do not understand why the values are not identical along the second (batch) dimension at each time step of my sequence.
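For reference, the check I expected to pass on this output is something like the following, comparing two slices along the batch dimension (with the output above, it fails):

out = predict(model, seq, batch_size=3)
same = torch.equal(out[:, 0].data, out[:, 1].data)   # expected True, but I get False
print(same)
print((out[:, 0] - out[:, 1]).abs().max())           # small but non-zero difference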

Thank you in advance.

To clarify, in case you come across this topic:

The output is different because the hidden states are different. The model does not keep multiple versions of itself for the different inputs.
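In other words, the weights are shared across the whole batch; what can differ per batch element is the initial hidden state (for example if it is carried over from training, or initialised randomly per element). Here is a minimal sketch of the point with a plain nn.LSTM and made-up sizes:

import torch
import torch.nn as nn
from torch.autograd import Variable

lstm = nn.LSTM(input_size=6, hidden_size=4)
batch = Variable(torch.randn(5, 1, 6)).expand(5, 3, 6)   # 3 identical sequences

# Same initial state for every batch element -> identical outputs
h0 = Variable(torch.zeros(1, 3, 4))
c0 = Variable(torch.zeros(1, 3, 4))
out_same, _ = lstm(batch, (h0, c0))
print(torch.equal(out_same[:, 0].data, out_same[:, 1].data))   # True (identical slices)

# Different initial state per batch element -> different outputs,
# even though the input sequences themselves are identical
h0_rand = Variable(torch.randn(1, 3, 4))
c0_rand = Variable(torch.randn(1, 3, 4))
out_diff, _ = lstm(batch, (h0_rand, c0_rand))
print(torch.equal(out_diff[:, 0].data, out_diff[:, 1].data))   # False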