How does PyTorch handle mini-batch training?

After experimenting with mini-batch training of ANNs (the only way to feed a network in PyTorch), and more specifically with RNNs optimised by SGD, it turns out that the “state” of the network (the hidden state for RNNs, and more generally the output of the network for ANNs) has one component, or one state, per mini-batch element. It is therefore not very clear to me how PyTorch trains a neural network on a mini-batch, all the more so when the optimiser is SGD (which, as I understand it, would mean that at each iteration a batch element is picked at random).

Thank you in advance for your answers.

Sorry, but I don’t get it. Could you make it clearer?

It would be better if you could provide a snippet to illustrate your question.

When we pass a mini-batch to an ANN in PyTorch, how is the gradient computed on this mini-batch and how are the network parameters updated?
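To make my question concrete, here is a minimal sketch of what I believe the standard mini-batch update looks like (a made-up linear model and random data, just to fix ideas): the loss is averaged over the batch, a single backward pass fills the gradients of the shared parameters, and optimizer.step() applies one update for the whole mini-batch. Is this the right picture, and does it carry over to an LSTM?

import torch
import torch.nn as nn
from torch.autograd import Variable

# Hypothetical toy model and data, only to illustrate the question.
model = nn.Linear(6, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()  # averages the loss over the batch by default

batch_input = Variable(torch.randn(3, 6))   # mini-batch of 3 samples
batch_target = Variable(torch.randn(3, 2))

optimizer.zero_grad()                    # clear previously accumulated gradients
output = model(batch_input)              # one forward pass over the whole batch
loss = criterion(output, batch_target)   # a single scalar, averaged over the 3 samples
loss.backward()                          # gradients w.r.t. the (shared) parameters
optimizer.step()                         # one update for the whole mini-batch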

In particular, my starting point was my previous post, where I provided a snippet:
Different outputs for identical sequences in a batch

In that thread, I showed (or so I think) that the network has as many versions of its parameters as there are elements in the batch (one version per element along the batch dimension). To reach this conclusion, I trained my LSTM and then passed it a batch made of identical sequences, and the outputs along the batch dimension turned out to be different, even though the elements of the mini-batch passed to the network were identical.
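For reference, this is the kind of check I had in mind, a minimal sketch with a plain nn.LSTM and a zero-initialised hidden state (not my actual model):

import torch
import torch.nn as nn
from torch.autograd import Variable

lstm = nn.LSTM(input_size=6, hidden_size=4)

seq = Variable(torch.randn(5, 1, 6))        # (Sequence, Batch=1, Feature)
batch = torch.cat([seq, seq, seq], dim=1)   # the same sequence repeated 3 times

h0 = Variable(torch.zeros(1, 3, 4))         # (num_layers, Batch, Hidden)
c0 = Variable(torch.zeros(1, 3, 4))

out, _ = lstm(batch, (h0, c0))
# With identical inputs and an identical initial state for every batch
# element, the three slices along the batch dimension should coincide.
print((out[:, 0, :] - out[:, 1, :]).abs().max())
print((out[:, 0, :] - out[:, 2, :]).abs().max())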

What do you mean by “the network has many versions”?

I think I understand.

There should be only one “version” of the network!

The model should produce the same output for the same input.

Could you provide a simple snippet so I can reproduce your issue?

I mean that the network seems to have different parameters for each element of the batch. For example, if I set the batch size to 3 (as in my example), then the network produces one output per element along the batch dimension (the second dimension; in PyTorch the dimensions of an RNN’s input and output are (Sequence, Batch, Feature)).
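Concretely, the shapes look like this (a minimal, made-up sketch just to show the layout):

import torch
import torch.nn as nn
from torch.autograd import Variable

lstm = nn.LSTM(input_size=6, hidden_size=6)

# (Sequence=5, Batch=3, Feature=6): the second dimension indexes the three
# sequences of the mini-batch.
batch_input = Variable(torch.randn(5, 3, 6))
h0 = Variable(torch.zeros(1, 3, 6))
c0 = Variable(torch.zeros(1, 3, 6))

output, (h_t, c_t) = lstm(batch_input, (h0, c0))
print(output.size())  # (5, 3, 6): one output slice per batch element
print(h_t.size())     # (1, 3, 6): the hidden state also has a batch dimension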

The training set:

training_set = [
    [[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0]],
    [[0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 1, 0]],
    [[0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 0]],
    [[0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0]],
    [[0, 0, 0, 0, 1, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]],
    [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1]]]

The training step:

for epoch in range(100):
    # Step 1. Remember that PyTorch accumulates gradients.
    # We need to clear them out before each batch.
    model.zero_grad()

    # Starting each batch, we detach the hidden state from how it was
    # previously produced. If we didn't, the model would try backpropagating
    # all the way to the start of the dataset.
    model.h_t, model.c_t = repackage_hidden((model.h_t, model.c_t))

    # Step 2. Get our inputs ready for the network, that is, turn them
    # into Variables.
    batch_input, batch_targets = prepare_sequences(training_set, labels,
                                                   batch_size)

    # Step 3. Run our forward pass.
    # Predicted target vertices
    batch_outputs = model(batch_input)

    # Step 4. Compute the loss and the gradients, and update the parameters
    # by calling optimizer.step()
    loss = loss_function(batch_outputs, batch_targets)
    loss.backward(retain_graph=True)
    optimizer.step()

The loss function:

def loss_function(preds: autograd.Variable,
                  _batch_targets: autograd.Variable):
    nllloss = nn.NLLLoss()
    loss_seq = nllloss(preds.contiguous().view(-1, 6), _batch_targets.view(-1))
    return loss_seq

I compute the loss function over all the elements of the batch.
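Concretely, assuming batch_outputs has shape (5, 3, 6) and batch_targets has shape (5, 3) as in my training loop, the loss sees one row per (time step, batch element) pair, and nn.NLLLoss averages over all of them by default (the tensors below are made-up stand-ins):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# Stand-ins with the same shapes as in my training loop.
batch_outputs = F.log_softmax(Variable(torch.randn(15, 6))).view(5, 3, 6)
batch_targets = Variable(torch.LongTensor([[0, 1, 2],
                                           [3, 4, 5],
                                           [0, 1, 2],
                                           [3, 4, 5],
                                           [0, 1, 2]]))

preds_flat = batch_outputs.contiguous().view(-1, 6)  # (15, 6)
targets_flat = batch_targets.view(-1)                # (15,)

# nn.NLLLoss averages over all 15 (time step, batch element) pairs by
# default, so a single scalar loss covers the whole mini-batch.
loss = nn.NLLLoss()(preds_flat, targets_flat)
print(loss)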

Finally, I train the LSTM with a batch size of 3, that is to say on the whole dataset. The input then has dimensions (5, 3, 6). Once training is over, I want to use the LSTM on a single sequence (and not on a 3-element batch at a time). That is why I wrote the following function:

def one_input(model, seq: autograd.Variable, batch_size):
    if isinstance(seq, autograd.Variable):
        if len(seq.size()) == 2:
            # (Sequence, Feature) -> (Sequence, Batch=1, Feature)
            seq = seq.view(len(seq), 1, -1)

        sizes = seq.size()
        if sizes[1] == 1:
            # Duplicate the single sequence along the batch dimension
            seq = seq.expand(sizes[0], batch_size, sizes[2])
    else:
        raise TypeError("seq must be an autograd.Variable")

    return model(seq)
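A hypothetical call, with single_seq standing for one sequence of shape (5, 6), i.e. 5 time steps with 6 features (the data here is made up just for the call):

# Made-up single sequence, just for the call.
single_seq = autograd.Variable(torch.eye(6)[:5])   # shape (5, 6)
out = one_input(model, single_seq, 3)

# Inside one_input the sequence is viewed as (5, 1, 6) and then expanded to
# (5, 3, 6), presumably because the model's stored h_t and c_t keep the
# batch dimension of size 3 from training. So `out` again has a batch
# dimension of size 3, even though the three batch slices come from the
# very same sequence.
print(out.size())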

And here is the output after passing a batch containing the first sequence duplicated 3 times, so as to have a batch of size 3:

Variable containing:
(0 ,.,.) = 
 -1.8790 -1.7101 -1.8548 -1.7101 -1.7329 -1.8819
 -1.8773 -1.7029 -1.8542 -1.7156 -1.7324 -1.8867
 -1.8914 -1.7042 -1.8518 -1.7058 -1.7318 -1.8860

(1 ,.,.) = 
 -1.8776 -1.6937 -1.8465 -1.7505 -1.7217 -1.8775
 -1.8767 -1.6895 -1.8472 -1.7532 -1.7217 -1.8797
 -1.8821 -1.6903 -1.8464 -1.7500 -1.7195 -1.8803

(2 ,.,.) = 
 -1.8620 -1.7102 -1.8386 -1.7112 -1.7629 -1.8800
 -1.8614 -1.7081 -1.8395 -1.7123 -1.7631 -1.8807
 -1.8638 -1.7086 -1.8385 -1.7114 -1.7622 -1.8810

(3 ,.,.) = 
 -1.8820 -1.7209 -1.8325 -1.7252 -1.7310 -1.8736
 -1.8816 -1.7199 -1.8329 -1.7256 -1.7314 -1.8740
 -1.8827 -1.7201 -1.8327 -1.7256 -1.7302 -1.8742

(4 ,.,.) = 
 -1.8532 -1.7232 -1.8492 -1.7212 -1.7408 -1.8761
 -1.8529 -1.7229 -1.8494 -1.7213 -1.7410 -1.8762
 -1.8536 -1.7226 -1.8494 -1.7216 -1.7404 -1.8763
[torch.FloatTensor of size 5x3x6]

Although the sequences in the batch are identical, the outputs differ along the batch dimension.

Could you show me the forward function of your model?

def forward(self, _paths: autograd.Variable):
    lstm_out, (self.h_t, self.c_t) = self.lstm(_paths, (self.h_t,
                                                        self.c_t))
    vertices_space = self.hidden2vertex(lstm_out.view(len(_paths),
                                                      self.batch_size, -1))

    next_vertices = F.log_softmax(vertices_space.permute(2, 0, 1)) \
        .permute(1, 2, 0)
    # Permute the dimensions so that the log-softmax is taken over the
    # output-feature dimension, i.e. each 6-element slice along that
    # dimension sums to 1 after exponentiation.

    return next_vertices
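For what it’s worth, I believe the same normalisation over the feature dimension could be written without the permutations, assuming a PyTorch version where F.log_softmax accepts a dim argument:

next_vertices = F.log_softmax(vertices_space, dim=2)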

Seems fine to me :sweat:
Try:

next_vertices = F.log_softmax(vertices_space.permute(2, 0, 1).contiguous()) \
        .permute(1, 2, 0).contiguous()


The outputs for each element of the batch are still different:

Variable containing:
(0 ,.,.) = 
 -2.1165 -1.5604 -1.7455 -2.1431 -2.4632 -1.2294
 -2.1157 -1.5604 -1.7465 -2.1427 -2.4619 -1.2296
 -2.1201 -1.5600 -1.7406 -2.1448 -2.4696 -1.2285

(1 ,.,.) = 
 -2.1057 -1.5615 -1.7602 -2.1379 -2.4439 -1.2321
 -2.1054 -1.5615 -1.7605 -2.1378 -2.4435 -1.2321
 -2.1066 -1.5614 -1.7588 -2.1384 -2.4457 -1.2318

(2 ,.,.) = 
 -2.0106 -1.5763 -1.8993 -2.0947 -2.2711 -1.2610
 -2.0103 -1.5764 -1.8998 -2.0946 -2.2706 -1.2611
 -2.0117 -1.5761 -1.8976 -2.0952 -2.2731 -1.2606

(3 ,.,.) = 
 -2.1092 -1.5611 -1.7553 -2.1396 -2.4503 -1.2312
 -2.1088 -1.5612 -1.7559 -2.1395 -2.4496 -1.2313
 -2.1108 -1.5610 -1.7532 -2.1404 -2.4531 -1.2308

(4 ,.,.) = 
 -2.0189 -1.5747 -1.8864 -2.0983 -2.2865 -1.2580
 -2.0187 -1.5747 -1.8866 -2.0983 -2.2862 -1.2581
 -2.0196 -1.5745 -1.8853 -2.0986 -2.2877 -1.2578
[torch.FloatTensor of size 5x3x6]

Could you provide a full snippet so that I can reproduce your problem on my computer (simplified network, training process removed)?