In a task where I accumulate losses computed at each time-step of a sequence, I can proceed in the following two ways, which I supposed to be equivalent (but they seem not to be):
- Feeding in a single time-step and computing a loss for it, until the sequence is consumed:
```python
# note that the LSTM is initialized with batch_first=True
# the model structure is the common embedding -> lstm -> linear -> softmax
hidden = model.initial_hidden()
loss = 0  # initialize loss for a batch of sequences
for t in range(seq_length):
    output, hidden = model(input[:, t, :], hidden)
    loss += loss_function(output, y[:, t])  # accumulate loss per time-step
loss.backward()
```
Here `input.size()` is `(batch_size, seq_length, input_size)`; `output.size()` is `(batch_size, n_label)` and is the output of `softmax()`; `y[:, t].size()` is `(batch_size)`; and `loss_function` is `NLLLoss()`.
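For concreteness, here is a minimal NumPy stand-in for setting one's loss bookkeeping. The function `nll_loss`, the `log_probs` array, and the random data are hypothetical stand-ins for the real model and for `NLLLoss()` (which, with its default reduction, averages the negative log-likelihood over the batch):

```python
import numpy as np

# Hypothetical stand-in for NLLLoss() with the default 'mean' reduction:
# picks each example's log-probability at its target label and averages.
def nll_loss(log_probs, targets):
    # log_probs: (batch_size, n_label), targets: (batch_size,)
    batch_size = log_probs.shape[0]
    return -log_probs[np.arange(batch_size), targets].mean()

rng = np.random.default_rng(0)
batch_size, seq_length, n_label = 4, 5, 3

# Fake per-step log-probabilities (log-softmax of random logits)
logits = rng.normal(size=(batch_size, seq_length, n_label))
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
y = rng.integers(0, n_label, size=(batch_size, seq_length))

loss = 0.0
for t in range(seq_length):
    loss += nll_loss(log_probs[:, t, :], y[:, t])  # one term per time-step
```

This mirrors the accumulation above: the total loss is a sum of `seq_length` per-step batch means.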
- Feeding in the whole sequence, and accumulating losses in a for loop:
```python
hidden = model.initial_hidden()
output, hidden = model(input, hidden)
loss = 0
for t in range(seq_length):
    loss += loss_function(output[:, t, :], y[:, t])
loss.backward()
```
In this setting, `output.size()` is `(batch_size, seq_length, n_label)`.
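Given identical per-step outputs, the two accumulation orders sum exactly the same terms. A sketch that checks this numerically (again with a hypothetical `nll_loss` stand-in for `NLLLoss()` rather than the actual model):

```python
import numpy as np

# Hypothetical stand-in for NLLLoss() with the default 'mean' reduction
def nll_loss(log_probs, targets):
    batch_size = log_probs.shape[0]
    return -log_probs[np.arange(batch_size), targets].mean()

rng = np.random.default_rng(1)
batch_size, seq_length, n_label = 4, 5, 3
logits = rng.normal(size=(batch_size, seq_length, n_label))
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
y = rng.integers(0, n_label, size=(batch_size, seq_length))

# setting two: slice each step out of the full-sequence output
loss_two = sum(nll_loss(log_probs[:, t, :], y[:, t]) for t in range(seq_length))

# setting one, restated: process one step at a time
loss_one = 0.0
for t in range(seq_length):
    step_out = log_probs[:, t, :]  # stand-in for model(input[:, t, :], hidden)
    loss_one += nll_loss(step_out, y[:, t])
```

Here `loss_one` and `loss_two` are identical, so any difference between the two settings would have to come from the model producing different per-step outputs, not from the loss accumulation itself.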
So my question is whether these two settings are equivalent. In my experiment, setting one is four times slower than setting two, but its accumulated loss is about two times lower (observed over the first two epochs). Could someone help me out by indicating the essential differences, if any, between the two settings?