Dealing with 2D tensors of different shape in a batch size greater than 1 without using zero-filling or sampling

Apart from zero-filling and sampling, is there a less aggressive way to handle 2D tensors whose first dimension varies between 11 and 8000 and whose second dimension is always 512, in a batch size greater than 1 (ideally a batch size of 64) in PyTorch?

For example, if the batch size is 4, I could have a list of such 2D tensors in a batch:

[200 * 512, 1000 * 512, 23 * 512, 7000 * 512]
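
As a minimal sketch of what I mean (random data standing in for my real tensors):

import torch

# a ragged "batch" of four 2D tensors with different first dimensions
batch = [torch.randn (n, 512) for n in (200, 1000, 23, 7000)]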

Hi Mona!

I assume that your use case is to pass a batch of tensors of different sizes
through your model, calculate a loss, backpropagate, and then optimize
your model’s parameters.

You will not be able to pass your batch of tensors through your model as
a single batch (unless you zero-fill or sample, etc.) because pytorch does
not support such “ragged” tensors.

So you will have to pass your individual tensors through your model one
by one* (probably giving them a leading singleton batch dimension), at the
possible detriment of not making fully efficient use of your gpu (and / or cpu)
pipelines.

You could loop over the individual samples in your batch, passing them
through your model, accumulate the per-sample losses into a batch_loss,
backpropagate once by calling batch_loss.backward(), and then call
opt.step().

Or you could call loss.backward() for each sample separately (.backward()
accumulates the gradients until you call .zero_grad()) and then call
opt.step(), which will act on the accumulated gradients.

Or you could backpropagate and call opt.step() for each individual sample
in the “batch.” (The extent to which batches of samples help the optimization
process is a nuanced question, but batches do help keep your gpu pipelines
full.)

In general, I would recommend the first approach (where you loop over
the samples for the forward passes, but then backpropagate and optimize
just once per batch).
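
As a rough sketch of that first approach (model, loss_fn, opt, nBatches, and
get_next_batch_list() are stand-ins for whatever you are actually using;
get_next_batch_list() is assumed to return a list of per-sample tensors and
a matching list of targets):

for i in range (nBatches):
    input, target = get_next_batch_list()
    opt.zero_grad()
    batch_loss = 0.0
    for j in range (len (input)):   # forward passes, one sample at a time
        pred = model (input[j].unsqueeze (0))   # leading singleton batch dimension
        batch_loss = batch_loss + loss_fn (pred, target[j])
    batch_loss = batch_loss / len (input)   # optionally average over the batch
    batch_loss.backward()   # backpropagate once for the whole batch
    opt.step()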

*) If some of the samples in your training set have the same size – let’s
say that you can find four samples in your training set that all have shape
[200, 512] – you could package like-sized samples together into non-ragged
batch tensors – in this example case, a batch of shape [4, 200, 512] – and
pass them as single tensors through your model. You might not get the full
benefit of having a batch size of 64, but you will still likely make better use
of your gpu than if you had passed the four samples through your model
separately.
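
A minimal sketch of that bucketing idea (samples here is a hypothetical list
of your 2D tensors, and model is whatever model you are using):

from collections import defaultdict
import torch

buckets = defaultdict (list)
for sample in samples:
    buckets[sample.shape[0]].append (sample)   # group by the variable first dimension

for length, group in buckets.items():
    batch = torch.stack (group)   # non-ragged tensor of shape [len (group), length, 512]
    pred = model (batch)          # one forward pass for the whole like-sized group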

Best.

K. Frank

Hi Frank, thanks for your response. Is there a code snippet you could point me to that passes samples through with a batch size of one and accumulates the gradients before the optimizer step when the tensors in a batch are not of the same size?

Hi Mona!

I don’t know of an example off-hand, but it could be as simple as something
like this:

for i in range (nBatches):   # for example, this could be the number of batches in an epoch
    input, target = get_next_batch_list()   # a batch is a list of single samples (and labels)
    opt.zero_grad()
    for j in range (len (input)):   # loop over samples in the batch
        pred = model (input[j].unsqueeze (0))   # pass one sample, with a leading singleton batch dimension
        loss = loss_fn (pred, target[j])   # per-sample loss
        loss.backward()   # accumulates gradient for this single sample
    opt.step()   # takes step using gradient accumulated over batch

Just to reiterate:

input can’t be a single tensor that contains a batch of samples, because,
according to the problem at hand, the samples have differing shapes (so
such a batch tensor would be “ragged”). Instead, input is a list of
individual sample tensors that are passed through model one by one
as we loop over the input list.

Best.

K. Frank

Thanks a lot for your response.

I have two technical questions for you:

  1. Do you call this method “gradient accumulation”?
  2. Is there a built-in mechanism for “gradient accumulation” in PyTorch or PyTorch Beta?

Hi Mona!

I don’t know whether “gradient accumulation” is an officially-defined
term. What I’m referring to is the fact that loss.backward() adds
the gradients it computed into the .grad properties of the various
Parameters.

(This is the built-in behavior of .backward(), which is why we normally
call something like opt.zero_grad() before we call loss.backward().
Normally we don’t want to accumulate the newly-computed gradient into
some previously-computed gradient that hadn’t been cleared out yet.)

Yes, in the sense that .backward() accumulates the gradients, as
described above.
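
As a small illustration of that accumulation (a toy Linear layer standing in
for a real model):

import torch

lin = torch.nn.Linear (2, 1)
x = torch.ones (1, 2)

lin (x).sum().backward()
print (lin.weight.grad)   # gradient from the first backward()

lin (x).sum().backward()
print (lin.weight.grad)   # twice as large, because the second backward() added into .grad

lin.weight.grad.zero_()   # clear the accumulated gradient (opt.zero_grad() does this for all parameters)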

Best.

K. Frank
