Training on single samples and running inference on batches

Hello,

I’ve got a question: is it possible to train a model on single samples (batch size = 1) and then run inference on this model using batches?
When I changed to batch_size = 1, I had to remove the BatchNorm1d layers from my model because of errors. But I think I will need them when I want to run inference with batches after training.
I can’t train with batches because I’m training on incomplete data, so I’m not always “turning on” all heads. What do you suggest?

Best regards

This should be possible, as the batch size during inference can be arbitrary, as long as you call model.eval() and don’t use custom layers which depend on the batch size.

I assume the batchnorm layers raise errors, as they cannot calculate the batch statistics from a single value per channel?
Why would you need the batchnorm layers during inference, if they weren’t used during training?
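
A small snippet like this should show both behaviors, i.e. the failure in train() mode with a single sample and the arbitrary batch size in eval() mode (the feature size is just a placeholder):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(8)        # 8 features, placeholder size
x = torch.randn(1, 8)         # single sample, i.e. batch size of 1

bn.train()
try:
    bn(x)                     # fails, as the batch statistics cannot be computed
except ValueError as e:       # from a single value per channel
    print(e)

bn.eval()
print(bn(x).shape)                    # works, the running statistics are used
print(bn(torch.randn(16, 8)).shape)   # an arbitrary batch size works as well
```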

Could you explain this issue a bit? I’m not sure what the heads refer to and maybe it would be possible to increase the batch size somehow.

@ptrblck, thank you for your answer!

I’ll try to explain how I came up with that idea.

So, I’ve got a model in which I would like to have about 100 different output layers (“heads”), each of them with a few features. Each output layer (“head”) is based on the input coming from the same layer “X”.
Now the tricky part: because of a problem with the data, which doesn’t always have “real” target values to compare against the model’s outputs in the loss function, I’m trying to “turn on” only those output layers for which the current sample has data.
But now (while I was writing this) I found another problem: when I calculate the loss from these “turned on” output layers and then call loss.backward() and optimizer.step(), doesn’t it backpropagate through all of the output layers, not only through the “turned on” ones that the data has passed through? Is there a possibility to backpropagate and optimize only through those output layers (“heads”) which are turned on during training for each sample?

Because of this “turning on” of different output layers, I had to remove the normalization and train on single samples (bs=1). But as I think about it now: if I train my model on single samples and without normalization, I can still run inference with batches (without batch normalization), can’t I?

Best regards

The computation graph will only include the layers (and their parameters) which were used during the forward pass, so only these parameters will get valid gradients when you call backward().
The parameters of the unused layers should not get any gradients (their .grad attribute will stay None). Could you check it, please?
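
With this small sketch (the layer names and shapes are made up) you should see that the unused head’s .grad stays None:

```python
import torch
import torch.nn as nn

class TwoHeadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.head_a = nn.Linear(4, 2)
        self.head_b = nn.Linear(4, 2)

    def forward(self, x, use_a=True):
        x = self.backbone(x)
        # only one head is used in the forward pass
        return self.head_a(x) if use_a else self.head_b(x)

model = TwoHeadModel()
out = model(torch.randn(1, 4), use_a=True)
out.mean().backward()

print(model.head_a.weight.grad)  # populated, head_a was used
print(model.head_b.weight.grad)  # None, head_b was not part of the computation graph
```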

I don’t completely understand how you decide which “heads” are used for the current input.
If you use some specific statistic of the current activation, you could probably apply this check to the complete batch and send each sample to the corresponding head.
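
Something like this rough sketch, where a made-up per-sample condition routes each sample of a batch to one of two heads:

```python
import torch
import torch.nn as nn

features = torch.randn(8, 16)                        # batch of activations from the shared layer "X"
condition = torch.tensor([0, 1, 0, 0, 1, 1, 0, 1])   # made-up rule: 0 -> head_a, 1 -> head_b

head_a = nn.Linear(16, 3)
head_b = nn.Linear(16, 3)

out = torch.zeros(8, 3)
mask_a = condition == 0
mask_b = condition == 1
out[mask_a] = head_a(features[mask_a])   # only the selected samples pass through each head
out[mask_b] = head_b(features[mask_b])

out.mean().backward()                    # each head only gets gradients from its own samples
```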

Thank you for your answer.

Ok, I’ll check it, but if you say so, it’s probably the case and it was just my misunderstanding.

I don’t completely understand how you decide which “heads” are used for the current input.
If you use some specific statistic of the current activation, you could probably apply this check to the complete batch and send each sample to the corresponding head.

I’m providing keys to the forward pass of the model, and inside it I’ve got some conditions. After training and before inference I will change that: I will remove the “keys” from the forward pass and add a condition where “turning on” the output layers will be based on the output of one specific layer (let’s say “Y”).
Example: if layer “Y” outputs the value 1, it will turn on output layers 1, 3 and 5, and so on.
Because of the problem with the data I don’t know how, or even if, I can implement training with batches: each sample can “turn on” different layers, so I provide a different number of “keys” to the model and get a different number of outputs. I can’t batch the samples, because the input tensors (and, on the other side, the output tensors) have to be of the same size, don’t they? In my case each sample can have a different number of “keys” and outputs, so the tensors could not be “merged” into a batch.
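
Roughly, my current setup looks like this simplified sketch (all names, sizes, and the number of heads are just illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadModel(nn.Module):
    def __init__(self, num_heads=100, in_features=32, hidden=64, head_features=4):
        super().__init__()
        self.shared = nn.Linear(in_features, hidden)   # layer "X"
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, head_features) for _ in range(num_heads)]
        )

    def forward(self, x, keys):
        x = torch.relu(self.shared(x))
        # only the heads listed in `keys` are executed for this sample
        return {k: self.heads[k](x) for k in keys}

model = MultiHeadModel()
# two samples with a different number of "keys" -> different number of outputs
out1 = model(torch.randn(1, 32), keys=[1, 3, 5])
out2 = model(torch.randn(1, 32), keys=[2, 7])
print(len(out1), len(out2))   # 3 vs. 2 outputs, so the samples cannot simply be stacked
```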

I think I can handle training on single samples (bs=1), accepting the increased training time, but will it have any impact later on the batched inference, or should it be fine?

Thanks for the information about the different shapes.
If you cannot create a batch of inputs and concatenate the outputs, the easiest way would be to use a batch size of 1 as you’ve mentioned, until nested tensors are available.

Batched inference should not change the behavior of the model, but I guess you would face the shape mismatch again. How would the batching and concatenation work during inference, if they are not possible during training?
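
As a rough fallback sketch (reusing the hypothetical MultiHeadModel from your post above), the per-sample inference could then look like this:

```python
import torch

model = MultiHeadModel()   # the hypothetical model sketched in the previous post
model.eval()

# made-up data: each sample comes with its own set of "keys"
samples = [(torch.randn(1, 32), [1, 3, 5]),
           (torch.randn(1, 32), [2, 7])]

results = []
with torch.no_grad():
    for x, keys in samples:
        results.append(model(x, keys))   # one dict of head outputs per sample

print([sorted(r.keys()) for r in results])
```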