I came across this post and I’m facing a similar issue with PyTorch where I don’t know what would be the best approach:
I am working on a network that is supposed to run in real time on the CPU during inference, taking a noisy audio chunk of 512 samples as input and producing a clean version of the same chunk (512 samples) as output. The network uses LSTM layers which, as I understand the documentation, require a known batch size during training, with input of dimensions (seq_len, batch_size, input_size), which in my case would be (1, 1, 512): https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
Ideally I would like to train the network on batches bigger than 1 (e.g. batch_size=32) but use the model during inference with batch_size=1, since it will be running in real time on single input frames of 512 samples. Also, the model has to be stateful. As far as I understand, this can be implemented by detaching the output hidden and cell states and feeding them back in as the initial hidden and cell states of the next batch. Is that correct? Along with this one, my other questions would be:
1. Is it possible in PyTorch to train an LSTM with batch_size=32 and then use the same saved model to do inference in single steps with batch_size=1 (i.e. train on (1, 32, 512) and run inference on (1, 1, 512), assuming batch_first=False)? If so, how? And could the predictions be affected by choosing a different batch_size during inference?
2. Would it be okay to just do both training and inference with batch_size=1 and use the model like that, or would that be expected to be less time efficient than (1) during training? Overall, my main concern is to obtain the least CPU-intensive model during inference.
Sorry if some questions may have an easy answer, but I couldn’t find much information about similar scenarios. Any guidelines are highly appreciated.
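For reference, here is a minimal sketch of the setup described above, under stated assumptions: the `Denoiser` class, its `hidden_size=128`, and the random tensors standing in for audio frames are all hypothetical. It trains on (1, 32, 512) batches while detaching the carried state between steps, then runs the same weights frame by frame with batch_size=1. `nn.LSTM` weights do not depend on the batch size, so only the shape of the state passed in has to match the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Denoiser(nn.Module):
    """Hypothetical model: one LSTM layer followed by a linear projection."""
    def __init__(self, frame_size=512, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTM(frame_size, hidden_size)  # batch_first=False
        self.proj = nn.Linear(hidden_size, frame_size)

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)
        return self.proj(out), state

model = Denoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stateful training with batch_size=32: carry the state across steps,
# but detach it so gradients do not flow into earlier steps.
state = None  # None means zero initial hidden and cell states
for _ in range(2):
    noisy = torch.randn(1, 32, 512)   # random stand-in for noisy frames
    clean = torch.randn(1, 32, 512)   # random stand-in for clean targets
    out, state = model(noisy, state)
    loss = nn.functional.mse_loss(out, clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    state = tuple(s.detach() for s in state)
train_out, train_state = out, state

# Streaming inference with batch_size=1 using the same weights.
model.eval()
frames = torch.randn(5, 1, 1, 512)    # 5 consecutive frames of shape (1, 1, 512)
state = None                          # fresh state with batch dimension 1
outputs = []
with torch.no_grad():
    for frame in frames:
        out, state = model(frame, state)
        outputs.append(out)
    stream_out = torch.cat(outputs, dim=0)          # (5, 1, 512)
    # Feeding the 5 frames as one sequence should give the same result.
    full_out, _ = model(frames.reshape(5, 1, 512))
```

Note that the state left over from training has a batch dimension of 32, so it cannot be passed directly to a batch_size=1 call; the sketch simply starts inference from a zero state.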
What would happen in that case if I were to make my initial hidden states learnable, as suggested here? I am confused because now a whole bunch of time steps are in the same batch during training. I assume I should just copy the hidden state values to the new instance, but I am not sure whether the ones to copy are the topmost (i.e. [:, 0, :]) or the bottommost (i.e. [:, batch_size - 1, :]).
If the sequences within and between batches are independent, then it shouldn’t really matter which of the 32 hidden states you pick, whether the topmost, the bottommost, or any other one. Arguably, if you train for a large number of epochs with enough shuffling, the 32 hidden states should end up fairly similar.
Or you simply run 32 inferences on the same input sequence, one with each of the 32 available hidden states, and see if it affects the result in any way :).
Thanks again @vdw! What could be a sound approach in the case of sequences that are dependent within a batch but not between batches? (Basically a single audio file chopped into smaller sequential frames per batch.)
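One way to sidestep the question of which row to copy, sketched here as an assumption rather than as the method from the linked suggestion: store a single learnable initial state with a batch dimension of 1 and expand it to whatever batch size the input has. The class name `LSTMWithLearnedInit` and the sizes are hypothetical.

```python
import torch
import torch.nn as nn

class LSTMWithLearnedInit(nn.Module):
    """Hypothetical wrapper: an LSTM with one learnable (h0, c0) pair."""
    def __init__(self, input_size=512, hidden_size=128, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        # One learnable state per layer, stored with batch dimension 1.
        self.h0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))

    def initial_state(self, batch_size):
        # Broadcast the single learned state to every sequence in the batch.
        h0 = self.h0.expand(-1, batch_size, -1).contiguous()
        c0 = self.c0.expand(-1, batch_size, -1).contiguous()
        return h0, c0

    def forward(self, x, state=None):
        if state is None:
            state = self.initial_state(x.size(1))  # x is (seq_len, batch, input)
        return self.lstm(x, state)

model = LSTMWithLearnedInit()
out32, (h32, c32) = model(torch.randn(1, 32, 512))  # training-style batch
out1, (h1, c1) = model(torch.randn(1, 1, 512))      # single-frame inference
```

With this layout the learned state is the same for every row of the batch, so there is no topmost or bottommost entry to choose from at inference time.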
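A rough sketch of that experiment, assuming a bare `nn.LSTM` with a hypothetical hidden size of 128 and random tensors in place of real audio. It slices each of the 32 states out of a (1, 32, 128) state tensor, runs the same single frame with each, and measures how much the outputs differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=512, hidden_size=128)

with torch.no_grad():
    # Stand-in for the 32 states left over from the last training batch.
    _, (h_n, c_n) = lstm(torch.randn(1, 32, 512))

    frame = torch.randn(1, 1, 512)  # the same input for every trial
    outs = []
    for i in range(h_n.size(1)):
        # Slice with i:i+1 to keep the batch dimension (size 1).
        state = (h_n[:, i:i + 1, :].contiguous(),
                 c_n[:, i:i + 1, :].contiguous())
        out, _ = lstm(frame, state)
        outs.append(out)

    outs = torch.cat(outs, dim=1)            # (1, 32, 128)
    spread = outs.std(dim=1).max().item()    # largest per-feature std across states

print(f"max std across the 32 initial states: {spread:.4f}")
```

On an untrained network like this one the number only demonstrates the mechanics; the comparison becomes meaningful with the trained model and real frames.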
Did you figure out how to solve it? I have the same problem.
I have a solution, but I am not sure whether it is correct. What about copying the input 32 times to build a batch of 32 identical entries, feeding that to the model, and finally averaging the outputs over the batch? Do you think that works?
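A quick way to check this idea, sketched with a bare `nn.LSTM` (hypothetical hidden size 128) and a random frame: each batch entry is processed independently, so if all 32 copies also start from the same initial state (the default zero state here), every row of the output should be identical and the average should match a plain batch_size=1 call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=512, hidden_size=128)
frame = torch.randn(1, 1, 512)  # one frame, batch_size=1

with torch.no_grad():
    single, _ = lstm(frame)                      # (1, 1, 128)
    repeated, _ = lstm(frame.repeat(1, 32, 1))   # same frame copied 32 times
    averaged = repeated.mean(dim=1, keepdim=True)

# Compare the averaged batch output against the single-frame output.
same = torch.allclose(averaged, single, atol=1e-5)
print("average of 32 copies matches batch_size=1:", same)
```

Under these assumptions the averaging adds compute on the CPU without changing the output. It would only make a difference if the 32 copies were started from 32 different hidden states.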