I have a step in my model which is very expensive because it does an operation over the entire vocabulary. Let’s say the step is as follows:
input is of dimension bsz x seq_len x d
output is of dimension bsz x seq_len x k
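For concreteness, the step looks roughly like this (a simplified sketch, not my exact code; the top-k at the end is only there to show where the bsz x seq_len x k output comes from):

import torch
import torch.nn as nn

class VocabProcess(nn.Module):
    # simplified illustration of the expensive step
    def __init__(self, d, vocab_size, k):
        super().__init__()
        self.proj = nn.Linear(d, vocab_size)
        self.k = k

    def forward(self, x):                         # x: bsz x seq_len x d
        logits = self.proj(x)                     # bsz x seq_len x vocab_size (the expensive part)
        scores, _ = logits.topk(self.k, dim=-1)
        return scores                             # bsz x seq_len x k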
I found that a batch size of 128 works well for this step, but anything larger OOMs. I am looking for ways to make model evaluation faster, so I am wondering if I can set the eval batch size to 1024 and, in each iteration, instead of calling vocab_process once, call it 8 times on 128-sized slices:
outputs = []
for i in range(8):
    output = self.vocab_process(input[i*128:(i+1)*128])
    outputs.append(output)
model_out = torch.cat(outputs, dim=0)   # concatenate along the batch dim to get back 1024 x seq_len x k
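Or equivalently with torch.split, which I think would also handle an eval batch that isn't an exact multiple of 128:

outputs = []
for chunk in torch.split(input, 128, dim=0):   # 128-sized chunks, plus a smaller tail if needed
    outputs.append(self.vocab_process(chunk))
model_out = torch.cat(outputs, dim=0)          # concatenate back along the batch dimension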
Also, one more thing I'm considering is, inside the implementation of vocab_process, calling del tensor_name to free up memory once I don't need a tensor anymore. How does calling del compare to making everything in-place?
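To make the comparison concrete, this is roughly what I mean by the two options (tensor names and shapes below are placeholders, not my real code):

import torch

hidden = torch.randn(2, 4, 8)        # placeholder for a bsz x seq_len x d activation
vocab_weight = torch.randn(100, 8)   # placeholder for a vocab_size x d projection

# Option 1: del -- drop the Python reference so the caching allocator can reuse that memory
logits = hidden @ vocab_weight.t()   # large bsz x seq_len x vocab_size intermediate
probs = torch.softmax(logits, dim=-1)
del logits                           # logits is no longer referenced, so its block can be reused

# Option 2: in-place -- overwrite existing storage instead of allocating new tensors
hidden.mul_(0.5)                     # in-place scale, no new allocation
hidden.relu_()                       # in-place activation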
Are these two techniques reasonable? Are there better/cleaner ways to do this?