The example code is:
# input_size: N * 1024 * 1
output = torch.squeeze(input)
# output_size: N * 1024
where N is batch_size.
- When the code runs in single-GPU mode, the output size is correct, i.e., N * 1024.
- However, when using multiple GPUs (for example, 4 GPUs), the output size is weird:
  - When N is large, e.g. N=64, the output size is correct.
  - When N is small, e.g. N=2, the output size becomes 2048…
PS: when using output = input.squeeze(-1), the output is always correct.
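A plausible explanation (an assumption, not confirmed in the post): nn.DataParallel splits the batch across replicas, so with N=2 on 4 GPUs each active replica receives a chunk of batch size 1. A bare squeeze() then removes the batch dimension along with the trailing one, and the gather step concatenates the resulting 1-D tensors into a single length-2048 tensor. This CPU-only sketch reproduces the shapes:

```python
import torch

# Simulate the per-replica chunk nn.DataParallel would create when the
# global batch N=2 is split across 4 GPUs: each active replica sees batch 1.
chunk = torch.randn(1, 1024, 1)

# squeeze() removes *every* size-1 dimension, including the batch dim.
out_all = torch.squeeze(chunk)
print(out_all.shape)   # torch.Size([1024])

# squeeze(-1) removes only the trailing singleton dim, preserving the batch.
out_last = chunk.squeeze(-1)
print(out_last.shape)  # torch.Size([1, 1024])

# Gathering two such 1-D outputs concatenates them into length 2048,
# matching the reported output size for N=2.
gathered = torch.cat([torch.squeeze(torch.randn(1, 1024, 1)) for _ in range(2)])
print(gathered.shape)  # torch.Size([2048])
```

This is why specifying the dimension, as in input.squeeze(-1), stays correct regardless of the per-GPU batch size.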