I’ve noticed that one of my models changes its output/prediction if I change the batch size (number of sequences). After some tests, I’ve found out that nn.Linear layers give slightly different results for the same input values if the batch size is changed.
Running the following code:
import torch
import torch.nn as nn

seed = 101
torch.manual_seed(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

f = nn.Linear(1024, 64).cuda()
x = torch.ones(16, 300, 1024).float().cuda()

for n in range(x.shape[0]):
    y = f(x)[:n + 1]    # forward the full batch, then slice
    z = f(x[:n + 1])    # slice first, then forward the smaller batch
    print(torch.equal(y, z), torch.abs(y - z).mean().item())
The same is also true for nn.Conv1d layers, though seemingly not for nn.Conv2d layers. Setting torch.backends.cudnn.enabled = False seemingly fixes the issue for nn.Conv1d, but not for nn.Linear layers. Finally, this effect only appears on the GPU, not when those calculations are run on the CPU.
Is this expected behavior (due to some float precision issues) or a bug?
pytorch 1.9.0 (py3.8_cuda11.1_cudnn8_0) if that makes a difference.
This would be expected, as different batch sizes will likely dispatch to different algorithms that can have different numerical behavior (e.g., due to differences in how reductions are ordered).
I observe similar behavior with float32 on sm_86, but e.g. changing the workspace size via CUBLAS_WORKSPACE_CONFIG=:16:8 seems to reduce the number of batch sizes that produce slightly different results, and using float64 results in equal comparisons throughout. In general, however, there are no guarantees of identical behavior across different batch sizes; deterministic mode is only intended to guarantee reproducible results for the same batch size.
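For reference, a minimal sketch of setting this variable from inside a script. Note this is an assumption about ordering: the variable must be in the environment before cuBLAS is initialized, so it has to happen before the first CUDA call (safest: before importing torch, or by setting it when launching the process).

```python
import os

# CUBLAS_WORKSPACE_CONFIG must be set before cuBLAS is initialized,
# i.e. before the first CUDA call (safest: before `import torch`).
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

# Alternatively, set it when launching the script:
#   CUBLAS_WORKSPACE_CONFIG=:16:8 python my_script.py
```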
Thank you for the quick answer!
Using :16:8 (instead of my previous :4096:8) indeed reduced the issue quite a lot (at least for nn.Linear). Going from float32 to float64 reduced the numerical differences by several orders of magnitude, but did not eliminate them, at least on my hardware (RTX 3060).
As far as I understand, I would then need to always enforce the same batch size as used during training, even when predicting only a single sample (i.e. pad with dummy samples). Otherwise I risk that:
a) I get worse performance, because the model performs different calculations than it did during training.
b) I get different results for the same sample, depending on how many other samples I predict in the same batch.
No, usually you wouldn’t expect a) or b), since the limited numerical precision is expected and neither value is the “true” value, as both depend on which values are representable in the finite floating-point format.
You can read more about it in this Wikipedia article.
If your model is sensitive to the expected numerical noise, you might need to use a wider dtype, such as float64.
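To illustrate why neither result is the “true” value: floating-point addition is not associative, so summing the same numbers in a different order (as a different batch-size kernel might) gives a slightly different result. A small CPU-only sketch of the effect (unrelated to any specific cuBLAS kernel):

```python
import numpy as np

# Floating-point addition is not associative: the same three numbers summed
# in two different orders give different float64 results.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # -> False

# The same effect in float32: one flat reduction vs. chunked partial sums,
# mimicking kernels that order their reductions differently.
rng = np.random.default_rng(0)
vals = rng.standard_normal(10_000).astype(np.float32)
flat = vals.sum(dtype=np.float32)
chunked = np.float32(sum(c.sum(dtype=np.float32) for c in np.split(vals, 100)))
print(flat, chunked, abs(float(flat) - float(chunked)))
```

Both results are valid float32 approximations of the exact sum; which one you get depends purely on the reduction order the kernel happens to use.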