I’m working on a stochastic method that acts row-wise on a tensor, and I’ve noticed some strange behavior on CPU vs. GPU.
Suppose I use an MSE loss:
sqloss = torch.nn.MSELoss()
Then I get the following timing results. Z requires a gradient, M does not. For now, I’m sampling idx at each iteration. I’ve also tried putting Z and M into a torch.utils.data.TensorDataset(Z, M), to no avail.
Z = torch.rand(1000,1000)
Z.requires_grad_(True)
M = torch.rand(1000, 1000)
idx = torch.randint(Z.size(0), size=(10,))
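For completeness, the GPU runs use the same tensors moved to the device first, roughly like this (the sketch falls back to CPU when CUDA is unavailable):

```python
import torch

sqloss = torch.nn.MSELoss()

Z = torch.rand(1000, 1000)
Z.requires_grad_(True)
M = torch.rand(1000, 1000)
idx = torch.randint(Z.size(0), size=(10,))

# Move everything to the GPU when one is present, otherwise stay on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Z_dev = Z.to(device)
M_dev = M.to(device)
idx_dev = idx.to(device)

full = sqloss(Z_dev, M_dev)                    # loss over the full 1000x1000 tensor
sub = sqloss(Z_dev[idx_dev], M_dev[idx_dev])   # loss over 10 sampled rows
print(full.item(), sub.item())
```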
- GPU:
In [6]: %timeit sqloss(Z, M)
49 µs ± 7.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit sqloss(Z[idx], M[idx])
222 µs ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- CPU:
In [10]: %timeit sqloss(Z, M)
612 µs ± 167 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [11]: %timeit sqloss(Z[idx], M[idx])
52.9 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is due to the indexing:
- GPU:
In [21]: %timeit Z[idx]
96.3 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- CPU:
In [25]: %timeit Z[idx]
13 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
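One caveat when reading the GPU numbers above: CUDA kernels launch asynchronously, so %timeit can mix launch overhead and actual work unless the device is synchronized. A minimal timing sketch with explicit synchronization (the timed helper is my own, not part of the original measurements; it falls back to CPU when CUDA is unavailable):

```python
import time
import torch

def timed(fn, repeat=100):
    # Synchronize before and after on CUDA, since kernel launches
    # return to Python before the work has actually finished.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeat

sqloss = torch.nn.MSELoss()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Z = torch.rand(1000, 1000, device=device, requires_grad=True)
M = torch.rand(1000, 1000, device=device)
idx = torch.randint(Z.size(0), size=(10,), device=device)

t_full = timed(lambda: sqloss(Z, M))
t_sub = timed(lambda: sqloss(Z[idx], M[idx]))
print(f"full: {t_full * 1e6:.1f} us, subsampled: {t_sub * 1e6:.1f} us")
```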
Is this behavior expected? Why is the subsampled loss much faster than the full loss on CPU but not on GPU? Is there any way to speed this up?