Speedups from subsampling rows: CPU vs GPU

I’m working on a stochastic method that acts row-wise on a tensor, and I’ve noticed some strange behavior on CPU vs GPU.

Suppose I use an MSE loss:

sqloss = torch.nn.MSELoss()

Then I get the timing results below. Z requires a gradient, M does not. For now, I sample idx fresh at each iteration. I’ve also tried putting Z and M into a torch.utils.data.TensorDataset(Z, M) (see the sketch after the setup code), to no avail.

Z = torch.rand(1000,1000)
Z.requires_grad_(True)
M = torch.rand(1000, 1000)
idx = torch.randint(Z.size(0), size=(10,))
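
For reference, the TensorDataset attempt looked roughly like this (a sketch; the DataLoader settings such as batch_size and shuffle are assumptions, not necessarily what I ran):

from torch.utils.data import TensorDataset, DataLoader

# Pair up rows of Z and M so that dataset[i] returns (Z[i], M[i])
dataset = TensorDataset(Z, M)
# Batch of 10 rows to match the subsample size; settings are illustrative
loader = DataLoader(dataset, batch_size=10, shuffle=True)

z_batch, m_batch = next(iter(loader))  # one random batch of 10 rows
loss = sqloss(z_batch, m_batch)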

Timing the full loss and the subsampled loss with the sqloss defined above:

  • GPU:
In [6]: %timeit sqloss(Z, M)
49 µs ± 7.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: %timeit sqloss(Z[idx], M[idx])
222 µs ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • CPU:
In [10]: %timeit sqloss(Z, M)
612 µs ± 167 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [11]: %timeit sqloss(Z[idx], M[idx])
52.9 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
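
For the GPU case the same tensors live on the device; roughly like this (a sketch — the variable names and the explicit synchronize calls are mine, added because CUDA kernels launch asynchronously and %timeit can otherwise be misleading):

# Sketch of the GPU setup (illustrative names)
device = torch.device("cuda")
Zg = torch.rand(1000, 1000, device=device, requires_grad=True)
Mg = torch.rand(1000, 1000, device=device)
idx_g = idx.to(device)  # keep the index tensor on the same device

torch.cuda.synchronize()             # finish pending kernels before timing
loss = sqloss(Zg[idx_g], Mg[idx_g])
torch.cuda.synchronize()             # wait for the loss kernel to complete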

The CPU/GPU discrepancy seems to come from the indexing itself:

  • GPU:
In [21]: %timeit Z[idx]
96.3 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • CPU:
In [25]: %timeit Z[idx]
13 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
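
For reference, the row gather Z[idx] should be equivalent to torch.index_select, in case that matters for an answer (sketch):

# Two equivalent ways to gather the sampled rows; both allocate a new (10, 1000) tensor
rows_a = Z[idx]                          # advanced (fancy) indexing
rows_b = torch.index_select(Z, 0, idx)   # explicit gather along dim 0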

Is this behavior expected? Why is the subsampled loss much faster than the full loss on CPU but not on GPU? Is there any way to speed this up?