torch.istft spends almost all its time in the NOLA check.
Below is a screenshot of a profile of stft followed by istft on a signal of shape (32, 4, 2, 7 * 44100). This is just a toy example, but when the istft is part of a larger neural network, it forces everything to synchronize.
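For reference, a minimal repro along these lines (the n_fft/hop_length values here are arbitrary placeholders, not necessarily what was profiled; torch.stft only accepts 1D/2D input, so the leading dims are flattened first):

```python
import torch

x = torch.randn(32, 4, 2, 7 * 44100, device="cuda")
n_fft, hop = 1024, 256  # placeholder STFT parameters
window = torch.hann_window(n_fft, device="cuda")

# torch.stft expects 1D or 2D input, so flatten the leading dims.
spec = torch.stft(x.reshape(-1, x.shape[-1]), n_fft, hop_length=hop,
                  window=window, return_complex=True)

# The NOLA check inside istft is what dominates the profile.
y = torch.istft(spec, n_fft, hop_length=hop, window=window)
```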
I found the cause of the synchronization (see the code trail below), but I am wondering whether there is any way to avoid it.
From ATen/native/SpectralOps.cpp#L1154:

```cpp
if (at::is_scalar_tensor_true(window_envelop_lowest)) {
  std::ostringstream ss;
  REPR(ss) << "window overlap add min: " << window_envelop_lowest;
  AT_ERROR(ss.str());
}
```
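For context, the NOLA (nonzero overlap-add) condition this guards is that the overlap-added squared-window envelope never vanishes, so the per-sample normalization in the inverse transform is well defined:

$$\min_n \sum_{m} w^2[n - mH] > 0,$$

where $w$ is the window and $H$ the hop length. As I understand it, window_envelop_lowest is the GPU-side result of comparing this envelope minimum against a tiny threshold, which is why inspecting it requires a device-to-host round trip.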
Here, is_scalar_tensor_true, defined at ATen/TensorSubclassLikeUtils.h#L84, calls at::equal:

```cpp
inline bool is_scalar_tensor_true(const Tensor& t) {
  TORCH_INTERNAL_ASSERT(t.dim() == 0)
  TORCH_INTERNAL_ASSERT(t.scalar_type() == kBool)
  return at::equal(t, t.new_ones({}, t.options()));
}
```
which in turn dispatches to the CUDA backend at ATen/native/cuda/Equal.cpp#L29:

```cpp
return at::cuda::eq(self, src).all().item().to<bool>();
```

The .item() call copies the result from device to host, and in doing so blocks the host until all pending work on the GPU has finished.
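A standalone illustration of the effect (not specific to istft): any .item() on a CUDA tensor forces a device-to-host copy and therefore a synchronization point:

```python
import torch

x = torch.randn(1 << 20, device="cuda")
flag = (x.abs() < 1e-7).any()  # result stays on the GPU; the launch is asynchronous
print(flag.item())             # device-to-host copy; blocks until the result is ready
```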
Would it be possible to let users opt out of the NOLA check, e.g. when they already know their window/hop combination satisfies it?
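As a stopgap, the check can be sidestepped in userland by reimplementing the inverse transform with irfft + fold and clamping the window envelope instead of asserting on it. A rough sketch, not torch.istft's actual implementation: it assumes win_length == n_fft, onesided un-normalized input, and center=True, and the function name and eps are mine. It trades the hard error for silent clamping, so it is only safe when the window/hop pair is known to satisfy NOLA (e.g. a Hann window with hop = n_fft // 4).

```python
import torch
import torch.nn.functional as F

def istft_unchecked(spec, n_fft, hop, window, center=True, eps=1e-11):
    """istft without the NOLA assert: divides by a clamped window envelope.

    spec: complex tensor of shape (batch, n_fft // 2 + 1, frames).
    """
    frames = spec.shape[-1]
    out_len = n_fft + hop * (frames - 1)

    # Inverse FFT of each frame, then apply the synthesis window.
    x = torch.fft.irfft(spec, n=n_fft, dim=-2)  # (batch, n_fft, frames)
    x = x * window.view(1, -1, 1)

    # Overlap-add the frames and the squared-window envelope via fold.
    y = F.fold(x, output_size=(1, out_len),
               kernel_size=(1, n_fft), stride=(1, hop))
    wsq = (window * window).view(1, -1, 1).expand(1, -1, frames).contiguous()
    env = F.fold(wsq, output_size=(1, out_len),
                 kernel_size=(1, n_fft), stride=(1, hop))

    # Where torch.istft would raise on a near-zero envelope, just clamp it.
    y = (y / env.clamp_min(eps)).reshape(spec.shape[0], out_len)

    if center:  # undo the padding added by stft(center=True)
        pad = n_fft // 2
        y = y[:, pad:out_len - pad]
    return y
```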