torch.istft NOLA check causes synchronization and massive slowdown

torch.istft spends almost all of its time in the NOLA (nonzero overlap-add) check.

Below is a screenshot of a profiler trace of stft followed by istft on a signal of shape (32, 4, 2, 7 * 44100). This is just a toy example, but when istft is part of a larger neural network, it forces everything to synchronize.
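
For reference, a minimal script along the following lines should reproduce the trace. The STFT parameters (n_fft, hop) are made up for illustration, since the original setup is not stated; note that torch.stft only accepts 1-D or 2-D input, so the leading dimensions are flattened first.

import torch

# Hypothetical parameters; the post does not state n_fft or hop_length.
n_fft, hop = 1024, 256
x = torch.randn(32, 4, 2, 7 * 44100, device="cuda")
window = torch.hann_window(n_fft, device="cuda")

# stft only takes 1-D or 2-D input, so flatten the leading dimensions.
flat = x.reshape(-1, x.shape[-1])
spec = torch.stft(flat, n_fft, hop_length=hop, window=window,
                  return_complex=True)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
) as prof:
    y = torch.istft(spec, n_fft, hop_length=hop, window=window,
                    length=flat.shape[-1])
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))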

I tracked down the cause of the synchronization (see the code trail below), but I am wondering whether there is any way to avoid it.

From ATen/native/SpectralOps.cpp#L1154:

if (at::is_scalar_tensor_true(window_envelop_lowest)) {
  std::ostringstream ss;
  REPR(ss) << "window overlap add min: " << window_envelop_lowest;
  AT_ERROR(ss.str());
}

where the call to is_scalar_tensor_true invokes at::equal in ATen/TensorSubclassLikeUtils.h#L84:

inline bool is_scalar_tensor_true(const Tensor& t) {
  TORCH_INTERNAL_ASSERT(t.dim() == 0)
  TORCH_INTERNAL_ASSERT(t.scalar_type() == kBool)
  return at::equal(t, t.new_ones({}, t.options()));
}

which in turn calls into ATen/native/cuda/Equal.cpp#L29:

return at::cuda::eq(self, src).all().item().to<bool>();

The .item() call does a memcpy from device to host, which synchronizes all pending operations on the GPU.
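
The blocking behavior is easy to see in isolation; any .item() on a CUDA tensor stalls the host until the GPU catches up (a minimal illustration, not from the trace above):

import torch

a = torch.randn(1_000_000, device="cuda")
flag = (a > 0).all()  # runs asynchronously, result stays on the GPU
ok = flag.item()      # device-to-host copy: blocks until the GPU is done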
Is it possible to allow users to bypass the NOLA check if they want?
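
In the meantime, one workaround is to reimplement the inverse STFT in Python with irfft plus an explicit overlap-add, which never runs the NOLA check. Below is a rough sketch, assuming a 3-D complex spectrogram produced by torch.stft with center=True and onesided=True; the clamp replaces the NOLA check rather than validating it, so incorrect output for a window that violates NOLA is on the caller.

import torch
import torch.nn.functional as F

def istft_no_nola_check(spec, n_fft, hop, window, length):
    # spec: complex tensor of shape (batch, n_fft // 2 + 1, frames), as
    # produced by torch.stft(..., center=True, onesided=True,
    # return_complex=True).
    batch, _, frames = spec.shape
    # Inverse real FFT of every frame: (batch, n_fft, frames).
    frames_time = torch.fft.irfft(spec, n=n_fft, dim=1)
    # Apply the synthesis window to each frame.
    frames_time = frames_time * window.view(1, n_fft, 1)
    # Overlap-add via fold, treating the signal as a one-pixel-tall image
    # so that the 2-D fold implements 1-D overlap-add.
    out_len = n_fft + hop * (frames - 1)
    signal = F.fold(
        frames_time, output_size=(1, out_len),
        kernel_size=(1, n_fft), stride=(1, hop),
    ).squeeze(1).squeeze(1)  # -> (batch, out_len)
    # Overlap-added squared-window envelope: this is exactly the quantity
    # the NOLA check inspects; clamp it instead of testing it on the host.
    win_sq = (window ** 2).view(1, n_fft, 1).expand(1, n_fft, frames)
    envelope = F.fold(
        win_sq.contiguous(), output_size=(1, out_len),
        kernel_size=(1, n_fft), stride=(1, hop),
    ).squeeze(1).squeeze(1)
    signal = signal / envelope.clamp(min=1e-11)
    # Undo the center=True padding of n_fft // 2 on each side.
    return signal[:, n_fft // 2 : n_fft // 2 + length]

For a window that actually satisfies NOLA, this should match torch.istft up to numerical error, and it launches only GPU kernels, so nothing forces a host synchronization.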

A bypass flag might be a good feature request, as we've implemented a similar approach for some linalg methods. Could you create this feature request on GitHub so that the code owners can discuss it, please?
