Unknown errors with torch.abs() and F.elu()

Hello everyone,

I’ve been debugging for many hours, but I still haven’t been able to figure out why the following two calls fail.

  1. torch.abs() raises an unknown error when run from the terminal with PyTorch Lightning and Wandb (if that matters).

    • no NaN values
    • range is (0, 1).
    • size is (20,1600,16)
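    • the tensor sums to 12700 over 20 × 1600 × 16 = 512,000 elements (mean about 0.025), so most entries are probably zero or near zero, i.e. we probably have a “sparse tensor”
      * This “sparse tensor” idea is the only cause for the error that came to mind, as I found several relevant posts online.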
  2. F.elu(x) produces unknown error.

    • x is also in the range (0, 1), with no NaN values.
    • this is taking place after the two down-sampling stages, in the first up-sampling stage.
    • within the first up-sampling stage, this is after 2 convolutional layers + 2 F.elu(x) + 3rd convolutional layer.
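The checks listed above can be gathered into one small helper. This is only a sketch: `describe` is a hypothetical name, and the tensor `x` is a random stand-in with the same shape as the one in the post, not the real data. One thing it makes visible is the `layout` field: a dense tensor that merely contains many zeros still has layout `torch.strided`, which is different from a tensor stored in one of PyTorch's sparse layouts.

```python
import torch
import torch.nn.functional as F


def describe(name, t):
    """Collect basic facts about a tensor: shape, layout, NaNs, range, sum."""
    # Convert to a dense view only if the tensor uses a sparse storage layout.
    dense = t.to_dense() if t.layout != torch.strided else t
    return {
        "name": name,
        "shape": tuple(t.shape),
        "dtype": t.dtype,
        "device": str(t.device),
        "layout": t.layout,
        "has_nan": bool(torch.isnan(dense).any()),
        "min": dense.min().item(),
        "max": dense.max().item(),
        "sum": dense.sum().item(),
        "zero_fraction": (dense == 0).float().mean().item(),
    }


# Stand-in data (assumption): same shape as in the post, mostly zeros, values in [0, 1).
x = torch.rand(20, 1600, 16)
x = x * (torch.rand_like(x) < 0.05)

info = describe("x", x)
print(info)

# On a healthy dense tensor like this, both calls run without error.
y = torch.abs(x)
z = F.elu(x)
```

If the real tensor reports `layout` as `torch.strided`, the many zeros alone should not matter to either call, which would point away from the sparse-tensor explanation.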

I am very puzzled… Could both of these be due to a memory error?
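One way to test the memory hypothesis, assuming the code runs on a GPU: CUDA errors are reported asynchronously, so the line named in the traceback (for example torch.abs() or F.elu()) is not always the line that actually failed. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable before CUDA is initialised makes kernel launches synchronous, which usually gives a more accurate traceback. The sketch below uses a hypothetical helper `run_checked` and a random stand-in tensor; it falls back to the CPU when no GPU is present.

```python
import os

# Assumption: this runs before the first CUDA call, otherwise it has no effect.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch
import torch.nn.functional as F


def run_checked(name, fn, *args):
    """Run fn(*args) and report whether a failure looks like out-of-memory."""
    try:
        out = fn(*args)
        if torch.cuda.is_available():
            # Force any pending asynchronous CUDA error to surface here.
            torch.cuda.synchronize()
        return out, None
    except RuntimeError as err:
        kind = "oom" if "out of memory" in str(err).lower() else "other"
        return None, f"{name}: {kind}: {err}"


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(20, 1600, 16, device=device)  # stand-in for the real tensor

abs_out, abs_err = run_checked("torch.abs", torch.abs, x)
elu_out, elu_err = run_checked("F.elu", F.elu, x)
print(abs_err, elu_err)
```

Running the same failing step once on the CPU is another cheap check, since CPU errors are raised at the offending line and tend to have clearer messages.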

Update: Sadly, I am now fairly sure an out-of-memory error is involved in this.

I have tried clearing the cache, but no luck. (Reference: fairseq/trainer.py at commit 50a671f78d0c8de0392f924180db72ac9b41b801 in pytorch/fairseq on GitHub)
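For reference, a minimal sketch of what clearing the cache typically looks like, assuming it was done with `torch.cuda.empty_cache()`. The helper name `free_cached_memory` is hypothetical. Note that `empty_cache()` only returns unused cached blocks to the driver; memory held by tensors that are still referenced (activations, stored losses with their graphs, and so on) is not released, which may be why it did not help here.

```python
import gc

import torch


def free_cached_memory():
    """Drop unreachable Python objects, then release unused cached GPU blocks."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        # Bytes currently held by live tensors vs. bytes reserved by the allocator.
        return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
    return 0, 0  # no GPU present, nothing to report


allocated, reserved = free_cached_memory()
print(f"allocated={allocated} bytes, reserved={reserved} bytes")
```

If `allocated` stays high after this call, the memory is held by live tensors rather than by the cache.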

Currently I am using batch size 64, but I guess I might have to try batch size 32.
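If dropping to batch size 32 changes training behaviour, gradient accumulation is one option: run two micro-batches of 32 and step the optimizer once, which keeps the effective batch size at 64 while holding only 32 samples' activations in memory at a time. The sketch below is in plain PyTorch with a hypothetical linear model and random data, not the model from the post. It also checks that the accumulated gradient matches the full-batch gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the real model and data.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
inputs = torch.rand(64, 16)
targets = torch.rand(64, 1)

# Reference: gradient from one full batch of 64.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Same gradient from two micro-batches of 32.
micro_batch_size = 32
accumulation_steps = 2
optimizer.zero_grad()
for step in range(accumulation_steps):
    start = step * micro_batch_size
    xb = inputs[start:start + micro_batch_size]
    yb = targets[start:start + micro_batch_size]
    # Divide so the summed gradients equal the mean over all 64 samples.
    loss = loss_fn(model(xb), yb) / accumulation_steps
    loss.backward()  # gradients add up across micro-batches
accumulated_grad = model.weight.grad.clone()
optimizer.step()

print(torch.allclose(accumulated_grad, full_grad, atol=1e-6))
```

The match is exact only for models without batch-dependent layers such as batch normalisation. In PyTorch Lightning the same effect should be available through the `accumulate_grad_batches` argument of `Trainer`.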

Still open to more solutions and tips.

P.S. My code is fully vectorized.