RuntimeError: setStorage

I have been fighting with a weird RuntimeError which I cannot find the cause of. It is:

RuntimeError: setStorage: sizes [1, 1, 595, 768], strides [456960, 456960, 768, 1], storage offset 0, and itemsize 4 requiring a storage size of 1827840 are out of bounds for storage of size 0

and the stack trace tells me that it is raised in this line:

image_tensors = [images_tensors[i].unsqueeze(0) for i in ids] 

The above code is executed on the CPU during preprocessing. The error is non-deterministic and I cannot reproduce it in a controlled setting. I observed that it happens only if num_workers > 0 in the DataLoader. Usually I need to wait hours for it to occur.
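
For context, the loader is set up roughly like this (the dataset class below is a placeholder sketch, not my real code):

import torch
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):  # hypothetical stand-in for the real dataset
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # load and preprocess one image into a [1, H, W] float32 tensor
        return torch.zeros(1, 595, 768)

loader = DataLoader(ImageDataset(), batch_size=8,
                    num_workers=4)  # with num_workers=0 the error never shows up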

I really don’t get why the stride values are the product of the image size (595*768 = 456960).
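
Actually, a quick check in plain PyTorch shows that these are exactly the default contiguous strides for that shape, so the strides themselves look fine; the suspicious part is the storage of size 0:

import torch

t = torch.empty(1, 1, 595, 768)      # contiguous float32, same shape as in the error
print(t.stride())                    # (456960, 456960, 768, 1)
print(t.numel() * t.element_size())  # 1827840 bytes, matching "requiring a storage size of 1827840"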

Could you help me debug this problem?

I am also facing the same issue. Please help.
Thanks

So I think I found a workaround. It is very inelegant, so I am a bit ashamed to even share it ;).

I observed that if you try/except this line, you can then evaluate the “problematic” line without any errors. (This is really weird. It sounds like some kind of race condition.) So the solution is to just wrap this line in a try/except/retry loop, e.g.

for attempt in range(100):  # renamed so it doesn't clash with the comprehension's i
    try:
        image_tensors = [images_tensors[i].unsqueeze(0) for i in ids]
    except RuntimeError:
        logger.warning("This nasty bug again! Ghrrr")
    else:
        break

Believe it or not, this works flawlessly for me. 100 retries are probably not necessary; maybe even 2 would work. @yugaljain1999, let me know if this works for you too. I am also very curious what error and stack trace you are seeing.
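
If someone prefers the same idea in a slightly less ad hoc form, here is a small sketch of a helper (the name and the retry count are arbitrary; the final call lets the exception propagate if it never succeeds):

import logging

logger = logging.getLogger(__name__)

def retry_on_runtime_error(fn, retries=5):
    # re-evaluate fn() a few times; the bug usually vanishes on a retry
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:
            logger.warning("This nasty bug again! Ghrrr (attempt %d)", attempt + 1)
    return fn()  # last attempt; re-raises if it still fails

# usage, with images_tensors and ids as in the snippet above:
# image_tensors = retry_on_runtime_error(
#     lambda: [images_tensors[i].unsqueeze(0) for i in ids])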

Did you run into this error just recently after changing anything in your setup, e.g. after updating PyTorch, or did you see it (sometimes) in the past already?

This project is fairly new (several months), so I have been using PyTorch 2.0.1 from the beginning. Thus, I have no data to tell whether this problem was already there in 1.3. The issue happens once in several hours of training, so it is rare and annoying at the same time.

I am facing this error -

RuntimeError: setStorage: sizes [96, 1], strides [1, 0], storage offset 0, and itemsize 8 requiring a storage size of 768 are out of bounds for storage of size 0

and this stacktrace -

torch.gather(
    cand_bbsz_idx, dim=1, index=active_hypos,
    out=active_bbsz_idx,
)
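
For context, with dim=1 this gather picks, for each row of cand_bbsz_idx, the columns listed in active_hypos and writes the result into the preallocated active_bbsz_idx. A toy version with made-up values:

import torch

cand_bbsz_idx = torch.arange(12).view(2, 6)    # hypothetical candidate indices
active_hypos = torch.tensor([[0, 2], [1, 3]])  # per-row column selections
active_bbsz_idx = torch.empty(2, 2, dtype=torch.long)
torch.gather(cand_bbsz_idx, dim=1, index=active_hypos, out=active_bbsz_idx)
print(active_bbsz_idx)  # tensor([[0, 2], [7, 9]])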

Should I go with the try/except approach for this? @ptrblck @wjaskowski
I would appreciate it if you could help me with this, as it’s very important for me.
This issue wasn’t occurring a month ago, but now it suddenly happens, which is strange.
I was on PyTorch 1.13.1.
I tried upgrading to the latest version as well, but the same error is there.

What changed in your setup?

Nothing, everything is the same since then.

@ptrblck So I guess this might be an issue in PyTorch. It happens on both V100 and A100 machines. My workaround still works fine. Is there any way I could help pin it down?

Without a code snippet to reproduce the issue I won’t be able to do much. Also, apparently nothing changed in @yugaljain1999’s setup but somehow the error suddenly popped up, which also doesn’t sound too promising for debugging.
You could try to launch the script with a debugger to get more information about when exactly the error is raised, which might point to some specific conditions in your script.
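
A minimal sketch of what I mean, with main as a placeholder for your entry point; it drops you into a post-mortem prompt at the raise site:

import pdb
import traceback

try:
    main()  # placeholder for your training / preprocessing entry point
except RuntimeError:
    traceback.print_exc()
    pdb.post_mortem()  # inspect the offending tensors and their storages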

@ptrblck The debugger is of no help. As I wrote, when the exception is raised, I can evaluate the expression that raised it without any problems. So I guess this must be some internal race condition between two CPU processes.

In that case maybe bisecting the nightly binaries might help, assuming you are able to recreate the issue quickly, which would then allow us to check for related commits potentially causing the issue. Are you using a nightly release or a “stable” one?

I am using stable 2.0.1. When 2.1 is out, I can let you know if it still happens. For now, my nasty workaround has been working flawlessly.