How to debug a network that crashes after 2000 steps?

I am trying to use NasNet to do feature extraction for an OCR use case.

However, after about 2300 steps I get the following error:


File "C:\Users\tryck\Documents\OCR_from_scratch\model_custom\NasNet_layers.py", line 159, in forward
    comb_iter_1 = comb_iter_1_left + comb_iter_1_right
RuntimeError: The size of tensor a (16) must match the size of tensor b (9) at non-singleton dimension 0

My code is deterministic: I set a seed and disable non-deterministic CUDA operations.
The error is always the same, but it does not always appear at the same step.

What should I do to solve this issue?

Hi, there are several approaches that I can think of.

  • Assign each data item a unique identifier (e.g. the image path) and log it when the error occurs. Perhaps that particular datapoint is a bit funky?
  • Build a test case for your dataloader where you loop through all the data and assert that each item has the correct shape, dtype, etc. (see the first sketch after this list).
  • Make your code saveable/resumable so that before debugging you only have to redo maybe 100 steps instead of all 2000 (see the second sketch after this list).
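
A minimal sketch of such a dataloader check, assuming your Dataset can return the sample's path alongside the tensor; the names dataset, image, label, and path, as well as the expected shape, are placeholders for your own setup:

import torch
from torch.utils.data import DataLoader

# Iterate one sample at a time so a failing assert points at exactly one item.
loader = DataLoader(dataset, batch_size=1, shuffle=False)
for image, label, path in loader:
    assert image.dtype == torch.float32, f"bad dtype for {path}: {image.dtype}"
    assert image.shape[1:] == (1, 64, 256), f"bad shape for {path}: {tuple(image.shape)}"

And for saving/resuming, something along these lines (model, optimizer, and step are hypothetical names for your own objects):

torch.save({"step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict()},
           "checkpoint.pt")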

In your case it seems that comb_iter_1_left has a batch size of 16 while the right one has 9. Why that happens is impossible for me to say, but perhaps your dataset isn't evenly divisible by 16, so the last batch only contains 9 samples?
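
If that is the cause, one common fix is to drop the incomplete final batch; a minimal sketch, with dataset standing in for your own dataset:

from torch.utils.data import DataLoader

# drop_last=True discards the trailing batch when len(dataset) is not a
# multiple of batch_size, so every batch the model sees has exactly 16 items.
loader = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)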

Edit: What steps are you taking to make sure that your program is deterministic?

To assert that the code is deterministic, I set the following values (the seed itself is arbitrary):

import random
import numpy as np
import torch
import torch.backends.cudnn as cudnn

# Seed every RNG the training pipeline touches.
random.seed(opt.manualSeed)
np.random.seed(opt.manualSeed)
torch.manual_seed(opt.manualSeed)
torch.cuda.manual_seed(opt.manualSeed)

cudnn.benchmark = True
cudnn.deterministic = True

OK, that looks good. But set benchmark to False, in accordance with the docs. Any thoughts on my reply?
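
That is, the cuDNN flags would read (a minimal sketch of the corrected settings):

import torch.backends.cudnn as cudnn

cudnn.benchmark = False     # don't auto-tune: benchmarking picks the fastest kernel, which can vary between runs
cudnn.deterministic = True  # restrict cuDNN to deterministic implementations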


I checked, and my dataset is indeed not divisible by 16, so I modified it so that it is. I also changed cudnn.benchmark to False, as given in the linked docs.

Thanks for your help.
