Training is broken if torch>1.5.0 is used

py23 · March 20, 2022, 1:42pm

Hi, I am trying to train this model:

However, training is broken on any torch version>1.5.0. What I mean by this is that performance doesn’t change from epoch to epoch (it stays stuck at around 14 dB on average). I have tested this code on another PC using torch 1.5.0, and there performance was around 30 dB after one epoch and kept increasing.

I have been trying to figure out what the problem is for quite some time, but to no avail. I need this to work on newer versions because I cannot run torch 1.5.0 on this machine. I get the following error:
Unable to find a valid cuDNN algorithm to run convolution
Newer versions run fine, but I can’t get torch 1.5.0 to run even with the suggested solutions found online.

Does anyone know why training this model is broken on torch>1.5.0? It would be a massive help.

Thanks

py23 · March 24, 2022, 2:23pm

Hi there, does anyone have any recommendations on how to solve this?

ptrblck · March 24, 2022, 11:43pm

It’s a bit hard to tell, as you are looking at almost 2 years of changes (1.5.0 was released in April 2020).
You could try to update sequentially and check which version breaks the training first.
E.g. if it’s working in 1.5.0 but not in 1.5.1 this would limit the changes to a small set only and you could check the additional commits between these versions.
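A minimal sketch of that sequential check, in case it helps to script it. The helper `first_breaking_version` and the stub `fake_trains_ok` are hypothetical names, not part of PyTorch. In practice the check would install each version in its own environment, train for an epoch, and compare the metric against a threshold. The stub below only mirrors the hypothetical example above (1.5.0 works, 1.5.1 does not) and is not a measured result.

```python
from typing import Callable, Optional, Sequence


def first_breaking_version(
    versions: Sequence[str],
    trains_ok: Callable[[str], bool],
) -> Optional[str]:
    """Return the first version, in the given order, for which training fails.

    `versions` must be ordered from oldest to newest.
    Returns None if every version passes.
    """
    for version in versions:
        if not trains_ok(version):
            return version
    return None


def fake_trains_ok(version: str) -> bool:
    # Hypothetical stand-in for "install this version, train one epoch,
    # and check that the metric improves". The outcome is illustrative only.
    return version == "1.5.0"


# Releases to walk through, oldest first.
candidates = ["1.5.0", "1.5.1", "1.6.0", "1.7.0", "1.7.1"]
print(first_breaking_version(candidates, fake_trains_ok))
```

A plain oldest-to-newest scan is used rather than a binary search so that it still gives the right answer if training happens to work again in some later release.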

py23 · March 27, 2022, 6:24pm

So the code works on torch 1.5.1 but doesn’t work on torch 1.6.0. I checked the code line by line but can’t find what’s wrong. Looking at the deprecations, I’m not exactly sure how they affect this code. It is mainly main.py and the associated core files that are affected: I tried different models, and training doesn’t work for any of them unless this PyTorch version requirement is met.
Any ideas what might be the problem?

py23 · March 31, 2022, 10:39am

Hi, any ideas on what’s causing this issue?
I am also looking at compiling 1.5.0 from source, but how would I do that for the A100 architecture in Anaconda? I’m not sure it would be possible, though, as I read that I would have to use CUDA 10.1 instead of 11, which is not possible?