Hi
I made a Napari plugin in which I train a CNN in a Qt thread_worker. When my CNN is trained from scratch I don’t get any problem, but when the network is pretrained, I always get this type of error at some point: RuntimeWarning: RuntimeError in aborted thread: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 32, 1, 1]] is at version 9; expected version 8 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!!
Every 20 epochs, I do:
training_thread.pause()
model.eval()  # prediction code
training_thread.resume()
It seems the error occurs when I make the prediction, even though I use with torch.no_grad(). The weirdest thing is that it only happens when the network is pretrained, whereas the CNNs are identical; only the weights differ. Any idea?
Hi Clément!
To debug, see if you can find the tensor of the reported shape. You can then use, for example, print(t._version) to locate the operation that changes its ._version from 8 to 9. That will be the problematic inplace operation that you will need to fix.
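For example, with a standalone toy tensor (not your model), you can watch the counter tick:

import torch

# toy tensor with the reported shape, just to illustrate the counter
t = torch.zeros(4, 32, 1, 1)
print(t._version)   # 0
t.add_(1.0)         # inplace op: bumps the version counter
print(t._version)   # 1
t = t + 1.0         # out-of-place op: new tensor, counter back at 0
print(t._version)   # 0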
It looks like you have run this with with torch.autograd.detect_anomaly():. What is the forward-call backtrace that detect_anomaly() produces telling you? It can be very helpful for debugging if you read through it carefully.
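For reference, here is a minimal, self-contained toy (not your network) that produces the same class of error with detect_anomaly() turned on, so you can see what its forward-call backtrace looks like:

import torch

with torch.autograd.detect_anomaly():
    a = torch.randn(4, requires_grad=True)
    b = a.exp()            # exp() saves its output for the backward pass
    b += 1                 # inplace op modifies that saved tensor
    b.sum().backward()     # raises the "modified by an inplace operation" error,
                           # preceded by the forward-call backtrace for exp()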
Just to be clear, model.eval() does not use with torch.no_grad(), if that’s what you were thinking. (Instead, model.eval() turns off things like Dropout.)
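Here is a small illustration of the difference with a toy network (the names are just for the example):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(2, 8)

net.eval()                   # changes layer behaviour (Dropout, BatchNorm, ...)
y = net(x)
print(y.requires_grad)       # True -- autograd still builds a graph

with torch.no_grad():        # this is what actually disables graph construction
    y = net(x)
print(y.requires_grad)       # False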
Also, the training_thread.pause() stuff looks suspicious. I could well imagine that if you pause the thread during, say, a backward pass, perform a forward pass (for example, for evaluation), and then resume the backward pass, you could break things with an inplace modification.
Best.
K. Frank
As I am working in a thread, it does not show the operation that causes the problem. But I noticed that each time it crashes, the responsible tensor has a different shape. Hence, it seems it is not tied to a single operation, and the one that finally makes everything crash is random.
model.eval() does not refer to PyTorch’s model.eval(); model is a Python object containing a net object. Hence model.eval(params) is a custom function for evaluation, inside which we call model.net.eval().
Another weird thing: it only crashes for transfer learning, and it only crashes in the second case of this function:
if dessin_widget.little_res_window_button.value:
    training_thread.pause()
    # Plot the loss values
    loss_curve = loss_plot.plot(loss_list, pen=(255, 102, 0), clear=True)
    # Refresh the plot
    loss_plot.autoRange()
    channels = [dessin_widget.chan.choices.index(dessin_widget.chan.value),
                dessin_widget.chan2.choices.index(dessin_widget.chan2.value)]
    square_coords = dessin_widget.viewer.value.layers["little window"].data[0]
    # Get the bounds of the region to extract
    x_min = max(int(np.min(square_coords[:, 0])), 0)
    x_max = int(np.max(square_coords[:, 0]))
    y_min = max(int(np.min(square_coords[:, 1])), 0)
    y_max = int(np.max(square_coords[:, 1]))
    cropped_image = dessin_widget.viewer.value.layers["image"].data[x_min:x_max, y_min:y_max].copy()
    mask_labels = model.eval(cropped_image, diameter=int(dessin_widget.diameter_field.value),
                             flow_threshold=float(dessin_widget.flow_th_field.value),
                             cellprob_threshold=float(dessin_widget.cell_th_field.value),
                             channels=channels, omni=OMNI, channel_axis=2)[0]
    dessin_widget.viewer.value.layers["CP result"].data = np.zeros_like(dessin_widget.viewer.value.layers["CP result"].data)
    dessin_widget.viewer.value.layers["CP result"].data[x_min:x_max, y_min:y_max] = mask_labels
    dessin_widget.viewer.value.layers["CP result"].refresh()
    training_thread.resume()
else:
    training_thread.pause()
    # Plot the loss values
    loss_curve = loss_plot.plot(loss_list, pen=(255, 102, 0), clear=True)
    # Refresh the plot
    loss_plot.autoRange()
    if (epoch - 1) % int(dessin_widget.show_res_each_button.value) == 0:
        channels = [dessin_widget.chan.choices.index(dessin_widget.chan.value),
                    dessin_widget.chan2.choices.index(dessin_widget.chan2.value)]
        mask_labels = model.eval(dessin_widget.viewer.value.layers["image"].data,
                                 diameter=int(dessin_widget.diameter_field.value),
                                 flow_threshold=float(dessin_widget.flow_th_field.value),
                                 cellprob_threshold=float(dessin_widget.cell_th_field.value),
                                 channels=channels, omni=OMNI, channel_axis=2)[0]
        dessin_widget.viewer.value.layers["CP result"].data = mask_labels
        dessin_widget.viewer.value.layers["CP result"].refresh()
    training_thread.resume()
Finally, the thread_worker yields to this function at the end of an epoch, once backward() has been computed, hence it should not be a problem. Moreover, why would it work in all cases except transfer learning, and only when predicting the whole image instead of a little crop?
PS: I think I noticed something important: every time I make a prediction, the loss function makes a jump when training resumes, as if it suddenly increases. It seems like the inference changes some weights even though it shouldn’t. How is that possible?
As I don’t have a clean solution, I found a brute-force solution for the moment:
from copy import deepcopy

predict_model = deepcopy(model)
mask_labels = predict_model.eval(dessin_widget.viewer.value.layers["image"].data,
                                 diameter=int(dessin_widget.diameter_field.value),
                                 flow_threshold=float(dessin_widget.flow_th_field.value),
                                 cellprob_threshold=float(dessin_widget.cell_th_field.value),
                                 channels=channels, omni=OMNI, channel_axis=2)[0]
I know it’s absolutely not optimal from a memory point of view, but at least it works until I find a cleaner solution.
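A lighter variant of the same idea that I might try (rough sketch with a toy net, not my actual model) would be to keep one dedicated inference net and only copy the weights into it, instead of deep-copying the whole model object every time:

import torch
import torch.nn as nn

# toy stand-ins: one net used for training, a second one used only for inference
train_net = nn.Linear(8, 1)
infer_net = nn.Linear(8, 1)

# copy only the weights; infer_net's parameters stay outside the training graph
infer_net.load_state_dict(train_net.state_dict())
infer_net.eval()

with torch.no_grad():
    prediction = infer_net(torch.randn(2, 8))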
Hi Clément!
I’ve never looked at detect_anomaly()'s forward-call backtrace when it was produced inside a thread, but are you sure it isn’t providing correct information? It would seem to me that it would still work, even in a thread.
This kind of “randomness” is a very common symptom of threading issues.
QThreads have a well-deserved reputation for being difficult to use correctly.
(I’m not up to date on the QThreads documentation, but in the past it included
some technically-incorrect information and offered some bad design advice.)
You pause a thread, run your model, restart the thread, and you get an
inplace-modification error the second time you do this. This is almost
surely the result of a threading problem.
In principle this sounds okay, but there is a lot of circumstantial evidence
pointing to a threading problem.
Hypothetically, if training_thread.pause() sends an (asynchronous) pause signal to your thread_worker and then returns (without blocking), the call to model.eval() could muck up the computation graph before your thread_worker had, in fact, safely completed its epoch. (This is just one example of the kinds of things that could go wrong.)
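One way to rule this out would be to make the pause an explicit hand-shake: the worker only honors the pause request at an epoch boundary, and the caller blocks until the worker confirms it is idle. A rough, self-contained sketch with plain threading primitives (assumed names, not napari's thread_worker API):

import threading
import time

pause_requested = threading.Event()
paused = threading.Event()
resume = threading.Event()

def training_loop(n_epochs=5):
    for epoch in range(n_epochs):
        time.sleep(0.1)                # stand-in for forward / backward / step
        if pause_requested.is_set():   # only checked between epochs
            paused.set()               # acknowledge: no backward pass is in flight
            resume.wait()              # block until the caller is done predicting
            resume.clear()
            pause_requested.clear()
            paused.clear()

worker = threading.Thread(target=training_loop)
worker.start()

pause_requested.set()
paused.wait()          # caller blocks until the worker is really idle
# ... safe to run the prediction here ...
resume.set()           # let training continue
worker.join()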
The fact that the deepcopy() workaround helps does suggest that model.eval() is somehow messing up the computation graph. The deepcopy() creates new tensors that are not part of the original computation graph, potentially protecting you from the inplace-modification error. That is, model.eval() may still be (somehow) modifying the deepcopied model’s computation graph inplace, but this doesn’t cause an error when training_thread calls .backward() and / or optimizer.step() on the original model.
I would nonetheless advise getting to the bottom of your original issue.
Because of their “randomness,” threading bugs can cause infrequent,
“unexpected” errors. The concern is that your brute-force “solution”
might be sweeping the real issue under the rug by eliminating one of
the frequent (or more obvious) consequences of the core bug, but not
eliminating less frequent (or less obvious) errors.
Good luck!
K. Frank