Freezing part of the model parameters causes a CUDA error

Here are my observations:

  1. Train the whole model without freezing any parameters: one epoch finishes with no problem.
  2. Freeze certain parameters: training throws “transform: failed to synchronize: cudaErrorAssert: device-side assert triggered” at a random point. In my case, one epoch has 1170 batches (no shuffle, sequential feed), and it usually errors somewhere around batch 600 to 800.
  3. I haven’t tried on CPU, because it takes more than 2 hours to reach ~600 batches.

This makes me think: if not all parameters are being optimized, could certain parameters accumulate into some bad state (e.g. -inf) as training goes on, eventually causing the error?

Any insights? Thanks.

This could be the case, and the assertion could point towards e.g. a failing index operation.
Could you rerun the code via:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the complete stack trace here?
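
If setting the variable on the command line is inconvenient, here is a rough sketch of setting it from within the script instead (it has to happen before the first CUDA call, so ideally before importing torch):

import os
# Must be set before any CUDA work happens in this process, otherwise it may be ignored.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import (and all CUDA calls) only after the variable is set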

Hi ptrblck,

Thank you so much for the reply. I’ve posted the complete stack trace here. It seems that bce_loss encounters invalid inputs.

After exploring possible causes, I finally found that the non-frozen parameter matrix contains NaN values after ~600 iterations, due to NaN gradients. I tested manually setting the NaN values to 0 when they appear, and training then finishes without error. My question is: why can the gradients of the unfrozen (trainable) parameters become NaN after freezing certain other parameters? Thanks.
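
For reference, a minimal sketch of this kind of workaround (here zeroing NaN gradients right before the optimizer step; model, optimizer, and loss are placeholders for my actual training objects):

# Sketch of the workaround: replace NaN gradients with 0 before the update,
# so the affected parameters are not pushed to NaN.
# `model`, `optimizer`, and `loss` are placeholders for the actual objects.
loss.backward()
for param in model.parameters():
    if param.grad is not None:
        param.grad[torch.isnan(param.grad)] = 0.0
optimizer.step()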

Update:
After adding torch.autograd.set_detect_anomaly(True), the stack trace indicates:

[W python_anomaly_mode.cpp:104] Warning: Error detected in PowBackward0. Traceback of forward call that caused the error:
File "/…/model/compgcn_conv.py", line 81, in compute_norm
deg_inv = deg.pow(-0.5) # D^{-0.5}
(function _print_stack)
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

The corresponding code is:

def compute_norm(self, edge_index, num_ent, edge_weight=None):
    row, col = edge_index
    edge_weight = torch.ones_like(row).float() if edge_weight is None else edge_weight
    # Sum the edge weights per source node to get the (weighted) degree
    deg = scatter_add(edge_weight, row, dim=0, dim_size=num_ent)
    # Added for testing: this assert never triggers, so deg contains no NaN
    assert torch.isnan(deg).sum() == 0
    # Added for testing: forcing all entries positive doesn't help,
    # but "deg = deg.abs() + 1" runs with no errors
    deg = deg.abs()
    deg_inv = deg.pow(-0.5)                              # D^{-0.5}
    deg_inv[deg_inv == float('inf')] = 0
    norm = deg_inv[row] * edge_weight * deg_inv[col]     # D^{-0.5} * A * D^{-0.5} per edge

    return norm

And this happens at the first iteration, not at ~600, which confuses me.

I guess deg might contain zero values, which would create NaN gradients:

x = torch.tensor(0., requires_grad=True)
y = torch.pow(x, 0.5)
y.backward()
print(x.grad)
> tensor(nan)

Yeah, deg always contains zero elements. This explains why torch.autograd.set_detect_anomaly(True) triggers the warning, but the warning shouldn’t be considered an error, right?

What confuses me is that for the first ~600 iterations there is no error. Then a NaN gradient appears, causing certain parameters to be updated to NaN. Thus, in the next iteration, due to the NaN parameters, the pred tensor contains NaN and bce_loss errors out. Any insights on how to debug this? Thanks.
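
In case it helps locate where the first NaN appears, this is the kind of per-iteration check I’m thinking of adding (just a sketch; model is a placeholder for my actual model):

# Sketch of a per-iteration check to locate the first NaN parameter or gradient.
# `model` is a placeholder for the actual model.
def find_nans(model, step):
    for name, param in model.named_parameters():
        if torch.isnan(param).any():
            print(f"step {step}: NaN in parameter {name}")
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"step {step}: NaN in gradient of {name}")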

When you enable anomaly detection, the NaNs will trigger an error. If you are not concerned about it, you could disable anomaly detection and handle the NaN values separately.
The idea of using this utility is to get a runtime error in order to debug the issue.

I guess that deg doesn’t contain an exact zero during the first ~600 iterations and thus still creates valid gradients. To avoid the NaN gradient, you could add a small eps value to the pow operation.
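
Something along these lines (the eps value is arbitrary and just needs to keep deg away from an exact zero):

eps = 1e-12  # arbitrary small value, keeps deg away from an exact zero
deg_inv = (deg + eps).pow(-0.5)  # pow never sees an exact 0, so no NaN gradient

Note that with the eps added, deg_inv won’t become inf anymore, so the deg_inv == float('inf') masking would no longer zero out zero-degree nodes; handle those explicitly (e.g. via deg == 0) if you still want their norm to be zero.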

Thank you so much! Zero elements creating NaN gradients in PowBackward is exactly the cause. Your explanation really helped me fully understand the deg.pow operation in my design scenario. Thank you again. Though this is my first post, I have read a lot of your responses on the forum and learned a lot from them.
