Crash with loss.backward() and autograd

Hello,

I am failing to understand how to use backward or autograd :woozy_face:

My program crashes every time it executes “auto gradients = torch::autograd::grad({ output }, { input }, { grad_output }, true);”.

I also tried examples that only call loss.backward(), and they crashed as well.
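
For reference, the loss.backward()-only examples I tried were roughly of this shape (a minimal sketch, not my exact code):

	#include <torch/torch.h>
	#include <iostream>

	int main() {
		// Minimal forward/backward pass, everything on the CPU
		auto model = torch::nn::Linear(4, 3);
		auto input = torch::randn({ 3, 4 });
		auto target = torch::randn({ 3, 3 });

		auto output = model(input);
		auto loss = torch::nn::functional::mse_loss(output, target);

		loss.backward(); // this is the call that crashes in my tests

		std::cout << model->weight.grad() << std::endl;
		return 0;
	}

The full snippet that crashes at the grad call is below: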

	// Choose the device: CPU or GPU
	torch::Device device(torch::kCPU); // use torch::kCPU for the CPU

	auto model = torch::nn::Linear(4, 3);

	model->to(device);

	// Generate input and target data and move them to the chosen device
	auto input = torch::randn({ 3, 4 }).to(device);
	input.requires_grad_(true);

	auto target = torch::randn({ 3, 3 }).to(device);

	auto output = model(input);

	// Calculate the loss
	auto loss = torch::nn::MSELoss()(output, target);

	// Use the gradient norm as a penalty
	auto grad_output = torch::ones_like(output).to(device);

	// Compute the gradients
	auto gradients = torch::autograd::grad({ output }, { input }, { grad_output }, true);
	auto gradient = gradients[0];

	auto gradient_penalty = torch::pow((gradient.norm(2, /*dim=*/1) - 1), 2).mean();

	// Add the gradient penalty to the loss
	auto combined_loss = loss + gradient_penalty;
	combined_loss.backward();
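
As a side question (not necessarily related to the crash): since the gradient penalty is differentiated again by combined_loss.backward(), my understanding is that the grad call should also pass create_graph=true so that the returned gradient stays part of the graph. A sketch of what I mean:

	// Sketch: the same call with retain_graph and create_graph spelled out,
	// so that the gradient penalty can itself be backpropagated through
	auto gradients = torch::autograd::grad(
		{ output },       // outputs
		{ input },        // inputs
		{ grad_output },  // grad_outputs
		/*retain_graph=*/true,
		/*create_graph=*/true);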

Do you know what could cause these crashes?

Laurick

Before crashing, the program prints the following values:

Input:
0.6896 0.0109 -0.8927 -0.6019
-1.2692 -0.3748 1.2181 1.2204
1.5659 0.6650 0.2665 0.3639
[ CUDAFloatType{3,4} ]

Target:
1.1027 1.2941 0.5516
-1.4952 -0.2794 0.0011
-0.2567 1.1541 0.6320
[ CUDAFloatType{3,3} ]

Output:
0.1708 0.4900 -0.3538
-0.1719 0.5131 -0.8069
0.8201 -0.4959 -0.9887
[ CUDAFloatType{3,3} ]

grad_output:
1 1 1
1 1 1
1 1 1
[ CUDAFloatType{3,3} ]

Could you describe what exactly is crashing and post the error message here, please?

Hello,

Thank you for your answer

Sure, this is the message I get:

“Microsoft C++ exception: c10::Error at memory location 0x0000009C920FD390”

This seems to point to a memory violation on the host. Could you try to capture a stack trace and post it here?

The versions I use are:
C++17
Visual Studio 2019
LibTorch: 2.0.0+cu118

The stack trace resolves everything in my own code except the calls into external code:

c10.dll!00007ffe8ce9ce9e()	Unknown
torch_cpu.dll!00007ffe5f2b1e16()	Unknown
torch_cpu.dll!00007ffe5f2b19fa()	Unknown
torch_cpu.dll!00007ffe5f2eb48a()	Unknown
torch_cpu.dll!00007ffe5cad7f34()	Unknown
torch_cpu.dll!00007ffe5c8f8b3f()	Unknown

It seems the debugger cannot resolve the symbols inside c10.dll and torch_cpu.dll; I also don't know why torch_cuda.dll does not appear here.

This issue seems to be related to libtorch model predict cuda convert to cpu: C10::error at memory location · Issue #73912 · pytorch/pytorch · GitHub
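
If it helps with the diagnosis, I can add a runtime check along these lines (a sketch) to verify whether the CUDA backend (torch_cuda.dll) is actually being loaded before anything is moved to the GPU:

	#include <torch/torch.h>
	#include <iostream>

	int main() {
		// Sketch: report whether the CUDA backend was loaded at all; if
		// torch_cuda.dll is missing or not linked, is_available() should
		// return false and we fall back to the CPU.
		std::cout << "CUDA available: " << torch::cuda::is_available() << std::endl;
		std::cout << "CUDA devices:   " << torch::cuda::device_count() << std::endl;

		torch::Device device = torch::cuda::is_available() ? torch::kCUDA : torch::kCPU;
		auto t = torch::randn({ 2, 2 }).to(device);
		std::cout << t << std::endl;
		return 0;
	}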