Copying weights and biases

I am trying to convert this tutorial to libtorch:
https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

My problem is with this line (in Python):
target_net.load_state_dict(policy_net.state_dict())

Because there isn’t a direct libtorch equivalent, I used this workaround:

void loadstatedict(torch::nn::Module& model, torch::nn::Module& target_model)
{
	torch::autograd::GradMode::set_enabled(false);  // make parameters copying possible
	auto new_params = target_model.named_parameters(); // parameters to copy from
	auto params = model.named_parameters(true /*recurse*/);
	auto buffers = model.named_buffers(true /*recurse*/);

	for (auto& val : new_params)
	{
		auto name = val.key();
		auto* t = params.find(name);
		if (t != nullptr)
			t->copy_(val.value());
		else
		{
			t = buffers.find(name);
			if (t != nullptr)
				t->copy_(val.value());
		}
	}
	
	torch::autograd::GradMode::set_enabled(true); //Set back
}
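
For completeness, another approach I’ve seen suggested (untested on my side, so treat it as a sketch; the name copy_state and the generic source/destination parameters are my own) is to round-trip the net through an in-memory archive instead of copying tensors by name. It assumes the nets are module holders (e.g. declared with TORCH_MODULE) or std::shared_ptr<torch::nn::Module>, so that torch::save / torch::load accept them:

#include <sstream>
#include <torch/torch.h>

// Copy parameters and buffers from `source` into `destination` by serializing
// to an in-memory stream, mimicking load_state_dict.
template <typename ModuleType>
void copy_state(ModuleType& source, ModuleType& destination)
{
	std::stringstream stream;
	torch::save(source, stream);       // serialize parameters and buffers
	torch::load(destination, stream);  // read them back into the other net
}

This is heavier than the in-place copy above because it serializes every tensor, but it avoids matching parameter names by hand.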

Everything seems to work, but after long runs I usually get this from CUDA at runtime:

C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I have tried:
CUDA_LAUNCH_BLOCKING=1
torch::cuda::synchronize() before the copy
Moving both models to the CPU before copying
Running only on the CPU (it doesn’t SEEM to happen when running only on the CPU)

It feels like some problem syncing data between the CPU and GPU. Everything runs fine for long runs until I start doing the copy (loadstatedict), and even then it seems to work 99% of the time. If I disable the copy, it never crashes.

Using libtorch 1.13.1+cu117.

Thanks for any info.

The error points to an invalid index in:

C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\ScatterGatherKernel.cu:145: block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

Running the code with CUDA_LAUNCH_BLOCKING=1 should show you the line of code causing this issue. Once isolated, you could print the indices and check why they contain invalid values.
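If it helps, something like this minimal helper (the name check_gather_index and its signature are just my own sketch, not part of the tutorial) could be dropped in right before the suspect gather so it fails with a readable message instead of a device-side assert:

#include <torch/torch.h>

// Hypothetical sanity check: verify that every index used for a gather along
// `dim` is inside [0, values.size(dim)) before launching the CUDA kernel.
void check_gather_index(const torch::Tensor& values, const torch::Tensor& index, int64_t dim)
{
	auto idx_cpu = index.to(torch::kCPU);
	const int64_t size = values.size(dim);
	const int64_t min_idx = idx_cpu.min().item<int64_t>();
	const int64_t max_idx = idx_cpu.max().item<int64_t>();
	TORCH_CHECK(min_idx >= 0 && max_idx < size,
		"gather index out of bounds: min=", min_idx,
		" max=", max_idx, " size=", size);
}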


At crash time:
next_target_q_values:
0.01 *
-1.0695
-5.1743
-2.5427
3.3140
4.4763
0.1929
3.3601
3.3892
1.8363
6.2410
[ CUDAFloatType{10} ]

maximum:
10.1474
[ CUDAFloatType{} ]

Crash on this line:
torch::Tensor next_q_value = next_target_q_values.gather(0, maximum.to(torch::kInt64).unsqueeze(0)).squeeze(0);

So maximum is 10.1474, which becomes index 10 after the cast to kInt64 and is outside the valid range of 0 to 9.

This is what I have:

torch::Tensor Learn(torch::Tensor state, torch::Tensor action, torch::Tensor reward, torch::Tensor next_state, torch::Tensor done)
	{

		torch::Device mTorchDevice = torch::Device(mDeviceType);
		bool Display = true;

		MyQnet->zero_grad();
		TargetQnet->zero_grad();

		torch::Tensor q_values = MyQnet->forward(state);																							
		torch::Tensor next_target_q_values = TargetQnet->forward(next_state);
		torch::Tensor next_q_values = MyQnet->forward(next_state);
		torch::Tensor q_value = q_values.gather(0, action.unsqueeze(0)).squeeze(0);
		torch::Tensor maximum = std::get<0>(next_q_values.max(0));
		torch::Tensor next_q_value = next_target_q_values.gather(0, maximum.to(torch::kInt64).unsqueeze(0)).squeeze(0);
		torch::Tensor expected_q_value = reward + FutureRewardRate * next_q_value * (1 - done.to(torch::kFloat32));				
		torch::Tensor loss = torch::mse_loss(q_value, expected_q_value);

		MyQnet_optimizer.zero_grad();
		loss.backward();
		MyQnet_optimizer.step();

		static int Copy_Count = 0;
		Copy_Count++;
		if (Copy_Count == 1000)
		{
			Copy_Count = 0;
			loadstatedict(*MyQnet, *TargetQnet);
		}

		return q_values;
	}

I think I should be taking the maximum of next_target_q_values instead of next_q_values, or that the max should be an argmax.
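
For reference, here is a minimal standalone illustration of the difference (my own example values): .max(dim) returns a (values, indices) tuple, so std::get<0> gives the best Q-value while std::get<1> (or argmax) gives the action index that is safe to use with gather:

#include <torch/torch.h>

int main()
{
	torch::Tensor q = torch::tensor({0.1, 2.5, -0.3});

	auto result = q.max(0);                            // tuple of (values, indices)
	torch::Tensor best_value  = std::get<0>(result);   // 2.5 -> NOT a valid index
	torch::Tensor best_action = std::get<1>(result);   // 1   -> same as q.argmax(0)

	// Using the value as an index is what triggers the device-side assert;
	// the index tensor passed to gather must come from argmax / std::get<1>.
	torch::Tensor picked = q.gather(0, best_action.unsqueeze(0)).squeeze(0);
	return 0;
}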

I’m trying to follow this and I don’t understand why he does it that way:

Good to hear you were able to isolate the invalid indexing!
I’m not familiar enough with the code, so I don’t know which value should be used or whether next_q_values is wrong or just needs clipping.


For anyone who might be searching for this tutorial in the future:

torch::Tensor next_q_value = next_target_q_values.gather(0, maximum.to(torch::kInt64).unsqueeze(0)).squeeze(0);

should be:

torch::Tensor next_q_value = next_target_q_values.gather(0, next_q_values.argmax(0).unsqueeze(0)).squeeze(0);
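In context, the relevant part of Learn then looks roughly like this (same variable names as above; the braced NoGradGuard block is an extra precaution I would add so the target values stay out of the autograd graph, it is not strictly part of the fix):

torch::Tensor q_values = MyQnet->forward(state);
torch::Tensor q_value  = q_values.gather(0, action.unsqueeze(0)).squeeze(0);

torch::Tensor expected_q_value;
{
	torch::NoGradGuard no_grad;  // keep the target out of the graph
	torch::Tensor next_q_values        = MyQnet->forward(next_state);      // online net picks the action
	torch::Tensor next_target_q_values = TargetQnet->forward(next_state);  // target net evaluates it
	torch::Tensor next_q_value = next_target_q_values
	                                 .gather(0, next_q_values.argmax(0).unsqueeze(0))
	                                 .squeeze(0);
	expected_q_value = reward + FutureRewardRate * next_q_value * (1 - done.to(torch::kFloat32));
}

torch::Tensor loss = torch::mse_loss(q_value, expected_q_value);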

Thank you for the tips on how to debug data on the GPU with CUDA. Knowing where to look is everything.