How to deep clone a network (module) with the C++ API?

fulltopic · September 11, 2020, 6:29pm

I’ve read the topic “Are there any recommended methods to clone a model?”. It seems that there are off-the-shelf methods for Python applications. What is the corresponding method for a C++ application? Is any extra work (e.g. implementing an interface) required for a custom module?

Thank you very much.

tom · September 11, 2020, 6:41pm

I think module::clone() works on either Cloneable subclasses or AnyModule (there is some trickiness going on that makes it difficult for Module itself).

Best regards

Thomas

fulltopic · September 12, 2020, 1:20am

Thank you for the pointers. Could the following example serve as a demo for Double DQN?

struct TestModule: public torch::nn::Cloneable<TestModule> {
	torch::nn::GRU gru;
	torch::nn::Linear fc;
	int testMember;

	TestModule(int testValue):
		gru(torch::nn::GRUOptions(128, 1024).num_layers(2).batch_first(true)),
			fc(1024,16),
			testMember(testValue) {
		register_module("gru", gru);
		register_module("fc", fc);
	}

	TestModule(const TestModule& other) = default;
	TestModule& operator=(const TestModule& other) = default;
	TestModule(TestModule&& other) = default;
	TestModule& operator=(TestModule&& other) = default;

	~TestModule() = default;

	void reset() override {
		register_module("gru", gru);
		register_module("fc", fc);
	}

	torch::Tensor forward(torch::Tensor input) {
		return input;
		//do sth.
	}
};

void testCloneable() {
	std::shared_ptr<TestModule> net(new TestModule(27));
	torch::optim::RMSprop optimizer(net->parameters(), torch::optim::RMSpropOptions(1e-3).eps(1e-8).alpha(0.99));
	//forward and backward of net


	auto copy = net->clone();
	std::shared_ptr<TestModule> cpyNet = std::dynamic_pointer_cast<TestModule>(copy);
	torch::optim::RMSprop cpyOptimizer(cpyNet->parameters(), torch::optim::RMSpropOptions(1e-3).eps(1e-8).alpha(0.99));
	//forward and backward of cpyNet
}

Thank you very much.

fulltopic · September 12, 2020, 6:54am

I’ve tested it with the following function, and the two nets are still identical after updating the original network (loss.backward + optimizer.step). So it seems this is not a deep clone?

template<typename NetType>
static bool compNet(std::shared_ptr<NetType> net0, std::shared_ptr<NetType> net1) {
	cout << endl << endl;
	cout << "Compare nets " << endl;

	auto params0 = net0->named_parameters(true);
	auto params1 = net1->named_parameters(true);
	for (auto ite = params0.begin(); ite != params0.end(); ite ++) {
		auto key = ite->key();
		cout << "Test param " << key << ": ---------------------------> " << endl;

		Tensor v0 = ite->value();
		Tensor* v1 = params1.find(key);
		// Check for a missing parameter before dereferencing v1.
		if (v1 == nullptr) {
			cout << "Could not find " << key << " in net1 " << endl;
			return false;
		}

		if (v1->is_same(v0)) {
			cout << "Not a clone, just the impl pointer copied" << endl;
//			return false;
		}

		if (v0.dim() != v1->dim()) {
			cout << "Param " << key << " have different dim " << v0.dim() << " != " << v1->dim() << endl;
			return false;
		}
		auto numel0 = v0.numel();
		auto numel1 = v1->numel();
		if (numel0 != numel1) {
			cout << "Different cell number: " << numel0 << " != " << numel1 << endl;
			return false;
		}

		auto size0 = v0.sizes();
		auto size1 = v1->sizes();
		for (int i = 0; i < v0.dim(); i ++) {
			if (v0.size(i) != v1->size(i)) {
				cout << "Size does not match at dim " << i << " " << v0.size(i) << " != " << v1->size(i) << endl;
				return false;
			}
		}

		auto data0 = v0.data_ptr<float>();
		auto data1 = v1->data_ptr<float>();
		for (int i = 0; i < numel0; i ++) {
			if (data0[i] != data1[i]) {
				cout << "Different value of element " << i << ": " << data0[i] << " != " << data1[i] << endl;
				return false;
			}
		}
	}

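	vector<Tensor> buffers0 = net0->buffers(true);
	vector<Tensor> buffers1 = net1->buffers(true);
	// Guard against indexing past the end of buffers1.
	if (buffers0.size() != buffers1.size()) {
		cout << "Different number of buffers: " << buffers0.size() << " != " << buffers1.size() << endl;
		return false;
	}
	for (size_t i = 0; i < buffers0.size(); i ++) {
		if (!buffers0[i].equal(buffers1[i])) {
			cout << "Buffer at " << i << " does not match " << endl;
			return false;
		}
	}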

	auto children0 = net0->children();
	auto children1 = net1->children();
	if (children0.size() != children1.size()) {
		cout << "Different size of children: " << children0.size() << " != " << children1.size() << endl;
		return false;
	}
	for (int i = 0; i < children0.size(); i ++) {
		cout << "Name" << i << ": " << children0[i]->name() << ", " << children1[i]->name() << endl;
	}


	return true;
}

fulltopic · September 13, 2020, 2:46am

The reasons for the above experiment result:

The default copy constructor just copies the sub-module pointers.
Cloneable::clone() only clones the direct members of the module, with recurse=false.
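
The first point can be reproduced without LibTorch. In the C++ frontend, holders such as torch::nn::GRU and torch::nn::Linear wrap a std::shared_ptr to the implementation object, so a defaulted copy constructor copies the pointer rather than the weights. The sketch below is a plain C++ analogy under that assumption; Layer, Net and copy_follows_original are hypothetical names for illustration, not LibTorch types or functions.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a layer implementation; not a LibTorch type.
struct Layer {
    std::vector<float> weights{1.0f, 2.0f};
};

// Hypothetical stand-in for a module that holds its sub-module by shared_ptr.
struct Net {
    std::shared_ptr<Layer> fc = std::make_shared<Layer>();
    Net() = default;
    Net(const Net&) = default;  // copies the pointer, not the Layer it points to
};

// Copies a Net with the defaulted copy constructor, mutates the original,
// and reports whether the copy observed the change.
bool copy_follows_original() {
    Net original;
    Net copy(original);
    assert(copy.fc.get() == original.fc.get());  // both point at one Layer
    original.fc->weights[0] = 42.0f;             // stands in for an optimizer step
    return copy.fc->weights[0] == 42.0f;
}
```

Calling copy_follows_original() from main returns true here, which is the same symptom as the two nets staying identical after the update.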

fatvlad · September 18, 2020, 12:56pm

I had a similar situation with torch::jit::Module. Module::deepcopy didn’t work for me: gradients were undefined for the new module instance. However, serializing and deserializing seems to work here.

Lin_Jia · September 23, 2020, 5:08am

First, it seems that after cloning something you need to invoke detach:

I have not tried the clone method yet, but “first serialize the class to disk, then load it into a new class instance” guarantees a safe and clean deep clone. The disadvantage is that it is slower than a clone API call.

FenixFromTheShadows · June 3, 2021, 1:38am

OK, if anyone is still interested, I have solved the thread author’s mystery, at least for the most part.

The problem lies in the implementation of the pure virtual function “reset”, which is called by the clone function. Re-registering the submodules there is not enough: reset must also re-construct the submodules (or network layers; maybe I am not using the correct terminology), i.e. “gru” and “fc” in the author’s example, and the re-construction must happen before re-registering. (I’m ignoring testMember here, as it is probably not a network parameter, and copying this integer member, if needed, should be relatively easy.) To be more concrete, the reset function in the author’s example should be implemented something like the code below. (Disclaimer: I did not test this exact example, only my own, but the issue is similar enough that I am confident.)

void reset() override {
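    // Construct a new GRU holder explicitly; the options object alone is not a module.
    gru = torch::nn::GRU(torch::nn::GRUOptions(128, 1024).num_layers(2).batch_first(true));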
    fc = torch::nn::Linear(1024, 16);
    register_module("gru", gru);
    register_module("fc", fc);
}

The reason behind this, if I understand it correctly, is that if we only re-register the submodules without re-constructing them, the parameters of the clone are registered to the same memory as those of the old submodules, so the clone is not independent of the original.
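
To make the ordering concrete, here is a plain C++ sketch of the pattern as I understand it: copy the object, call reset() on the copy to build fresh sub-objects, then copy the values over. It is an analogy only, not the LibTorch implementation; Layer, Net and clone_is_independent are hypothetical names, and the exact steps inside Cloneable::clone() are an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a layer implementation; not a LibTorch type.
struct Layer {
    std::vector<float> weights;
    explicit Layer(std::size_t n) : weights(n, 0.0f) {}
};

// Hypothetical stand-in for a Cloneable-style module.
struct Net {
    std::shared_ptr<Layer> fc;

    Net() { reset(); }

    // Re-construct the sub-module so this object owns fresh storage.
    void reset() { fc = std::make_shared<Layer>(2); }

    // Assumed order: shallow copy, reset() on the copy, then copy values over.
    std::shared_ptr<Net> clone() const {
        auto copy = std::make_shared<Net>(*this);  // copy->fc aliases this->fc
        copy->reset();                             // copy->fc is now a new Layer
        assert(copy->fc->weights.size() == fc->weights.size());
        copy->fc->weights = fc->weights;           // same values, separate storage
        return copy;
    }
};

// Clones a Net, mutates the original, and reports whether the clone kept
// its own storage and its old value.
bool clone_is_independent() {
    Net original;
    original.fc->weights = {1.0f, 2.0f};
    std::shared_ptr<Net> copy = original.clone();
    original.fc->weights[0] = 42.0f;  // stands in for an optimizer step
    return copy->fc.get() != original.fc.get() && copy->fc->weights[0] == 1.0f;
}
```

If reset() left fc untouched, the value copy in clone() would assign the Layer to itself and the two objects would keep sharing storage, which is the behaviour seen earlier in the thread.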

This is a really annoying issue, and it took me quite some time to figure out. Thanks to the author for the inspiration. I hope the documentation can be made clearer on this, and that this helps whoever runs into the issue later.