Tensor move semantics in C++ frontend

I am in the process of doing some deep embedding of libtorch into a larger application. This involves, for example, creating wrappers around datasets so they can be persistent; this contrasts with the usual behavior evident in tutorials in which datasets are created on the fly (e.g. in main()), processed, and then destructed when the example ends.
In browsing the source code using KDevelop, there are a huge number of invocations of std::move(X), where X is a Tensor, or a collection of Tensors. For example, when initializing an optimizer, the first argument is usually something like 'module->parameters()'. When you look at the constructor for the optimizer, a std::move is usually invoked on this argument.
I have been trying to unravel the semantics of these operations, given that caffe2::Tensor carries an intrusive_ptr to TensorImpl, which itself inherits from torch::intrusive_ptr_target. Tensor is described as "moveable", having 'default' move-constructor and move-assignment, but 'delete' copy-constructor and copy-assignment.
What I am trying to be certain of is whether, for example, parameter Tensors maintained by a Module retain their state after being used to construct another object. I know that the parameters() method actually constructs a vector of Tensors from the named_parameters member, and I am guessing that this construction effectively just increments a refcount_ on existing data. (Or does it? A vector.push_back(Tensor) can't copy it, because the copy constructor is disabled, or am I missing something?)
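To make the optimizer example concrete, the pattern I have in mind is roughly the following (a sketch in the spirit of the official C++ frontend tutorial, not my actual code):

    #include <torch/torch.h>

    int main() {
      auto net = torch::nn::Linear(10, 2);

      // parameters() builds a fresh std::vector<torch::Tensor> over the module's
      // named_parameters; the optimizer constructor then std::move()s that vector.
      torch::optim::SGD optimizer(net->parameters(), /*lr=*/0.01);

      // The question: do net's parameter tensors still hold their state here,
      // or did the move inside the constructor "steal" something from the module?
      auto params_again = net->parameters();
    }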
I apologize that I am still getting up to speed on all the C++11/14 culture, having found C++98 adequate for my needs for the last couple of decades. The PyTorch C++ frontend strikes me as very deep and elegant under the hood, but ferreting out the details is a bit challenging.
Thanks,
Eric

Hi,

I'm not a C++ specialist, but the idea I keep in mind is the following: torch::Tensor can be seen as a std::shared_ptr<TensorImpl>.
So doing auto foo = bar; will just bump the reference count in the TensorImpl and give you the exact same underlying tensor.
And doing auto foo = std::move(bar); means that you steal the reference. So foo now contains the same tensor, but the reference count bump did not happen and bar should not be used anymore.
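A minimal sketch of the difference (my own toy example, assuming you build against libtorch):

    #include <torch/torch.h>
    #include <iostream>
    #include <utility>

    int main() {
      auto bar = torch::ones({2, 2});

      auto foo = bar;             // plain copy: same TensorImpl, refcount goes 1 -> 2
      // bar is still perfectly usable here

      auto baz = std::move(bar);  // steals the handle: no refcount bump
      // baz is valid, but bar is now an undefined Tensor and should not be touched

      std::cout << foo.use_count() << " "   // 2: foo and baz share the same impl
                << baz.defined()  << " "    // 1 (true)
                << bar.defined()  << "\n";  // 0 (false)
    }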


Thanks for the quick reply. This is what I was worried about. The reference stealing implicit in std::move suggests that you can only access the internally maintained parameters array of a Module once in this way before it is invalidated.
This is still a little confusing, because you would still need the internal state of a Module to persist in order to serialize it. I would love to see an example where the sort of dataset persistence I am looking for (e.g., in a wrapper class) has been done successfully.

I'm not sure what you mean here.
Which move in the optimizer code do you mean above?

Note that if the object is given by value to the function (which already implies a refcount bump), then a std::move can be used afterwards to steal only the refcount of that local copy. It won't influence anything outside of that function.
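For example (consume() is just a made-up helper for illustration, not a libtorch function):

    #include <torch/torch.h>
    #include <iostream>
    #include <utility>

    // Hypothetical sink: takes its argument by value, so the caller's handle
    // is copied (one refcount bump) before anything gets moved.
    void consume(torch::Tensor t) {
      torch::Tensor kept = std::move(t);  // steals only the local copy's handle
      // ... do something with kept ...
    }

    int main() {
      auto x = torch::randn({3});
      consume(x);
      // x is untouched here: only the by-value copy inside consume() was moved from
      std::cout << x.sum().item<float>() << "\n";
    }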

Apologies, the optimizer might have been a bad example. It appears that the constructors of most objects I am dealing with at the moment (for example, torch::data::Example<>) do take their Tensor arguments by value. In that case, of course, there shouldn't be an issue.
I guess I'm still wrestling with the idea of a "local copy" of a Tensor, which appears not to have a copy constructor. The Tensor class itself does not inherit from anything; it just holds an intrusive_ptr to a TensorImpl, so it doesn't pick up copy semantics from a base class either. (Again, the copy constructor appears to be set to 'delete'.)
It does have a default move constructor, and I guess a 'default' move constructor means that a move is executed on each member, which means a move on the intrusive_ptr being held by it. But it appears that the move constructor on the intrusive_ptr nulls the source pointer, which means the source Tensor has no state any more.
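In toy form, I believe the defaulted move amounts to something like this (using std::shared_ptr as a stand-in for the intrusive_ptr, since both null out the source on move; this is not libtorch code):

    #include <cassert>
    #include <memory>
    #include <utility>

    // Toy stand-in for Tensor: a single smart-pointer member and no user-declared
    // special member functions, so the move constructor is the member-wise default.
    struct FakeTensor {
      std::shared_ptr<int> impl;
    };

    int main() {
      FakeTensor a{std::make_shared<int>(7)};
      FakeTensor b = std::move(a);   // member-wise move: a.impl is moved from
      assert(b.impl && !a.impl);     // b owns the data; a's pointer is now null
    }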
Sorry, there is a certain amount of thinking out loud here, but I'm just trying to understand how Tensor objects work under the hood, so I can use them in a way that doesn't blow up my code. If re-use isn't safe, I may have to resort to the clone() method, which strikes me as inefficient. But the gaps in my understanding are still quite large, so I am keeping my mind open.
Thanks again.

I think in general, no function should ever "steal" your reference to the TensorImpl and make your Tensor object invalid.
You can always pass everything by value and never use std::move if you want, and it will all work.
Afterwards, you can do a minor optimization (removing one refcount bump) by using std::move() when the original Tensor should not be used anymore.
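On the caller's side the trade-off looks something like this (store() is again a hypothetical by-value function, not a libtorch API):

    #include <torch/torch.h>
    #include <utility>

    // Hypothetical sink that takes its argument by value.
    void store(torch::Tensor t) { /* pretend we keep t somewhere */ }

    int main() {
      auto x = torch::randn({3});

      store(x);             // always safe: one extra refcount bump, x stays valid
      store(std::move(x));  // minor optimization: no bump, but x must not be used afterwards
    }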

Thanks, I absolutely agree with you, in principle. However, there is no escaping the std::move(data) in the Example<> constructor. This means that any Tensor I create locally and use to create an element of a Dataset will be invalidated when I create an Example<Tensor, Tensor>.
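For reference, the constructor I mean looks roughly like this (abridged from torch/data/example.h as I remember it, so the details may be slightly off):

    // Abridged sketch of torch::data::Example, not the verbatim header
    template <typename Data = at::Tensor, typename Target = at::Tensor>
    struct Example {
      Example() = default;
      Example(Data data, Target target)
          : data(std::move(data)), target(std::move(target)) {}

      Data data;
      Target target;
    };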

This has no impact on my current implementation, because in all cases where I do that, the local Tensor objects are allowed to go out of scope without accessing them again. My main concern is when I am passing around containers which have Tensors somewhere in them.

I may just have to try some things, and hover over the running code with my handy 'gdb' (or alternatively just let valgrind do the heavy lifting), in case something gets pulled out from under me.

As I said, a working example involving something more complicated than the available tutorials might be helpful. I am going to leave the topic open for at least a bit, in case a C++ frontend guru chimes in.

Hi,

I don't think so: here, the data is passed by value to the function, so a new Tensor object is created, and this new Tensor object is the one being moved. So your own Tensor that you used to create the Example won't be changed.
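In other words, something like this should be fine (a small check I would expect to pass; the shapes are made up):

    #include <torch/torch.h>
    #include <cassert>

    int main() {
      auto data = torch::randn({3, 4});
      auto target = torch::tensor({1});

      // data and target are passed by value; only those by-value copies get moved
      torch::data::Example<> example(data, target);

      // the caller's handles are still valid and still point at the same impls
      assert(data.defined() && target.defined());
      assert(example.data.defined() && example.target.defined());
    }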

Yes, I agree with the pass-by-value semantics there. Perhaps I'm being dense here (long day), but I'm wrestling with how a truly new Tensor object is created from the existing one without a copy constructor. This probably points up a gap in my understanding of copy/move semantics and the like, but what does the compiler do in that case? (Other than, of course, to just push the original tensor onto the stack...)
Sorry, I didn't want to turn this into a meta-discussion about C++. I do appreciate your engaging with me on this. I can always just look at the disassembled code...

I think it depends what you mean by "truly new Tensor" :smiley:

  • If you just want to make sure the same underlying Tensor stays alive, doing auto a = t; does the trick. You can see it as taking another reference from a shared pointer.
  • If you want a new Tensor object (so you are able to change metadata, autograd info, etc.), you can do auto a = t.alias(); to get it in a differentiable way (gradients will flow back), or auto a = t.detach(); to prevent gradient propagation. Note that this will share data with the original tensor, so changing the content of t in place will also change a.
  • If you want a new Tensor object and new memory, you should do auto a = t.clone(); to get it in a differentiable way and auto a = t.detach().clone(); to prevent gradients from flowing. (See the sketch after this list.)
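
A rough side-by-side of those options in code (alias(), detach() and clone() are the actual Tensor methods; the rest is just a sketch):

    #include <torch/torch.h>

    int main() {
      auto t = torch::randn({4}, torch::requires_grad());

      auto a = t;                   // same TensorImpl, just one more reference
      auto b = t.alias();           // new Tensor object, shares data, gradients flow back
      auto c = t.detach();          // new Tensor object, shares data, no gradient propagation
      auto d = t.clone();           // new memory, still differentiable w.r.t. t
      auto e = t.detach().clone();  // new memory, cut off from autograd entirely
    }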

Many thanks, you've given me some ideas to work with. I still need to gain some confidence with the container issue. As you imply, a design in which doing "sensible" things produces unexpected behavior would be a poor one, so that is probably not what is going on here.