Copy_() and memory format

I know that in PyTorch 1.5, to() and clone() can preserve memory format, so we can send non-contiguous tensors between devices.

I wonder what the situation is for copy_(): can we send non-contiguous tensors with it?

If not, is there any suggested workaround for avoiding an extra copy? For example:

import torch

a = torch.randn(2, 10, device="cuda:1").share_memory_()  # contiguous destination on cuda:1
b = torch.randn(10, 2, device="cuda:0")
b = torch.transpose(b, 0, 1)  # non-contiguous view, shape (2, 10)
a.copy_(b)  # is it OK?

In the example above, we want to avoid to()/clone() so that we don't create a new tensor and then have to move it into shared memory.


Hi,

.copy_() will not change the contiguity of any Tensor.
It will just read the content from b and write it to a, without changing a's size or strides.


a.copy_ is an in-place operation and it never changes the strides (or memory format) of a. So the result of a.copy_(b) is going to be a with the data of b and the strides of a.
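
A minimal sketch of this behaviour (shapes are illustrative):

import torch

a = torch.empty(2, 3)        # contiguous destination, strides (3, 1)
b = torch.randn(3, 2).t()    # non-contiguous view, strides (1, 2)

a.copy_(b)                     # element-wise copy of b's values into a
print(a.stride(), b.stride())  # (3, 1) (1, 2): a keeps its own strides
print(torch.equal(a, b))       # True: the contents match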


Thanks! Is there a way to copy the stride?

If a is the same size as b, you can aggressively restride a with a.as_strided_(b.shape, b.stride()) and then do a.copy_(b) as the next step. But this will surely break autograd.
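
A rough sketch of that restriding pattern, assuming two GPUs are available and a's storage is large enough for b's strides (names and shapes are illustrative):

import torch

b = torch.randn(10, 2, device="cuda:0").transpose(0, 1)  # non-contiguous, shape (2, 10)
a = torch.empty(2, 10, device="cuda:1")                   # contiguous destination

a.as_strided_(b.shape, b.stride())  # give a the same sizes/strides as b (no checks!)
a.copy_(b)                          # then copy the values across devices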


For (good) reasons, .as_strided() is actually supported by autograd, so that will work :smiley:

But even beyond that, here, since you overwrite all the content of a with the copy anyway, all the gradients will flow towards b and the original value of a will just get zeros.
So autograd will work just fine :slight_smile:
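
A small sketch of that gradient flow (names and shapes are illustrative):

import torch

b = torch.randn(2, 3, requires_grad=True)
a = torch.empty(3, 2).t()   # non-contiguous destination, does not require grad

a.copy_(b)                  # a now carries a grad history pointing at b
a.sum().backward()
print(b.grad)               # all ones: the gradients flow towards b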


I somehow still get data corruption, doing just the forward pass and calling as_strided_ when the sizes are equal.
If I replace copy_() with to(), it's totally OK.

I wonder if it's related to CUDA streams or something?
a.copy_(b) when a, b are on different devices?

Well, having the same size doesn't mean that the as_strided will be valid.
Is a contiguous? If a already has some overlapping memory, then its backing memory won't be big enough :confused:
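
For reference, a hypothetical helper (not a PyTorch API) that sketches this check, assuming a has storage_offset 0 and b has non-negative strides:

def restride_is_safe(a, b):
    # Largest flat index that a.as_strided_(b.shape, b.stride()) would touch, plus one.
    needed = 1 + sum((size - 1) * stride for size, stride in zip(b.shape, b.stride()))
    return a.storage().size() >= needed

# e.g. guard the restriding above:
# if restride_is_safe(a, b):
#     a.as_strided_(b.shape, b.stride()).copy_(b)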

I made sure it's contiguous too.
Here is what I did:

device = ...
ranks = [....]
saved = [None for rank in ranks]
a = saved[rank]
if (a is not None
        and b.size() == a.size()
        and b.storage_offset() == a.storage_offset()
        and b.stride() == a.stride()
        and b.is_contiguous() and a.is_contiguous()):
    # no need to call as_strided_
    a.copy_(b)
else:
    a = b.to(device)
    saved[rank] = a

When we replace that ugly if with if False:, everything works.
(What happens next is sending a through a queue, cloning it on the receiver, and then a normal neural net.)

One of the risks of as_strided is that you can do fairly bad stuff. In particular, there are no checks and you can end up reading out of bounds of the original Tensor’s values (or even out of the Storage backing it).

In the code above, if you just do .to() the first time and .copy_() afterwards, the layout of a will just be the layout of the first b. Is that not OK?
Why is it so important to keep b’s layout at each iteration?

I thought it would be OK (that’s exactly what I did at first, just the is not None check), but it didn’t work.
Then I gradually added more checks.

So you mean this does not work?

device = ...
ranks = [....]
saved = [None for rank in ranks]
a = saved[rank]
if a is not None:
    a.copy_(b)
else:
    a = b.to(device)
    saved[rank] = a

What is the issue you see with this?

exactly.

I tried two networks.

  1. WideResNet: it works fine with copy_(). All b tensors are contiguous there.
  2. GPT-2 (from HuggingFace): it does not work. I know that some b tensors are non-contiguous.

For GPT-2 the task is zero-shot evaluation on WikiText-2.
With to(device) I recover the perplexity from the paper (~29).
With copy_() the perplexity explodes (crazy high numbers, like 8400, 121931, and so on).

I tried to look at the tensors with the debugger; they look fine.

Oh OK… that looks more like a multiprocessing issue then, if the Tensors look fine?
Layout won’t change anything about the values computed (unless there are bugs). So you should not see any difference here!

Can you make a small code sample that reproduces this?

Can you please try torch.cuda.synchronize() after the copy_ calls? When you are looking at tensors in the debugger you are actually synchronizing, which makes the data observable.
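
A minimal sketch of that suggestion, assuming two GPUs (device strings are illustrative):

import torch

a = torch.empty(2, 10, device="cuda:1")
b = torch.randn(2, 10, device="cuda:0")

a.copy_(b)                        # cross-device copy, asynchronous w.r.t. the host
torch.cuda.synchronize("cuda:1")  # wait until the copy has actually landed on the destination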


Hi,
I found the bug. It was indeed some multiprocessing/threading/multi-CUDA-stream issue, plus the contiguity point we discussed.
Everything is solved and copy_() works. Thanks :slight_smile:


In case it helps someone:
I actually do notice some deadlocks when combining to() and copy_() as mentioned above.
The deadlock happens inside to() (i.e. in compiled CUDA code) at the second call, that is:
to(), copy_(), copy_(), …, copy_(), to(), deadlock.
Anyway, I got frustrated and now I’m using only to(), as it’s quite a minor optimization that I wasted too much time on.

I think it’s related to the RTX 2080 Ti not supporting P2P (I looked at the CUDA code that does the to() a few weeks ago, and I think it assumes something about P2P, but I did not check it thoroughly).

I can’t share my code (yet), but as soon as I can, I’ll share the full example.


Hi,

If you have a repro that deadlocks on given hardware (and is reproducible on another similar card, to be sure it’s not a faulty card), please open an issue on GitHub! Thanks.