Copy_() and memory format

I know that in PyTorch 1.5, to() and clone() can preserve memory format, so we can send non-contiguous tensors between devices.

I wonder what the situation is for copy_(): can we send non-contiguous tensors with it?

If not, is there any suggested workaround for avoiding an extra copy? For example:

import torch

a = torch.randn(2, 10, device="cuda:1").share_memory_()  # contiguous destination on cuda:1
b = torch.randn(10, 2, device="cuda:0")
b = torch.transpose(b, 0, 1)  # non-contiguous view, shape (2, 10)
a.copy_(b)  # is it OK?

In the example above, we want to avoid to()/clone() so that we don't create a new tensor and then have to move it into shared memory.


Hi,

.copy_() will not change the contiguity of any Tensor.
It will just read the content from b and write it to a, without changing a's size or strides.


a.copy_ is an in-place operation and it never changes the strides (or memory format) of a. So the result of a.copy_(b) is going to be a with the data of b and the strides of a.
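
A minimal sketch of this behaviour (shapes are illustrative):

import torch

a = torch.empty(2, 3)        # contiguous destination, strides (3, 1)
b = torch.randn(3, 2).t()    # non-contiguous view, strides (1, 2)

a.copy_(b)                     # element-wise copy of b's values into a
print(a.stride(), b.stride())  # (3, 1) (1, 2): a keeps its own strides
print(torch.equal(a, b))       # True: the contents match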


Thanks! Is there a way to copy the stride?

If a is the same size as b, you can aggressively restride a with a.as_strided_(b.shape, b.stride()) and then do a.copy_(b) as the next step. But this will surely break autograd.
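
A rough sketch of that restriding pattern, assuming two GPUs are available and a's storage is large enough for b's strides (names and shapes are illustrative):

import torch

b = torch.randn(10, 2, device="cuda:0").transpose(0, 1)  # non-contiguous, shape (2, 10)
a = torch.empty(2, 10, device="cuda:1")                   # contiguous destination

a.as_strided_(b.shape, b.stride())  # give a the same sizes/strides as b (no checks!)
a.copy_(b)                          # then copy the values across devices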


For (good) reasons, .as_strided() is actually supported by autograd, so that will work :smiley:

But even beyond that, here, since you overwrite all the content of a with the copy anyway, all the gradients will flow towards b and the original value of a will just get zeros.
So autograd will work just fine :slight_smile:
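
A small sketch of that gradient flow (names and shapes are illustrative):

import torch

b = torch.randn(2, 3, requires_grad=True)
a = torch.empty(3, 2).t()   # non-contiguous destination, does not require grad

a.copy_(b)                  # a now carries a grad history pointing at b
a.sum().backward()
print(b.grad)               # all ones: the gradients flow towards b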


I somehow still get data corruption, doing just the forward pass and calling as_strided_ when the sizes are equal.
If I replace copy_() with to(), it's totally OK.

I wonder if it's related to CUDA streams or something?
a.copy_(b) when a, b are on different devices?

Well, having the same size doesn't mean that the as_strided will be valid.
Is a contiguous? If a already has some overlapping memory, then its backing memory won't be big enough :confused:
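
For reference, a hypothetical helper (not a PyTorch API) that sketches this check, assuming a has storage_offset 0 and b has non-negative strides:

def restride_is_safe(a, b):
    # Largest flat index that a.as_strided_(b.shape, b.stride()) would touch, plus one.
    needed = 1 + sum((size - 1) * stride for size, stride in zip(b.shape, b.stride()))
    return a.storage().size() >= needed

# e.g. guard the restriding above:
# if restride_is_safe(a, b):
#     a.as_strided_(b.shape, b.stride()).copy_(b)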

I made sure it's contiguous too.
Here is what I did:

device = ...
ranks = [....]
saved = [None for rank in ranks]
a = saved[rank]
if (a is not None
        and b.size() == a.size()
        and b.storage_offset() == a.storage_offset()
        and b.stride() == a.stride()
        and b.is_contiguous() and a.is_contiguous()):
    # no need to call as_strided_
    a.copy_(b)
else:
    a = b.to(device)
    saved[rank] = a

When we replace that ugly if with if False:, everything works.
(What happens next is sending a through a queue, cloning it on the receiver, and then a normal neural net.)

One of the risks of as_strided is that you can do fairly bad stuff. In particular, there are no checks and you can end up reading out of bounds of the original Tensor’s values (or even out of the Storage backing it).

In the code above, if you just do .to() the first time and .copy_() afterwards, the layout of a will just be the layout of the first b. Is that not OK?
Why is it so important to keep b’s layout at each iteration?

I thought it would be OK (that’s exactly what I did at first, just the is not None check), but it didn’t work.
Then I gradually added more checks.

So you mean this does not work?

device = ...
ranks = [....]
saved = [None for rank in ranks]
a = saved[rank]
if a is not None:
    a.copy_(b)
else:
    a = b.to(device)
    saved[rank] = a

What is the issue you see with this?

exactly.

I tried two networks.

  1. WideResNet: it works fine with copy_(). All b tensors are contiguous there.
  2. GPT-2 (from HuggingFace): it does not work. I know that some b tensors are non-contiguous.

For GPT-2 the task is zero-shot evaluation on WikiText-2.
With to(device) I recover the perplexity from the paper (~29).
With copy_() the perplexity explodes (crazy high numbers, like 8400, 121931, and so on).

I tried to look at the tensors with the debugger; they look fine.

Oh OK… that looks more like a multiprocessing issue then, if the Tensors look fine?
Layout won’t change anything about the values computed (unless there are bugs). So you should not see any difference here!

Can you make a small code sample that reproduces this?

Can you please try torch.cuda.synchronize() after the copy_ calls? When you are looking at tensors in the debugger you are actually synchronizing, which makes the data observable.
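
A minimal sketch of that suggestion, assuming two GPUs (device strings are illustrative):

import torch

a = torch.empty(2, 10, device="cuda:1")
b = torch.randn(2, 10, device="cuda:0")

a.copy_(b)                        # cross-device copy, asynchronous w.r.t. the host
torch.cuda.synchronize("cuda:1")  # wait until the copy has actually landed on the destination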


Hi,
I found the bug. It was indeed some multiprocessing/threading/multi-CUDA-stream issue, plus the contiguity point we discussed.
Everything is solved and copy_() works. Thanks :slight_smile:


In case it helps someone:
I actually do notice some deadlocks when combining to() and copy_() as mentioned above.
The deadlock happens inside to() (i.e. in compiled CUDA code) at the second call, that is:
to(), copy_(), copy_(), …, copy_(), to(), deadlock.
Anyway, I got frustrated and now I’m using only to(), as it’s quite a minor optimization that I wasted too much time on.

I think it’s related to the RTX 2080 Ti not supporting P2P (I looked at the CUDA code that does the to() a few weeks ago, and I think it assumes something about P2P, but I did not check it thoroughly).

I can’t share my code (yet), but as soon as I can, I’ll share the full example.


Hi,

If you have a repro that deadlocks on given hardware (and is reproducible on another similar card, to be sure it’s not a faulty card), please open an issue on GitHub! Thanks.