When is pinning memory useful for tensors (beyond dataloaders)?

My high-level understanding of pinned memory is that it speeds up data transfer from CPU to GPU… in some cases. I understand it is commonly used in dataloaders when copying loaded data from host to device.

When else would this be useful? I have been trying to use the tensor's pin_memory() method, but I'm not seeing a significant speed-up when copying a large matrix to the GPU.

This is my testing code:

import torch
import time

# Warm up the GPU so CUDA context creation isn't counted in the timings below
q = torch.rand(10000, 10000).cuda()
w = torch.rand(10000, 10000).cuda()
for i in range(10):
    qq = q * w

# Test pinning: b lives in pinned (page-locked) host memory, c in ordinary pageable memory
b = torch.arange(1000000).pin_memory()
c = torch.arange(1000000)
print("BEFORE")
print("b is pinned?: ", b.is_pinned())
print("c is pinned?: ", c.is_pinned())
print("b is cuda?: ", b.is_cuda)
print("c is cuda?: ", c.is_cuda)

print("\nRESULTS")
torch.cuda.synchronize()
s = time.time()
b[:] = c # <<<<<< Time goes down without this, obviously, but what good is pinned memory if it always points to the same stuff?
d = b.to(torch.device("cuda"), non_blocking=True)
torch.cuda.synchronize()
print("Copy pinned (non-blocking): ", time.time() - s)

torch.cuda.synchronize()
s = time.time()
e = b.to(torch.device("cuda"), non_blocking=False)
torch.cuda.synchronize()
print("Copy pinned (blocking): ", time.time() - s)

torch.cuda.synchronize()
s = time.time()
f = c.to(torch.device("cuda"))
torch.cuda.synchronize()
print("Copy unpinned: ", time.time() - s)

print("\nAFTER")
print("b is pinned?: ", b.is_pinned())
print("c is pinned?: ", c.is_pinned())
print("d is pinned?: ", d.is_pinned())
print("e is pinned?: ", e.is_pinned())
print("f is pinned?: ", f.is_pinned())
print("b is cuda?: ", b.is_cuda)
print("c is cuda?: ", c.is_cuda)
print("d is cuda?: ", d.is_cuda)
print("e is cuda?: ", e.is_cuda)
print("f is cuda?: ", f.is_cuda)

Here are the results on a 2080 Ti:

BEFORE
b is pinned?:  True
c is pinned?:  False
b is cuda?:  False
c is cuda?:  False

RESULTS
Copy pinned (non-blocking):  0.0015006065368652344
Copy pinned (blocking):  0.0006394386291503906
Copy unpinned:  0.0007956027984619141

AFTER
b is pinned?:  True
c is pinned?:  False
d is pinned?:  False
e is pinned?:  False
f is pinned?:  False
b is cuda?:  False
c is cuda?:  False
d is cuda?:  True
e is cuda?:  True
f is cuda?:  True

Now I would have expected the non-blocking pinned copy to be fastest, but it's actually slower than the plain unpinned copy. The real time sink is the in-place assignment of new data (noted in the code above), but what good, then, is a pinned-memory tensor if the data is never going to change? In this example, I am treating b like a pinned staging buffer of sorts. Is this the wrong way to use/think about pinned memory?
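
For reference, here is how one could isolate just the transfer (a sketch using CUDA events rather than time.time(), with the buffer fill kept outside the timed region):

import torch

# Sketch: time just the host-to-device transfer with CUDA events,
# keeping the (slow) host-side buffer fill outside the timed region.
b = torch.arange(1000000).pin_memory()
c = torch.arange(1000000)
b[:] = c  # refill the pinned staging buffer (not timed)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
d = b.to(torch.device("cuda"), non_blocking=True)
end.record()
torch.cuda.synchronize()
print("Pinned H2D transfer (ms):", start.elapsed_time(end))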


Using pinned memory allows you to copy the data asynchronously to the device, so the copy won't block the host and can overlap with work already queued on the GPU. The transfer bandwidth itself is limited by your hardware and the connection to your GPU (e.g. the PCIe link); pinning memory cannot exceed these hardware limitations.
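
As a minimal sketch of what this enables (the side stream, shapes, and buffer names here are purely illustrative): a copy from a pinned host buffer can run on a second stream and overlap with a matmul on the default stream.

import torch

copy_stream = torch.cuda.Stream()  # side stream for the transfer

host_buf = torch.randn(10000, 10000).pin_memory()  # page-locked host buffer
dev_buf = torch.empty(10000, 10000, device="cuda")

a = torch.rand(10000, 10000, device="cuda")
out = a @ a  # queued on the default stream; the host returns immediately

# Only a pinned source makes this copy truly asynchronous; with pageable
# memory, CUDA falls back to a synchronous, staged transfer.
with torch.cuda.stream(copy_stream):
    dev_buf.copy_(host_buf, non_blocking=True)  # DMA can overlap with the matmul

torch.cuda.current_stream().wait_stream(copy_stream)  # default stream waits for the copy
torch.cuda.synchronize()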


Gotcha, so what sort of circumstance would otherwise lead to the copy blocking? I know pinned memory is recommended for loading data directly to the GPU, but it's still not abundantly clear to me how it helps.

Is it primarily a way for the data loader to prefetch the next batch onto the GPU while the current batch is being processed? For example, suppose you have a network that performs some inference task on the GPU. Without pinned memory, execution would be sequential:

  1. Load batch to GPU
  2. Execute inference
  3. Load next batch to GPU

Do I understand correctly that with pinned memory, we would have

  1. Load first batch to GPU
  2. (concurrent with 3) Execute inference
  3. (concurrent with 2) Load next batch onto GPU

Is that the general idea? If so, does it only execute the asynchronous copy if there is enough GPU RAM to accommodate it?
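
Something like this sketch is what I have in mind (loader, model, and the stage helper are hypothetical names, not a real API):

import torch

# Hypothetical prefetch loop: stage batch i+1 on a side stream while
# the model runs on batch i. `loader` and `model` are placeholders.
copy_stream = torch.cuda.Stream()

def stage(batch_cpu):
    # The pinned source buffer is what makes this copy truly asynchronous.
    return batch_cpu.pin_memory().to("cuda", non_blocking=True)

batches = iter(loader)
with torch.cuda.stream(copy_stream):
    next_batch = stage(next(batches))

for batch_cpu in batches:
    torch.cuda.current_stream().wait_stream(copy_stream)  # staged copy must be done
    current = next_batch
    current.record_stream(torch.cuda.current_stream())    # allocator bookkeeping
    with torch.cuda.stream(copy_stream):
        next_batch = stage(batch_cpu)                     # overlaps with model(current)
    out = model(current)                                  # compute on the default stream

torch.cuda.current_stream().wait_stream(copy_stream)
out = model(next_batch)                                   # last prefetched batch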
