My high-level understanding of pinned memory is that it speeds up data transfer from CPU to GPU…at least in some cases. I understand it is commonly used in dataloaders when copying loaded data from host to device.
When else would this be useful? I have been trying to use the tensor pin_memory()
function, but I'm not seeing a significant speedup when copying a large matrix to the GPU.
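For reference, this is the dataloader pattern I mean; a minimal sketch (the dataset here is a stand-in built from random tensors, and the shapes are arbitrary):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset, just for illustration.
data = torch.rand(1024, 32)
labels = torch.randint(0, 10, (1024,))
dataset = TensorDataset(data, labels)

# pin_memory=True makes the loader stage each batch in page-locked
# (pinned) host memory, which is what allows the later host-to-device
# copy to be asynchronous.
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # non_blocking=True only actually overlaps with compute when the
    # source is pinned and the destination is a CUDA device.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward/backward would go here ...
    break
```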
This is my testing code:
import torch
import time
# Warm up
q = torch.rand(10000, 10000).cuda()
w = torch.rand(10000, 10000).cuda()
for i in range(10):
    qq = q * w
# Test pinning
b = torch.arange(1000000).pin_memory()
c = torch.arange(1000000)
print("BEFORE")
print("b is pinned?: ", b.is_pinned())
print("c is pinned?: ", c.is_pinned())
print("b is cuda?: ", b.is_cuda)
print("c is cuda?: ", c.is_cuda)
print("\nRESULTS")
torch.cuda.synchronize()
s = time.time()
b[:] = c # <<<<<< Time goes down without this, obviously, but what good is pinned memory if it always points to the same stuff?
d = b.to(torch.device("cuda"), non_blocking=True)
torch.cuda.synchronize()
print("Copy pinned (non-blocking): ", time.time() - s)
torch.cuda.synchronize()
s = time.time()
e = b.to(torch.device("cuda"), non_blocking=False)
torch.cuda.synchronize()
print("Copy pinned (blocking): ", time.time() - s)
torch.cuda.synchronize()
s = time.time()
f = c.to(torch.device("cuda"))
torch.cuda.synchronize()
print("Copy unpinned: ", time.time() - s)
print("\nAFTER")
print("b is pinned?: ", b.is_pinned())
print("c is pinned?: ", c.is_pinned())
print("d is pinned?: ", d.is_pinned())
print("e is pinned?: ", e.is_pinned())
print("f is pinned?: ", f.is_pinned())
print("b is cuda?: ", b.is_cuda)
print("c is cuda?: ", c.is_cuda)
print("d is cuda?: ", d.is_cuda)
print("e is cuda?: ", e.is_cuda)
print("f is cuda?: ", f.is_cuda)
Here are the results on a 2080 Ti:
BEFORE
b is pinned?: True
c is pinned?: False
b is cuda?: False
c is cuda?: False
RESULTS
Copy pinned (non-blocking): 0.0015006065368652344
Copy pinned (blocking): 0.0006394386291503906
Copy unpinned: 0.0007956027984619141
AFTER
b is pinned?: True
c is pinned?: False
d is pinned?: False
e is pinned?: False
f is pinned?: False
b is cuda?: False
c is cuda?: False
d is cuda?: True
e is cuda?: True
f is cuda?: True
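As an aside, for copies this small, time.time() around synchronize() calls can be dominated by launch/sync overhead rather than the transfer itself. CUDA events time the copy on the device timeline instead; a sketch (the helper name time_h2d_copy is my own, not a PyTorch API):

```python
import torch

def time_h2d_copy(src, iters=10):
    """Time a host-to-device copy with CUDA events (ms per copy).

    Events record timestamps on the GPU timeline, so short async
    copies are measured without host-side sync noise.
    """
    if not torch.cuda.is_available():
        return None  # nothing to measure without a GPU
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    dst = torch.empty_like(src, device="cuda")
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Compare pinned vs. pageable source buffers (pin only if CUDA exists).
src_pageable = torch.rand(1_000_000)
src_pinned = src_pageable.pin_memory() if torch.cuda.is_available() else src_pageable
# print(time_h2d_copy(src_pinned), time_h2d_copy(src_pageable))
```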
Now, I would have expected the non-blocking copy from pinned memory to be fastest, but it's actually slower than the plain copy. The real time sink is the in-place assignment of new data (noted in the code above), but then what good is a pinned-memory tensor if its data never changes? In this example I am treating b
like a pinned-memory staging buffer of sorts. Is this the wrong way to use, or think about, pinned memory?
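To make the staging-buffer idea concrete, here is the pattern I have in mind: one pinned buffer is reused across iterations, refilled with copy_() and then shipped off with a non-blocking transfer. This is a hedged sketch, not a claim that it is the canonical usage; the sync per iteration is there so the next refill cannot race the in-flight copy:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# One reusable staging buffer (pin it only when CUDA is present).
staging = torch.empty(1_000_000)
if torch.cuda.is_available():
    staging = staging.pin_memory()

gpu_chunks = []
for step in range(3):
    new_data = torch.rand(1_000_000)   # stand-in for freshly loaded data
    staging.copy_(new_data)            # refill the same pinned buffer
    # The async host-to-device copy is the part pinning accelerates;
    # it can overlap with GPU kernels already in flight.
    gpu_chunks.append(staging.to(device, non_blocking=True))
    if torch.cuda.is_available():
        # Before reusing the staging buffer, make sure the in-flight
        # copy has finished reading from it.
        torch.cuda.synchronize()
```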