I seem to be bottlenecked by shared-memory transfers. For example, this loop is remarkably slow:
import time
import torch

for _ in range(64):
    start = time.time()
    x = torch.zeros([int(1e9)], dtype=torch.uint8)
    y = x.share_memory_()
    dt = time.time() - start
    print('dt=', dt)
    del x, y
Some benchmarks, given a 1 GB uint8 tensor:
- .clone(): 10 GB/s
- .cuda(): 5 GB/s
- .pin_memory(): 10 GB/s
- .cuda() [when pinned]: 10 GB/s
- .share_memory_(): 1 GB/s
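For reproducibility, the numbers above can be gathered with one small harness instead of separate loops. This is just a sketch: the `bench` helper and the 100 MB default size are my own choices, and the CUDA-dependent ops are skipped when no GPU is present.

```python
import time
import torch

def bench(fn, n=int(1e8), reps=3):
    """Time `fn` on a fresh n-byte uint8 tensor; return best observed GB/s."""
    best = float('inf')
    for _ in range(reps):
        x = torch.zeros([n], dtype=torch.uint8)
        start = time.time()
        fn(x)
        best = min(best, time.time() - start)
    return n / best / 1e9

# CPU-only ops from the list above; .pin_memory()/.cuda() need a CUDA device.
ops = [
    ('.clone()', lambda x: x.clone()),
    ('.share_memory_()', lambda x: x.share_memory_()),
]
if torch.cuda.is_available():
    ops += [
        ('.pin_memory()', lambda x: x.pin_memory()),
        ('.cuda()', lambda x: x.cuda()),
    ]

for name, fn in ops:
    print(f'{name}: {bench(fn):.1f} GB/s')
```

Taking the best of a few repetitions smooths out allocator and page-fault noise, which otherwise dominates the first iteration.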