Copy to shared memory remarkably slow

I seem to be bottlenecked by transfers to shared memory. For example, this loop is remarkably slow:

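import time

import torch

for _ in range(64):
    start = time.time()
    # The timed region covers both the 1 GB allocation and the move to shared memory.
    x = torch.zeros([int(1e9)], dtype=torch.uint8)
    y = x.share_memory_()
    dt = time.time() - start
    print('dt=', dt)
    del x, y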

Some benchmarks for a 1 GB uint8 tensor:
.clone(): 10GB/s
.cuda(): 5GB/s
.pin_memory(): 10GB/s
.cuda() [when pinned]: 10GB/s
.share_memory_(): 1GB/s :red_circle:
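
For anyone who wants to reproduce numbers like these, here is a rough sketch of how they could be measured. It is not the exact script behind the figures above: `throughput_gbps` and `run_benchmarks` are hypothetical helper names, 1 GB is taken as 1e9 bytes, and the best of a few repeats is reported. Because `share_memory_()` works in place and does nothing once a tensor is already shared, that case clones first, so its figure includes the clone time.

```python
import time


def throughput_gbps(fn, nbytes, repeats=5):
    """Return the best observed throughput of fn() in GB/s (1 GB = 1e9 bytes)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very fast calls.
    return nbytes / max(best, 1e-12) / 1e9


def run_benchmarks(nbytes=int(1e9)):
    """Time the copies listed in the post; requires PyTorch."""
    import torch  # imported lazily so the timing helper works without it

    x = torch.zeros([nbytes], dtype=torch.uint8)
    results = {
        "clone": throughput_gbps(lambda: x.clone(), nbytes),
        # Clone first: share_memory_() is in place and a no-op on an
        # already-shared tensor, so this figure includes the clone time.
        "clone + share_memory_": throughput_gbps(
            lambda: x.clone().share_memory_(), nbytes
        ),
    }
    if torch.cuda.is_available():
        def to_gpu():
            x.cuda()
            torch.cuda.synchronize()  # wait for the copy to finish

        to_gpu()  # warm-up so CUDA context creation is not timed
        results["cuda"] = throughput_gbps(to_gpu, nbytes)
        results["pin_memory"] = throughput_gbps(lambda: x.pin_memory(), nbytes)
    return results
```

Calling `run_benchmarks()` returns a dict mapping each operation to GB/s. Pass a smaller `nbytes` if the machine is short on RAM or `/dev/shm` space.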