Hi, I am looking into different ways to optimize my code's running speed. One of them is the speed of memory transfers between CPU and GPU, and the performance I have measured does not seem to match up with the hardware's theoretical numbers. I have written the following script:
(note: I decided to re-use the same pinned memory buffer, in order to avoid the overhead of re-allocating it over and over again)
```python
import argparse
import time

import torch
from tqdm import trange


def stress_vram_transfer(
    batch_size=10,
    warmup=5,
    repeats=100,
    frame_shape=(3, 3840, 2160),
    use_pinned_memory=True,
):
    tensor = torch.randn((batch_size, *frame_shape))
    if use_pinned_memory:
        # Allocate the pinned staging buffer once, up front.
        tensor = tensor.pin_memory()
    in_loop_tensor = tensor
    for device_id in range(torch.cuda.device_count()):
        print(f"Starting test for device {device_id}: {torch.cuda.get_device_properties(device_id)}")
        for _ in trange(warmup, desc="warmup"):
            in_loop_tensor = in_loop_tensor.cuda()
            if use_pinned_memory:
                # Copy back into the same pinned buffer instead of allocating a new one.
                tensor[:] = in_loop_tensor.cpu()
                in_loop_tensor = tensor
            else:
                in_loop_tensor = in_loop_tensor.cpu()
        start = time.perf_counter()
        for _ in trange(repeats, desc="test"):
            in_loop_tensor = in_loop_tensor.cuda()
            if use_pinned_memory:
                tensor[:] = in_loop_tensor.cpu()
                in_loop_tensor = tensor
            else:
                in_loop_tensor = in_loop_tensor.cpu()
        end = time.perf_counter()
        print(f"Total time taken: {end - start:.2f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=10)
    parser.add_argument("--warmup", type=int, default=5)
    parser.add_argument("--repeats", type=int, default=100)
    parser.add_argument("--frame_shape", type=int, nargs=3, default=(3, 3840, 2160))
    parser.add_argument("--use_pinned_memory", type=bool, default=True)
    parser.add_argument("--no_pin", dest="use_pinned_memory", action="store_false")
    args = parser.parse_args()
    args = dict(vars(args))
    print(args)
    stress_vram_transfer(**args)
```
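As an aside, to sanity-check the loop-level numbers I also wanted to time a single host-to-device copy in isolation. Below is a minimal sketch of how I understand that could be done with CUDA events; I have not benchmarked this variant, and the pre-allocated buffers plus `non_blocking=True` are my assumptions, not something the script above does:

```python
import torch

def time_single_h2d_copy(shape=(50, 3, 3840, 2160)):
    src = torch.randn(shape).pin_memory()     # pinned source buffer on the host
    dst = torch.empty(shape, device="cuda")   # pre-allocated destination on the GPU
    torch.cuda.synchronize()                  # make sure nothing else is in flight

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    dst.copy_(src, non_blocking=True)         # the one H2D copy being measured
    end.record()
    torch.cuda.synchronize()                  # wait for the copy to complete

    ms = start.elapsed_time(end)              # elapsed time in milliseconds
    gb = src.numel() * src.element_size() / 1e9
    print(f"{gb:.2f} GB in {ms:.1f} ms -> {gb / (ms / 1e3):.2f} GB/s")
```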
Running memory_transfer_test.py on an RTX 3090, this is what I get:
```
$ CUDA_VISIBLE_DEVICES=1 python memory_transfer_test.py --batch_size 50 --no_pin
{'batch_size': 50, 'warmup': 5, 'repeats': 100, 'frame_shape': (3, 3840, 2160), 'use_pinned_memory': False}
Starting test for device 0: _CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
warmup: 100%|_____________________________________________________________________________________________________________________________________________| 5/5 [00:13<00:00, 2.71s/it]
test: 100%|___________________________________________________________________________________________________________________________________________| 100/100 [03:42<00:00, 2.22s/it]
Total time taken: 222.21

$ CUDA_VISIBLE_DEVICES=1 python memory_transfer_test.py --batch_size 50
{'batch_size': 50, 'warmup': 5, 'repeats': 100, 'frame_shape': (3, 3840, 2160), 'use_pinned_memory': True}
Starting test for device 0: _CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
warmup: 100%|_____________________________________________________________________________________________________________________________________________| 5/5 [00:13<00:00, 2.64s/it]
test: 100%|___________________________________________________________________________________________________________________________________________| 100/100 [04:23<00:00, 2.63s/it]
Total time taken: 263.24
```
These results are surprising on several fronts:
- The memory transfer speed is MUCH slower than what the hardware promises: nvidia-smi reports 6770MB of VRAM usage for my process, and each step takes about 2.22s to transfer it both ways, which would equate to 2*6770MB/2.22s ≈ 6100MB/s. An RTX 3090 is supposed to have 936.2 GB/s of memory bandwidth: even if I divide that by 4 to account for my card only having access to 4 PCIe lanes instead of the maximum of 16, I am still faced with at least a 10x discrepancy in that memory bandwidth (see the sketch after this list for the arithmetic).
- The `use_pinned_memory` version actually performs slower than the `no_pin` version, even though, according to the PyTorch documentation, `pin_memory()` is recommended for faster Host-to-GPU copies…
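For concreteness, here is my arithmetic written out; the payload is derived from the tensor shape assuming float32, which comes out smaller than the nvidia-smi figure (that number presumably also counts the CUDA context):

```python
# Bandwidth arithmetic for the batch_size=50 runs, assuming float32 tensors.
batch, c, h, w = 50, 3, 3840, 2160
payload = batch * c * h * w * 4      # bytes moved per direction
print(payload / 1e9)                 # ~4.98 GB of actual tensor data

step = 2.22                          # seconds per iteration in the no_pin run
print(2 * payload / step / 1e9)      # ~4.5 GB/s across the bus per step

# For scale: a PCIe 4.0 x4 link tops out around ~7.9 GB/s theoretical
# (roughly half that for PCIe 3.0 x4).
```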
Given these observations, I have the following questions:
- Am I wrong to expect the memory bandwidth advertised for my GPU to match up with the tensor transfer speeds in PyTorch?
- Am I using `pin_memory()` correctly in this test? From my understanding, a major advantage this function brings is asynchronous GPU copies, but that does not seem exploitable in this scenario (see the sketch after this list for the pattern I have in mind).
- If using `pin_memory()` is supposed to help with "Host to GPU copies", does it also apply to GPU-to-Host copies? Or is there a separate trick for that direction?
- If the information from point 3 on this page is to be believed, PyTorch will actually copy a non-pinned tensor to pinned memory before copying it to the GPU. Is there any reason to expect that to be slower than manually copying the tensor to pinned memory and then asking PyTorch to copy it to the GPU?
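To make questions 2-4 concrete, this is the kind of round trip I understand the pinned-memory recommendation to be pointing at. It is only a sketch under my assumptions: the explicit staging `copy_` and the `non_blocking=True` flags are my reading of the docs, not something I have verified to be faster.

```python
import torch

shape = (50, 3, 3840, 2160)
src = torch.randn(shape)                    # ordinary pageable tensor
pinned = torch.empty(shape).pin_memory()    # staging buffer, allocated once and reused
gpu = torch.empty(shape, device="cuda")     # pre-allocated device buffer

def round_trip(src, pinned, gpu):
    """One H2D + D2H round trip through the reusable pinned staging buffer."""
    pinned.copy_(src)                       # pageable -> pinned (plain CPU-side copy)
    gpu.copy_(pinned, non_blocking=True)    # H2D: can be async since the source is pinned
    pinned.copy_(gpu, non_blocking=True)    # D2H: same question applies in this direction
    torch.cuda.synchronize()                # must wait before reading pinned on the CPU
    return pinned

out = round_trip(src, pinned, gpu)
```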