When should I set pin_memory to True?

I ran into an error when I set pin_memory = True; the error reads as follows:
"RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation."
But when I set pin_memory = False, everything works. I would like to know the reason.

torch == 1.6.0, cuda == 10.1

The error message is raised if you try to write data into an expanded tensor. expand() does not allocate the actual memory; it only uses the shape and strides to expand the size, so multiple elements of the view refer to the same memory location.
I don’t know in which context pin_memory=True might raise this error, so you might need to check your Dataset and whether overlapping memory is used there.
Also, could you update to the latest version and rerun your script?
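The overlapping-memory situation can be reproduced without a DataLoader at all. A minimal sketch (the tensor names are illustrative):

```python
import torch

# expand() returns a view: shape (4,) but still backed by a single element,
# so all four positions alias the same memory location.
base = torch.zeros(1)
expanded = base.expand(4)

try:
    expanded += 1  # in-place write into overlapping memory
except RuntimeError as err:
    print("write rejected:", err)

# clone() materializes the memory, so the write succeeds afterwards.
writable = expanded.clone()
writable += 1
print(writable)  # tensor([1., 1., 1., 1.])
```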


After reading this thread, I am still unclear on two things.

  1. Theoretically, pin_memory=True should speed things up, but everyone is reporting slow-downs. Why is this happening, and what can be done to achieve the expected speed-up?
  2. @ptrblck suggested using non_blocking=True in the to() operation on the tensors in Datasets, and it didn’t seem to work. So, will someone who has tried it tell me what works (speeds things up) and what doesn’t?
  1. I don’t think that’s the case, and async host-to-device (H2D) copies can speed up your code. If they don’t, you should profile the workload via the PyTorch profiler or e.g. Nsight Systems to see where the bottleneck in your code is and whether an async data copy is even expected to speed up the use case. You should also check the overall memory usage of your system and make sure you are not holding “too much” page-locked memory, as your OS and all other applications won’t be able to use this memory anymore and their performance might suffer.

  2. Could you explain what didn’t seem to work? Did you profile the workload and see copies into pageable memory instead of pinned memory?
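As a hedged sketch of the profiling suggestion (assuming a recent PyTorch, since torch.profiler landed in 1.8; kept CPU-only here so it runs anywhere — on a GPU box you would add ProfilerActivity.CUDA and look for the memcpy entries):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(512, 512)

# Profile a few iterations of stand-in work; with ProfilerActivity.CUDA added,
# the table would also show host-to-device copy times, which is where pinned
# vs. pageable memory makes a difference.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        y = x @ x

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```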


This is my collate function for the data loader:

import torch

device = "cuda"
def collate_fn(batch):
    x = torch.stack([torch.from_numpy(item[:-1]) for item in batch])
    y = torch.stack([torch.from_numpy(item[1:]) for item in batch])
    if device == "cuda":
        return x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    else:
        return x.to(device), y.to(device)

And Initialised the data loader with pin_memory=True:

train_dataloader = DataLoader(train_dataset, batch_sampler=train_batch_sampler, collate_fn=collate_fn, pin_memory=True)

It gives me this error: RuntimeError: cannot pin 'torch.cuda.IntTensor' only dense CPU tensors can be pinned. I then found out that the DataLoader copies the host data into the pinned buffer after collate_fn returns, by which point the tensors have already been moved to the GPU.

So I changed the collate_fn to pin the tensors on the CPU first, then move them to the device, and set pin_memory to False.

device = "cuda"
def collate_fn(batch):
    x = torch.stack([torch.from_numpy(item[:-1]) for item in batch])
    y = torch.stack([torch.from_numpy(item[1:]) for item in batch])
    if device == "cuda":
        return x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        return x.to(device), y.to(device)
train_dataloader = DataLoader(train_dataset, batch_sampler=train_batch_sampler, collate_fn=collate_fn, pin_memory=False)

And I checked if it works by

x, y = next(iter(train_dataloader))
print(x.is_pinned())
print(x.device)

but it gives me this

False
cuda:0

What I expected:

True
cuda:0

It doesn’t pin the tensor, but it does move the tensor to the CUDA device. I wanted it to first allocate a pinned host array and then transfer the data from the pinned array to device memory. Am I doing anything wrong, or is there anything I should do to accomplish that?
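For what it’s worth, x.is_pinned() being False here is expected: x is already a CUDA tensor, and pinning only ever applies to the intermediate host tensor, which is consumed by the copy. The usual pattern is to keep collate_fn on the CPU, let the DataLoader do the pinning, and move to the device in the training loop. A minimal sketch (the toy dataset and names are illustrative, and it is guarded so it also runs on a CPU-only box):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Toy dataset: each item is a 1-D numpy array, as in the posts above.
data = [np.arange(10, dtype=np.int64) for _ in range(8)]

def cpu_collate_fn(batch):
    x = torch.stack([torch.from_numpy(item[:-1]) for item in batch])
    y = torch.stack([torch.from_numpy(item[1:]) for item in batch])
    return x, y  # stay on the CPU so the DataLoader can pin the batch

use_cuda = torch.cuda.is_available()
loader = DataLoader(data, batch_size=4, collate_fn=cpu_collate_fn,
                    pin_memory=use_cuda)

device = "cuda" if use_cuda else "cpu"
for x, y in loader:
    # non_blocking only helps when the source batch is in pinned host memory
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    print(x.shape, y.shape)  # torch.Size([4, 9]) torch.Size([4, 9])
    break
```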