Efficient GPU Data Movement?

Normally I move all of my data to the GPU before training, but I have a dataset that is too big for the GPU’s memory (it easily fits into system RAM). So rather than moving the entire dataset to the GPU, I changed my code to only move the mini-batches to the GPU as needed. I’ve done some testing with datasets that fit entirely on the GPU, and I find that moving the mini-batches to the GPU as needed greatly increases training time (about 1.5x for my data). I thought that if I moved larger chunks of data to the GPU at a time and then split the chunks into mini-batches, it might help. However, this turned out to be about twice as slow as just moving the mini-batches as needed. Any suggestions on a better way to move data to the GPU would be appreciated.

Original Approach:

    X = torch.from_numpy(X).cuda()
    y = torch.from_numpy(y).cuda()
    tensor_dataset = torch.utils.data.TensorDataset(X, y)
    train_loader = torch.utils.data.DataLoader(tensor_dataset, 256, shuffle=True)
    for epoch in range(1, n_epochs+1):
        for batch, (data, target) in enumerate(train_loader, 1):
            model.train()
            optimizer.zero_grad()
            outputs = model.forward(data)
            loss = loss_function(outputs, target)
            loss.backward()
            optimizer.step()

Mini Batch Approach:

    X = torch.from_numpy(X)
    y = torch.from_numpy(y)
    tensor_dataset = torch.utils.data.TensorDataset(X, y)
    train_loader = torch.utils.data.DataLoader(tensor_dataset, 256, shuffle=True)
    for epoch in range(1, n_epochs+1):
        for batch, (data, target) in enumerate(train_loader, 1):
            data = data.cuda()
            target = target.cuda()
            model.train()
            optimizer.zero_grad()
            outputs = model.forward(data)
            loss = loss_function(outputs, target)
            loss.backward()
            optimizer.step()

Chunked Approach:

    X = torch.from_numpy(X)
    y = torch.from_numpy(y)
    tensor_dataset = torch.utils.data.TensorDataset(X, y)
    chunk_loader = torch.utils.data.DataLoader(tensor_dataset, 100000, shuffle=True)
    for epoch in range(1, n_epochs+1):
        for chunk_data, chunk_target in chunk_loader:
            chunk_data = chunk_data.cuda()
            chunk_target = chunk_target.cuda()
            tmp_tensor_dataset = torch.utils.data.TensorDataset(chunk_data, chunk_target)
            train_loader = torch.utils.data.DataLoader(tmp_tensor_dataset, 256, shuffle=False)
            for batch, (data, target) in enumerate(train_loader, 1):
                model.train()
                optimizer.zero_grad()
                outputs = model.forward(data)
                loss = loss_function(outputs, target)
                loss.backward()
                optimizer.step()

Could you try using multiple workers in your DataLoader (num_workers > 0) as well as pin_memory=True, and moving the data with data = data.to('cuda', non_blocking=True)?
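For example, something along these lines (just a sketch; num_workers=4 is an illustrative value you would tune for your machine):

    train_loader = torch.utils.data.DataLoader(
        tensor_dataset,
        batch_size=256,
        shuffle=True,
        num_workers=4,    # worker processes prepare batches in parallel
        pin_memory=True,  # batches are collated into page-locked (pinned) memory
    )

    for epoch in range(1, n_epochs + 1):
        for data, target in train_loader:
            # with a pinned source, the copy can overlap with GPU compute
            data = data.to('cuda', non_blocking=True)
            target = target.to('cuda', non_blocking=True)
            # ... training step as in your loops above ...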

If the problem really is data movement, how would setting num_workers help since the DataLoader does not move the data to the GPU?

I’ve considered writing a custom torch.utils.data.Dataset that moves the data to the GPU before returning it, and then using a DataLoader with multiple workers, but this seems like it would cause a lot of memory fragmentation on the GPU.
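Roughly something like this (untested sketch; CudaDataset is just an illustrative name):

    class CudaDataset(torch.utils.data.Dataset):
        # returns samples that have already been copied to the GPU
        def __init__(self, X, y, device='cuda'):
            self.X = X
            self.y = y
            self.device = device

        def __len__(self):
            return self.X.size(0)

        def __getitem__(self, idx):
            # each individual sample is copied to the GPU here; using this with
            # num_workers > 0 would also require the 'spawn' start method, since
            # CUDA cannot be re-initialized in forked worker processes
            return self.X[idx].to(self.device), self.y[idx].to(self.device)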

Based on the description and the provided code, I doubt the transfer is the bottleneck; the data loading itself seems more likely. But correct me if I’m wrong.

Why would this cause a lot of fragmentation?
I wouldn’t recommend this approach in general, as multiprocessing and CUDA calls might be hard to handle.

The data is loaded completely into RAM and stored in X and y initially. The problem is that the data does not fit entirely on the GPU. The only difference between the first block of code and the second is when the data is moved to the GPU. I’ve tested with subsets of my large dataset that fit into GPU memory and found that the second block of code is about 1.5 times slower than the first.

I just assumed that it would not be the best use of GPU memory to have the torch.utils.data.Dataset move each sample to the GPU when it was requested. Currently, 128 samples are moved to the GPU at a time.

Thanks for the information!
If you are preloading all samples into RAM, you might also try simply indexing the tensors in your loop and pushing each chunk to the GPU. I assume you don’t apply any transformations etc. in your Dataset.
If that’s the case, the “manual” indexing might be a bit faster, as the DataLoader can add some overhead when loading each sample.
Could you try that and check the performance?

Also, you could try to call pin_memory() on the data tensors and use .to("cuda", non_blocking=True).
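Putting those two ideas together, a rough (untested) sketch of what I mean, with example chunk_size and batch_size values:

    chunk_size = 100000
    batch_size = 256
    for epoch in range(1, n_epochs + 1):
        # shuffle once per epoch, then slice out large chunks manually
        perm = torch.randperm(X.size(0))
        for start in range(0, X.size(0), chunk_size):
            idx = perm[start:start + chunk_size]
            # pin each chunk so the host-to-device copy can run asynchronously
            chunk_x = X[idx].pin_memory().to('cuda', non_blocking=True)
            chunk_y = y[idx].pin_memory().to('cuda', non_blocking=True)
            # mini-batches are now just views into GPU memory
            for b in range(0, chunk_x.size(0), batch_size):
                data = chunk_x[b:b + batch_size]
                target = chunk_y[b:b + batch_size]
                model.train()
                optimizer.zero_grad()
                outputs = model(data)
                loss = loss_function(outputs, target)
                loss.backward()
                optimizer.step()

This avoids the per-sample indexing and collate overhead of the DataLoader, since each whole chunk is indexed at once.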