Runtime Problem with .cuda()

Since I started using the PyTorch DataLoader, I've been running into runtime problems with .cuda(). My dataset consists of 250,000 .npy files, each containing a NumPy array of shape 33x27. I'm using the DataLoader as follows:

# list containing all file paths
train_file_paths = getPaths(self.dir_training)

trainDataSet = IterDataset(feature_path=train_file_paths)

train_loader = utils.DataLoader(dataset=trainDataSet, batch_size=32,
                                shuffle=False, num_workers=16, pin_memory=True)

My training loop fetches a new batch (batch size: 32) on each iteration and moves it to the GPU via .cuda(). The model is moved to the GPU at the beginning of the script.

for i, (feature, labels) in enumerate(train_loader):

    feature = Variable(feature.cuda(), requires_grad=True)
    labels = Variable(labels.cuda(), requires_grad=True)

    outputs = model(feature.float())

    loss = criterion(outputs, labels.long())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

My Dataset class looks like the following (the first column of the ndarray holds the label):

import numpy as np
from torch.utils.data import Dataset


class IterDataset(Dataset):

    def __init__(self, feature_path):
        self.feature_path = feature_path

    def __len__(self):
        return len(self.feature_path)

    def __getitem__(self, index):
        feature = np.load(self.feature_path[index])

        X = feature[:, 1:]  # drop the label column
        y = feature[0, 0]   # label stored in the first column

        # checking for NaN
        if np.isnan(X).any():
            print('NAN ' + self.feature_path[index])

        return X, y

The cProfile output for the first 1000 batches:

 Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
2023   13.745    0.007   13.745    0.007 {method 'cuda' of 'torch._C._TensorBase' objects}
1000    1.931    0.002    1.931    0.002 {method 'run_backward' of 'torch._C._EngineBase' objects}

Hardware / Software I’m using:

  • CUDA version: 10.1
  • GPU: 2x GeForce GTX 1080

If anybody has an idea why .cuda() takes so much time, I would appreciate it.

Since CUDA operations are asynchronous, the host-to-device copy via .cuda() can create a synchronization point and thus accumulate the timing of the actual forward and backward passes: the copy has to wait for the kernels launched in the previous iteration, so cProfile attributes their runtime to .cuda().
If you would like to profile the code manually, you can add explicit synchronization points via torch.cuda.synchronize() before taking each timestamp.
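For example, a minimal timing sketch (assuming model, criterion, optimizer, and train_loader are defined as in your snippets above) could look like this:

import time

import torch

for i, (feature, labels) in enumerate(train_loader):
    torch.cuda.synchronize()  # wait for GPU work from the previous iteration
    t0 = time.perf_counter()

    feature = feature.cuda()
    labels = labels.cuda()

    torch.cuda.synchronize()  # make sure the host-to-device copy has finished
    t1 = time.perf_counter()

    outputs = model(feature.float())
    loss = criterion(outputs, labels.long())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()  # make sure forward/backward/step are done
    t2 = time.perf_counter()

    print(f'copy: {t1 - t0:.4f}s, fwd/bwd/step: {t2 - t1:.4f}s')

With these synchronization points, the time spent in the copy is no longer inflated by kernels still running from earlier iterations, and you should see where the time is actually going.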
