GPU memory is consumed but GPU util is at 0%: how to deal with data bottlenecks?

Hey everyone,

I’m working on a project that deals with a huge amount of data. I created a custom Dataset for the DataLoader, but while there is significant GPU memory consumption, GPU utilization stays at 0%.

This is the custom Dataset class:

import re

import numpy as np
from torch.utils.data import Dataset

# `scaler` (an sklearn scaler) and `label_file` (the table of label columns)
# are defined at module level in my setup.

class CustomDataset(Dataset):
    def __init__(self, filenames, batch_size, transform=None):
        self.filenames = filenames
        self.batch_size = batch_size
        self.transform = transform

    def __len__(self):
        return int(np.ceil(len(self.filenames) / float(self.batch_size)))  # Number of chunks.

    def __getitem__(self, idx):  # idx is the index of the chunk.
        # All the preprocessing happens here: read the files of one chunk,
        # scale them, extract the labels, then return data and labels.

        # Without this guard, iterating over the dataset directly (outside a
        # DataLoader) never stops; checking it first also avoids doing a whole
        # chunk's worth of work before failing.
        if idx >= self.__len__():
            raise IndexError

        batch_x = self.filenames[idx * self.batch_size:(idx + 1) * self.batch_size]  # One batch of file-name lists.
        data = []
        labels = []
        for exp in batch_x:
            temp_data = []
            for file in exp:
                temp = np.loadtxt(open(file), delimiter=",")  # Change this line to read any other file type.
                temp = scaler.fit_transform(temp)
                temp_data.append(temp)  # Stack the per-file matrices as channels.

            pattern = exp[0][16:20]  # Pattern extracted from the file name.
            for column in label_file:
                if re.match(str(pattern), str(column)):  # Pattern is matched against the label classes.
                    labels.append(np.asarray(label_file[column]))

            data.append(temp_data)

        data = np.asarray(data)  # Shape (batch, channels, rows, cols), following PyTorch's channel-first convention.
        if self.transform:
            data = self.transform(data)
        data = data.reshape(data.shape[0], data.shape[1] * data.shape[2] * data.shape[3])  # Flatten each sample.
        labels = np.asarray(labels)

        return data, labels

I decided to batch the data already in the Dataset, but doing it in the DataLoader instead does not change a thing. `filenames` is a list of 50k lists, each containing the 48 CSV paths needed for one label, which are then flattened.
Calling DataLoader with num_workers=os.cpu_count() and pin_memory=True doesn't change a thing either.
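For reference, the loader is created roughly like this (a sketch; `train_filenames` and the chunk size of 32 are placeholders, and batch_size=None because __getitem__ already returns a whole batch):

import os

from torch.utils.data import DataLoader

# Sketch of the call; `train_filenames` and the chunk size are placeholders.
train_dataset = CustomDataset(train_filenames, batch_size=32)

# batch_size=None disables the DataLoader's own batching, since the Dataset
# already returns one whole (data, labels) chunk per index.
train_dataloader = DataLoader(
    train_dataset,
    batch_size=None,
    num_workers=os.cpu_count(),
    pin_memory=True,
)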

When I use dummy variables instead, the GPU does work, and when I try to run the preprocessing outside the Dataset to save everything to .npy files the kernel dies, so it's definitely a bottleneck on the data side.
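For context, the dummy-variable check was roughly this kind of thing (a sketch; the model, sizes and optimizer here are stand-ins, not my real ones):

import torch
from torch import nn

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and data; all shapes here are made up just to exercise the GPU.
model = nn.Linear(48 * 100, 10).to(dev)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

dummy_inputs = torch.randn(32, 48 * 100, device=dev)
dummy_targets = torch.randn(32, 10, device=dev)

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(dummy_inputs), dummy_targets)
    loss.backward()
    optimizer.step()   # with no DataLoader in the loop, GPU utilization goes up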

When I try to time the data loading:

    start_time = time.time()
    for i, data in enumerate(train_dataloader, 0):
        # The elapsed time here includes waiting for the DataLoader to
        # produce the batch, plus unpacking and the host-to-device copy.
        inputs, targets = data
        inputs, targets = inputs.to(dev), targets.to(dev)
        inputs, targets = inputs.float(), targets.float()
        print(time.time() - start_time)
        start_time = time.time()

I get:

Does anyone have an idea of how I can deal with this?

Thanks a lot :slight_smile:

It's more efficient to move as much work as you can safely into the DataLoader worker subprocesses.

I worked on a project where we had a problem like that. You can do everything in the main process using threads, or the internal thread-pool implementation, but the GIL overhead makes it much less efficient.
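As a sketch of what moving work into the workers means for the code in the question: return ready-made float tensors from __getitem__, so the cast runs in the worker process instead of the training loop (the helper name here is mine, not from the question):

import numpy as np
import torch

def to_float_tensors(data, labels):
    # Hypothetical helper, called at the end of CustomDataset.__getitem__:
    # the numpy -> float32 tensor conversion then runs inside the DataLoader
    # worker process (when num_workers > 0), not in the main training loop.
    data = torch.from_numpy(np.asarray(data, dtype=np.float32))
    labels = torch.from_numpy(np.asarray(labels, dtype=np.float32))
    return data, labels

With that, the `.float()` casts can be dropped from the training loop.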

Another thing you can explore, when you have a lot of small transfers, is passing non_blocking=True, so that blocking only happens when you really need that value on the GPU.
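A minimal sketch of that pattern, assuming the loader was built with pin_memory=True (non_blocking only pays off for pinned host memory) and that `dev`, `model` and `criterion` exist as in the question:

for inputs, targets in train_dataloader:
    # The host-to-device copies are queued asynchronously; the CPU only
    # blocks when the value is actually needed on the GPU side.
    inputs = inputs.to(dev, non_blocking=True).float()
    targets = targets.to(dev, non_blocking=True).float()
    loss = criterion(model(inputs), targets)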

One more technique is to mmap all the files you will need at runtime, so you can use the Linux page cache to your advantage: you trade some extra memory usage for less time spent waiting for your data to load. We mmap all the files we will need when we initialize the program; mmap itself is really fast, and the data is only loaded on demand.
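A minimal sketch of that idea with plain Python's mmap module (the helper name is mine):

import mmap

def premap_files(filenames):
    # Hypothetical helper, run once at start-up: map every file read-only.
    # The mmap call itself is cheap; pages are pulled into the Linux page
    # cache lazily, the first time that part of the file is actually read.
    mapped = {}
    for path in filenames:
        with open(path, "rb") as f:
            mapped[path] = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mapped

The Dataset can then parse from the mapped bytes instead of reopening each file, e.g. np.loadtxt(io.BytesIO(mapped[path]), delimiter=","). With tens of thousands of small CSVs you may hit the open-file-descriptor limit, so mapping per chunk rather than everything up front might be necessary.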

Increasing the batch size helps too, but it will use significantly more VRAM, RAM, and shared memory, so your training process may die from out-of-memory.