Very slow training of a 3D CNN

Hi.

I’ve got 4D data of shape [20, 200, 200, 200], where 20 is the number of electrode measurements, taken on different parts of the body from an EEG. The [200, 200, 200] part is a custom sparse tensor transformation of a very large, topologically preprocessed matrix of shape [9501, 3].
The idea is to represent each electrode signal as a 3D shape, where the [200, 200, 200] tensor is a sparse stack of 2D slices, sliced 200 times to get a 3D shape, i.e., the inherent topology of the underlying system of each electrode.

But now to the challenge I have. I’m using an RTX 3060 12GB, 2 NVMe SSDs, 48GB RAM, an i5-9600K, batch_size=1, and full preprocessing in the Dataset module, and the full forward/backward step is very slow. The maximum I’ve tried is batch_size=4, but I have 17000 EEG parquet files to train on.
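As a rough illustration of why the [200, 200, 200] input is so heavy, here is a back-of-the-envelope estimate of the size of the first convolution's output alone. It uses the layer settings from the model posted further down (64 output channels, kernel size 3, padding 2, stride 1) and assumes float32 activations. The helper name is made up for this sketch, and it ignores gradients and every other layer, so real memory use will be higher.

```python
def conv1_activation_bytes(side, out_channels=64, kernel=3, padding=2,
                           stride=1, bytes_per_value=4):
    """Bytes needed to store one sample's output of the first Conv3d layer."""
    # Standard convolution output-size formula, applied to each spatial axis.
    out_side = (side + 2 * padding - kernel) // stride + 1
    return out_channels * out_side ** 3 * bytes_per_value


for side in (200, 40):
    size = conv1_activation_bytes(side)
    print(f"input side {side}: {size / 2**20:.1f} MiB per sample")
```

Under these assumptions the first activation is roughly 2 GiB per sample at side 200 and about 18 MiB at side 40, before counting anything needed for the backward pass.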

EDIT: I have now reduced the dimensions to [20, 40, 40, 40], at the cost of some granularity. Now each .parquet file takes 1 second per epoch with batch_size=8. If I train on 3600 files for 10 epochs, it would take 10 hours. Is this normal?
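The time estimate can be checked with a couple of lines. This is only the arithmetic from the post (files × epochs × seconds per file); the function name is made up, and the second call assumes the same 1 second/file would hold for the full set of 17000 files, which the post does not claim.

```python
def training_hours(n_files, n_epochs, seconds_per_file):
    """Total wall-clock hours if every file costs the same time in every epoch."""
    return n_files * n_epochs * seconds_per_file / 3600


print(training_hours(3600, 10, 1.0))             # the 3600-file run from the post
print(round(training_hours(17000, 10, 1.0), 1))  # assumption: same rate on all files
```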

What can be done besides hardware?

I would suggest narrowing down the bottleneck in your code, as it’s unclear whether the actual model training on the GPU or e.g. the data loading and processing is causing the slow execution time. Once you have isolated the bottleneck, you will have a better idea of what to improve.
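One way to do this, as a minimal sketch, is to time how long the loop waits for the next batch separately from how long the training step takes. `time_loop`, `step`, and `sync` are hypothetical names for this example, and the demo uses a fake slow loader instead of a real DataLoader.

```python
import time


def time_loop(batches, step, sync=None, max_batches=20):
    """Return (seconds spent waiting for data, seconds spent in the step)."""
    data_time = step_time = 0.0
    it = iter(batches)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)   # loading + preprocessing happens here
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)            # forward/backward/optimizer step
        if sync is not None:
            sync()             # wait for asynchronous GPU work to finish
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time


def slow_batches(n, delay):
    """Stand-in for a loader whose preprocessing dominates."""
    for i in range(n):
        time.sleep(delay)
        yield i


data_s, step_s = time_loop(slow_batches(5, 0.02), step=lambda batch: None)
print(f"data: {data_s:.3f}s, step: {step_s:.3f}s")
```

With PyTorch, `batches` would be the DataLoader, `step` a function that runs the forward pass, backward pass, and optimizer step, and `sync=torch.cuda.synchronize` so that asynchronously launched CUDA kernels are counted in the step time. If the data time dominates, the preprocessing in the Dataset is the place to look.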

Thanks for the reply.

What I’ve seen is that if I increase the batch_size or the number of model parameters, GPU utilization increases, but training isn’t much faster. Specifically, increasing batch_size above 64 decreases CPU utilization and increases GPU utilization. As I wrote, the full preprocessing is done in the Dataset module, the data is moved to the GPU in the DataLoader, and the model is already on the GPU.

The trade-off between GPU and CPU utilization seems to be connected to the batch_size and also to the varying sizes of the .parquet files, which range from 200 KB to 29 MB.

I tried taking the 100 smallest files, from 200 KB to 250 KB, and that is where I see 1 second/file per epoch. For the larger files, it is more than 1 second/file per epoch.

Dataset module.

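import os

import numpy as np
import pandas as pd
import polars as pl
import torch
from gtda.time_series import SingleTakensEmbedding  # assumed source: giotto-tda
from sklearn.decomposition import PCA  # assumed source: scikit-learn
from torch.utils.data import Dataset

# small_tryout and create_3d_tensor_from_pca are defined elsewhere (not shown).
class CustomDataset2(Dataset):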
    
    def __init__(self, filepath, annotation_file):
        self.filepath = filepath
        self.annotations = pd.read_csv(annotation_file).query(f"filename in {small_tryout}")
        self.annotations.iloc[:, 2:8] = self.annotations.iloc[:, 2:8].div(self.annotations.iloc[:, 2:8].sum(axis=1), axis=0)
        self.__tf = SingleTakensEmbedding(time_delay=1, dimension=500, stride=1, n_jobs=-1, parameters_type="fixed")
        self.__pca = PCA(n_components=3)
        
        
    def __len__(self):
        return len(self.annotations)
    
    def __getitem__(self, idx):
        eeg_sample_path = self._get_path(idx)
        labels = self._get_sample_label(idx)
        
        eeg = pl.read_parquet(eeg_sample_path).fill_null(0)
        eeg = self._transform(eeg)
        eeg = self._clean_and_normalize(eeg)
        
        
        return torch.tensor(eeg, dtype=torch.float16), torch.tensor(labels, dtype=torch.float16)
        
    def _get_path(self, idx):
        path = os.path.join(self.filepath, self.annotations.iloc[idx, 8])
        return path
    
    def _get_sample_label(self, idx):
        return self.annotations.iloc[idx, 2:8].tolist()
    
    
    def _transform(self, eeg):
        
        rows = len(eeg)
        offset = (rows-2500)//2
        eeg = eeg[offset:offset+2500]
        train_arr = np.array(eeg)
        # fill_this2 = np.zeros((20, 9501, 500))
        takens_embedded = [self.__tf.fit_transform(train_arr[:, col]) for col in range(20)]
        fill_this2 = np.stack(takens_embedded, axis=0)
        pcas = [self.__pca.fit_transform( fill_this2[col, :, :] ) for col in range(20)]
        fill_this3 = np.stack(pcas, axis=0)
        
        parsed_tensors = [create_3d_tensor_from_pca(fill_this3[col, :, :], shape=(40,40,40)) for col in range(20)]
        fill_this4 = np.stack(parsed_tensors, axis=0)
        return fill_this4

    
    def _clean_and_normalize(self, tensor):
        
        # Replace NaN, +inf and -inf with 0 (returns a new array)
        tt = np.nan_to_num(tensor, nan=0.0, posinf=0.0, neginf=0.0)
        
        # Standardize along axis 1
        mu = tt.mean(axis=1, keepdims=True)
        sd = tt.std(axis=1, keepdims=True)
        tt = (tt-mu) / (sd + 1e-6)
        
        # Replace NaN and infinite values again in case the division explodes
        tt = np.nan_to_num(tt, nan=0.0, posinf=0.0, neginf=0.0)
            
        return tt
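Since `__getitem__` reruns the crop, Takens embedding, PCA, and 3D tensor construction for every file in every epoch, one option is to compute the result once per file and store it. The sketch below assumes the transform is deterministic for a given file, which the posted code suggests but does not guarantee. `cached_transform` and `cache_dir` are made-up names, and the demo uses a fake transform instead of the real pipeline.

```python
import os
import tempfile

import numpy as np


def cached_transform(sample_path, transform, cache_dir):
    """Run transform(sample_path) once; later calls load the saved array."""
    os.makedirs(cache_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(sample_path))[0] + ".npy"
    cache_path = os.path.join(cache_dir, name)
    if os.path.exists(cache_path):
        return np.load(cache_path)          # cheap path for later epochs
    result = np.asarray(transform(sample_path))
    np.save(cache_path, result)             # expensive path, first visit only
    return result


# Demo with a stand-in transform that records how often it is called.
calls = []


def fake_transform(path):
    calls.append(path)
    return np.zeros((20, 40, 40, 40), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    first = cached_transform("some_recording.parquet", fake_transform, tmp)
    second = cached_transform("some_recording.parquet", fake_transform, tmp)

print(len(calls), second.shape)
```

A float32 array of shape [20, 40, 40, 40] is about 5.1 MB uncompressed, so caching all 17000 files this way would need roughly 87 GB of disk. Because the tensors are described as sparse, `np.savez_compressed` might shrink that considerably, at the cost of some decompression time.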

CNN model (dirty code I know)

import torch.nn as nn
import torch.nn.functional as F

class CNN4D(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        
        self.conv3d = nn.Conv3d(20, out_channels=64, kernel_size=(3,3,3), stride=1, padding=2)
        self.drop3d = nn.Dropout3d(.3)
        self.batchnorm = nn.BatchNorm3d(64)
        self.pool = nn.MaxPool3d(kernel_size=2)
        
        self.conv3d_2 = nn.Conv3d(64, 16, kernel_size=(3,3,3), stride=1, padding=2)
        # self.conv3d_3 = nn.Conv2d(128, 20, kernel_size=3, stride=1, padding=2)
        self.batchnorm2 = nn.BatchNorm3d(16)
        self.flatten = nn.Flatten()
        # 194672 = 16 channels * 23**3 for an input of shape [20, 40, 40, 40]
        self.fc1 = nn.Linear(194672, num_classes)
        
    def forward(self, x):
        
        x = x.float()
        x = F.relu(self.conv3d(x))
        x = self.drop3d(x)
        x = self.batchnorm(x)
        x = self.pool(x)
        
        x = F.relu(self.conv3d_2(x))
        # x = F.relu(self.conv3d_3(x))
        x = self.batchnorm2(x)
        x = self.fc1(self.flatten(x))
        return x
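The hard-coded 194672 input features of `fc1` can be reproduced from the layer settings above with the usual output-size formula. This is a small sketch with made-up helper names, assuming the reduced [20, 40, 40, 40] input.

```python
def conv_out(n, kernel, padding=0, stride=1):
    """Output length along one spatial axis for a conv or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def flattened_features(side=40, last_channels=16):
    side = conv_out(side, kernel=3, padding=2)   # conv3d:       40 -> 42
    side = conv_out(side, kernel=2, stride=2)    # MaxPool3d(2): 42 -> 21
    side = conv_out(side, kernel=3, padding=2)   # conv3d_2:     21 -> 23
    return last_channels * side ** 3


print(flattened_features(40))
```

Note that padding=2 with a 3x3x3 kernel grows each spatial side by 2 per convolution, whereas padding=1 would keep it unchanged. Nearly all of the parameters and a large share of the work sit in that final linear layer, so an extra pooling stage before `flatten` may be worth trying.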