CUDA always out of memory with a deep multiple instance learning model

Hello everybody!

I am trying to write and train a deep multiple instance learning model for 3D image classification, but I always run into an “out of memory” error and cannot figure out what causes it.

Since the 3D images are too large, I split each 3D image into overlapping patches. So my input is an array of 3D patches, and my output is a label.

During training, each patch is fed to a 3D CNN with a softmax layer, and the 3D CNN outputs a probability for each patch. After obtaining the probabilities of all the patches of a 3D image, the deep multiple instance learning model selects the patches with the maximum probability and back-propagates through them. During back-propagation, I always get “cudaError: out of memory”. My code is below:


    running_loss, running_corrects = 0,0
    patch_unit_size = 200

    for pat_batch in self.dcm_datasets['train']:
        inputs, labels, data_dir = pat_batch   
        # inputs has shape [batch_num, patch_num, channel_num, patch_height, patch_width, patch_length],
        # where batch_num=1, patch_num varies between 3D images, channel_num=3,
        # and patch_height=patch_width=patch_length=24

        for input_each_batch in inputs:
            patches_size = input_each_batch.shape[0]
            num_patches = math.ceil(1.0 * patches_size / patch_unit_size)

            patch_out_max, patch_out_prob_max = None, None
            for i in range(num_patches):  # Find out the patches with the maximum probability
                inputs_tmp = input_each_batch[i * self.patch_size: (i + 1) *
                          self.patch_size] if i < num_patches - 1 else input_each_batch[i * self.patch_size:]
                with torch.cuda.device(self.cuda_ids[0]):
                    inputs_new = Variable(inputs_tmp.cuda()).to(self.cuda_ids[0])
                    labels_new = Variable(labels.cuda()).to((self.cuda_ids[0]))

                outputs = self.model(inputs_new)  # 1*2
                out_probs = torch.nn.functional.softmax(outputs, dim=1).data
                patch_out_prob_max = out_probs if patch_out_prob_max is None else torch.cat(
                    (out_probs, patch_out_prob_max), dim=0)

                '''find all the indices of the maximum'''
                patch_prob_max = patch_out_prob_max.cpu().numpy()
                inds_x, inds_y = np.where(patch_prob_max == np.max(patch_prob_max))
                patch_out_prob_max = patch_out_prob_max[inds_x]

                inds_x_out = inds_x[inds_x < outputs.shape[0]]
                inds_x_rest = inds_x[inds_x >= outputs.shape[0]]
                if inds_x_rest.shape[0] != 0:
                    inds_x_rest = inds_x_rest - outputs.shape[0]
                    patch_out_max = patch_out_max[inds_x_rest]
                if inds_x_out.shape[0] != 0:
                    patch_out_max_tmp = outputs[inds_x_out]
                    if inds_x_rest.shape[0] != 0:
                        patch_out_max = torch.cat((patch_out_max, patch_out_max_tmp), dim=0)
                    else:
                        patch_out_max = patch_out_max_tmp

                outputs, out_probs, inputs_new = 0, 0, 0

            labels_new_1 = None
            for i in range(patch_out_max.shape[0]):
                labels_new_1 = labels_new if labels_new_1 is None else torch.cat((labels_new_1, labels_new), dim=0)

            loss = self.criterion(patch_out_max, labels_new_1)
            self.optimizer.zero_grad()  # zero the parameter gradients

            preds = torch.argmax(patch_out_max[0].data)  # preds is still a tensor
            running_loss += loss.item()  # running_loss is a Python data
            running_corrects += np.sum(preds.item() == labels_new.item())

            loss, labels_new_1 = 0, None

    data_len = len(self.dcm_datasets['train'])
    epoch_loss = running_loss / data_len
    epoch_acc = running_corrects / data_len

    return epoch_acc, epoch_loss

Since patch_num can be as high as 1500, I am wondering whether the “CudaError: out of memory” is caused by a large computation graph. But I am not sure.

So my question is: how large is my computation graph? If the computation graph for one chunk of self.patch_size patches has size O, is my total computation graph O or num_patches*O?

If my computation graph is just O, the CUDA memory usage should not be large. In that case, what leads to “CUDA out of memory”?


The memory in use is only what you still hold a reference to. In your case, at the end of the loop, that is both all the patches you kept in patch_out_max (I can't tell how many that is just from reading your code) and everything from the last iteration of the loop (because Python does not delete loop variables when the loop ends).
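
To make the reference rule concrete, here is a minimal pure-Python sketch. It is not your training code: `FakeOutput` is a hypothetical stand-in for a model output, `kept` plays the role of patch_out_max, and the choice of which chunks get "selected" is arbitrary. It only shows which objects stay reachable after the loop.

```python
import gc
import weakref


class FakeOutput:
    """Hypothetical stand-in for a model output that holds on to its graph."""

    def __init__(self, chunk_index):
        self.chunk_index = chunk_index


def alive_flags(refs):
    """Return, for each chunk, whether its object is still reachable."""
    gc.collect()  # not needed on CPython, harmless on other interpreters
    return [ref() is not None for ref in refs]


kept = []  # plays the role of patch_out_max
refs = []  # weak references observe objects without keeping them alive

for i in range(4):
    outputs = FakeOutput(i)
    refs.append(weakref.ref(outputs))
    if i % 2 == 0:  # assumption: pretend chunks 0 and 2 were selected
        kept.append(outputs)

# Chunk 1 is gone (nothing refers to it), chunks 0 and 2 are held by `kept`,
# and chunk 3 is still held by the loop variable `outputs`.
after_loop = alive_flags(refs)

del outputs  # drop the leftover loop variable
after_del = alive_flags(refs)

kept.clear()  # drop the selected outputs as well
after_clear = alive_flags(refs)

print(after_loop, after_del, after_clear)
```

The same logic applies to tensors: as far as I understand autograd, an output tensor that requires grad keeps its graph (and the saved activations) alive for as long as the tensor itself is referenced, so every chunk whose output ends up in patch_out_max holds on to its own graph. If the selection step does not need gradients, running it under `torch.no_grad()` and then re-running only the selected patches with gradients enabled is one way to keep a single chunk's graph in memory.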