RuntimeError does not make sense

Hello,

I have defined a DenseNet architecture in PyTorch to train it on data consisting of 15000 samples of 128x128 images. When I try to train the network, I get this error stack:

RuntimeError                              Traceback (most recent call last)
<ipython-input-40-6dace5fb9ac5> in <module>
    118             labels = batch[1].float().to(device)
    119 
--> 120             preds = network(images) # Pass Batch
    121 #             preds_dev = network(images_dev)
    122 

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

<ipython-input-7-652ef2b441c1> in forward(self, x)
     81 
     82     def forward(self, x):
---> 83         out = self.relu(self.lowconv(x))
     84 
     85         out = self.denseblock1(out)

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py in forward(self, input)
    417 
    418     def forward(self, input: Tensor) -> Tensor:
--> 419         return self._conv_forward(input, self.weight)
    420 
    421 class Conv3d(_ConvNd):

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py in _conv_forward(self, input, weight)
    413                             weight, self.bias, self.stride,
    414                             _pair(0), self.dilation, self.groups)
--> 415         return F.conv2d(input, weight, self.bias, self.stride,
    416                         self.padding, self.dilation, self.groups)
    417 

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 536870912 bytes. Buy new RAM!

However, it does not make sense to me: 536870912 bytes = 512 MB. The total RAM that I have is 64 GB, and the task manager shows that only ~1% of that memory is utilized. Would anybody know what is going on here and how I can tell PyTorch that there is enough memory?
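
As a side note, if I am not mistaken, this particular allocation matches the output of self.lowconv for a single batch:

128 (batch) * 64 (channels) * 128 * 128 (pixels) * 4 (bytes per float32) = 536870912 bytes

so the requested size itself looks plausible; what I don't understand is why such an allocation fails.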

Nevertheless, some of the tabs in Chrome cannot be kept open and throw messages like “Not enough memory to open this page. Error code: Out of Memory” or “Can’t open this page. Error code: Crashpad_HandlerDidNotRespond”, so essentially Chrome agrees with PyTorch. But the numbers still do not add up!

The error message reports the size of the single failed allocation (not the total amount of allocated memory).
If other programs are also reporting out-of-memory issues, I would assume this is indeed the case for your workflow. I don’t know why the task manager doesn’t show the correct memory usage, though.

However, you could try to check the memory usage e.g. via psutil.
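
Something like this minimal sketch (assuming psutil is installed) would print both the system-wide and the process-specific usage:

import psutil

print(psutil.virtual_memory())                  # system-wide RAM usage
print(psutil.Process().memory_info().rss / 1024**2, "MB used by this process")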

Thank you very much for your reply. Here I will just include the code that I use, so the workflow is more or less clear. I will look into psutil and report the results later.

I define the DenseNet-architecture as follows:

class Dense_Block(nn.Module):
    def __init__(self, in_channels):
        super(Dense_Block, self).__init__()

        self.relu = nn.ReLU(inplace = True)
        self.bn = nn.BatchNorm2d(num_features = in_channels)

        self.conv1 = nn.Conv2d(in_channels = in_channels, out_channels = 32, kernel_size = 3, stride = 1, padding = 1)
        self.conv2 = nn.Conv2d(in_channels = 32, out_channels = 32, kernel_size = 3, stride = 1, padding = 1)
        self.conv3 = nn.Conv2d(in_channels = 64, out_channels = 32, kernel_size = 3, stride = 1, padding = 1)
        self.conv4 = nn.Conv2d(in_channels = 96, out_channels = 32, kernel_size = 3, stride = 1, padding = 1)
        self.conv5 = nn.Conv2d(in_channels = 128, out_channels = 32, kernel_size = 3, stride = 1, padding = 1)

    def forward(self, x):

        bn = self.bn(x)
        conv1 = self.relu(self.conv1(bn))

        conv2 = self.relu(self.conv2(conv1))
        c2_dense = self.relu(torch.cat([conv1, conv2], 1))

        conv3 = self.relu(self.conv3(c2_dense))
        c3_dense = self.relu(torch.cat([conv1, conv2, conv3], 1))

        conv4 = self.relu(self.conv4(c3_dense))
        c4_dense = self.relu(torch.cat([conv1, conv2, conv3, conv4], 1))

        conv5 = self.relu(self.conv5(c4_dense))
        c5_dense = self.relu(torch.cat([conv1, conv2, conv3, conv4, conv5], 1))

        return c5_dense

class Transition_Layer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(Transition_Layer, self).__init__()

        self.relu = nn.ReLU(inplace = True)
        self.bn = nn.BatchNorm2d(num_features = out_channels)
        self.conv = nn.Conv2d(in_channels = in_channels, out_channels = out_channels, kernel_size = 1, bias = False)
        self.avg_pool = nn.AvgPool2d(kernel_size = 2, stride = 2, padding = 0)

    def forward(self, x):

        bn = self.bn(self.relu(self.conv(x)))
        out = self.avg_pool(bn)

        return out

class DenseNet(nn.Module):
    def __init__(self, nr_classes):
        super(DenseNet, self).__init__()

        self.lowconv = nn.Conv2d(in_channels = 1, out_channels = 64, kernel_size = 7, padding = 3, bias = False)
        self.relu = nn.ReLU()

        # Make Dense Blocks
        self.denseblock1 = self._make_dense_block(Dense_Block, 64)
        self.denseblock2 = self._make_dense_block(Dense_Block, 128)
        self.denseblock3 = self._make_dense_block(Dense_Block, 128)

        # Make transition Layers
        self.transitionLayer1 = self._make_transition_layer(Transition_Layer, in_channels = 160, out_channels = 128)
        self.transitionLayer2 = self._make_transition_layer(Transition_Layer, in_channels = 160, out_channels = 128)
        self.transitionLayer3 = self._make_transition_layer(Transition_Layer, in_channels = 160, out_channels = 64)

        # Classifier
        self.bn = nn.BatchNorm2d(num_features = 64)
        self.pre_classifier = nn.Linear(64*16*16, 512)
        self.classifier = nn.Linear(512, nr_classes)

    def _make_dense_block(self, block, in_channels):
        layers = []
        layers.append(block(in_channels))
        return nn.Sequential(*layers)

    def _make_transition_layer(self, layer, in_channels, out_channels):
        modules = []
        modules.append(layer(in_channels, out_channels))
        return nn.Sequential(*modules)

    def forward(self, x):
        out = self.relu(self.lowconv(x))

        out = self.denseblock1(out)
        out = self.transitionLayer1(out)

        out = self.denseblock2(out)
        out = self.transitionLayer2(out)

        out = self.denseblock3(out)
        out = self.transitionLayer3(out)

        out = self.bn(out)
#         print(out.shape)
        out = out.reshape(-1, 64*16*16)

        out = self.pre_classifier(out)
        out = self.classifier(out)

        return out
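
As a quick sanity check of the shapes (just a dummy forward pass on a toy batch, assuming 128x128 single-channel inputs), the network indeed flattens to 64*16*16 features and returns one 128-vector per sample:

import torch

dummy = torch.randn(2, 1, 128, 128)        # two dummy single-channel 128x128 images
out = DenseNet(nr_classes=128)(dummy)
print(out.shape)                           # expected: torch.Size([2, 128])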

Then I define my Dataset class:

class MyDataset(Dataset):
    def __init__(self, images, n, labels=None, transforms=None):
        self.X = images
        self.y = labels
        self.n = n
        self.transforms = transforms
         
    def __len__(self):
        return (len(self.X))
    
    def __getitem__(self, i):
        data = self.X.iloc[i, :]
#         print(data.shape)
        data = np.asarray(data).astype(np.float).reshape(1, self.n, self.n)
        
        if self.transforms:
            data = self.transforms(data).reshape(1, self.n, self.n)
            
        if self.y is not None:
            y = self.y.iloc[i,:]
#             y = np.asarray(y).astype(np.float).reshape(2*n+1,) # for 257-vector of labels
            y = np.asarray(y).astype(np.float).reshape(128,) # for 128-vector of labels
            return (data, y)
        else:
            return data

Then I create the instances of the train, dev, and test data:

train_data = MyDataset(train_images, n, train_labels, None)
dev_data = MyDataset(dev_images, n, dev_labels, None)
test_data = MyDataset(test_images, n, test_labels, None)

The shapes of train_images, dev_images and test_images are respectively (15000, 16384), (4000, 16384) and (1000, 16384). So there are in total 20000 samples of 128x128 (=16384) images.

The shapes of train_labels, dev_labels and test_labels are respectively (15000, 128), (4000, 128) and (1000, 128). So there are in total 20000 label vectors of length 128.
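
As a quick check of the Dataset itself (assuming n = 128 and no transforms), a single training sample comes back with the expected shapes:

sample_img, sample_label = train_data[0]
print(sample_img.shape, sample_label.shape)   # expected: (1, 128, 128) (128,)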

I define also a custom loss function:

class Loss():    
    def __init__(self,yHat,y):
        self.n_samples = yHat.size()[0]
        self.n_points = yHat.size()[1]
        self.preds = yHat
        self.labels = y
        self.size = yHat.size()[0]*yHat.size()[1]
        self.diff = yHat - y
        
    def Huber(self, delta=1.):
        abs_diff = torch.abs(self.diff)
        return torch.sum(torch.where(abs_diff < delta,
                                     .5 * self.diff**2,
                                     delta * (abs_diff - .5 * delta))) / self.size
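
As a quick sanity check (for delta = 1, which is the value I use), this should agree with PyTorch's built-in smooth L1 loss, i.e. the same Huber loss averaged over all elements:

import torch
import torch.nn.functional as F

yHat, y = torch.randn(4, 128), torch.randn(4, 128)
print(Loss(yHat, y).Huber())       # custom Huber with delta = 1
print(F.smooth_l1_loss(yHat, y))   # built-in equivalent; should be (nearly) identical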

Then I create an instance of the model:

densenet = DenseNet(nr_classes=128).float().to('cpu')

Then I initialize the parameters, create train- and dev-set dataloaders, and train the model using the Adam optimizer and the Huber loss function:

nn.init.kaiming_uniform_(list(densenet.parameters())[0], nonlinearity = 'relu')
loader = DataLoader(train_data,batch_size=128,shuffle=False,num_workers=0)
loader_dev = DataLoader(dev_data,batch_size=10,shuffle=None,num_workers=0)
N_epochs = 10
for epoch in range(N_epochs):
      optimizer = optim.Adam(densenet.parameters(), lr=.001, betas=(0.9, 0.999), eps=1e-08)
      for batch in loader:
            images = batch[0].float().to('cpu')
            labels = batch[1].float().to('cpu')
            preds = densenet(images)
            loss = Loss(preds,labels).Huber()

            loss_dev = 0
            for batch_dev in loader_dev:
                images_dev = batch_dev[0].float().to('cpu')
                labels_dev = batch_dev[1].float().to('cpu')
                preds_dev = densenet(images_dev)
                loss_ = Loss(preds_dev,labels_dev).Huber()
                loss_dev += loss_

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            

I checked the memory usage with psutil and it did not show any anomalies either.

Here is the result before the training process started:

Number of logical CPUs:  24
Number of usable CPUs:  24

System-wide CPU utilization in percents
[0.0, 0.0, 3.1, 0.0, 3.1, 0.0, 0.0, 3.0, 3.1, 0.0, 0.0, 0.0, 0.0, 0.0, 4.6, 0.0, 3.1, 0.0, 0.0, 0.0, 0.0, 0.0, 4.6, 0.0]

scputimes(user=827.953125, system=295.078125, idle=143858.734375, interrupt=25.1875, dpc=10.765625)
scpustats(ctx_switches=26253563, interrupts=26349294, soft_interrupts=0, syscalls=95250029)

CPU frequency
[scpufreq(current=0.0, min=0.0, max=3793.0)]

Average system load (last 1,5,15 min): (0.0, 0.0, 0.0)

Virtual memory: svmem(total=68661460992, available=59425591296, percent=13.5, used=9235869696, free=59425591296)
Swap memory: sswap(total=131090264064, used=12851814400, free=118238449664, percent=9.8, sin=0, sout=0)

p.name: python.exe
p.cpu_times: pcputimes(user=183.859375, system=5.84375, children_user=0.0, children_system=0.0)
p.cpu_percent: 0.0
p.create_time: 1597084049.2833054
p.ppid: 4692
p.status: running

p.memory_info: pmem(rss=3085524992, vms=4347408384, num_page_faults=2529304, peak_wset=5924450304, wset=3085524992, peak_paged_pool=4022904, paged_pool=4000824, peak_nonpaged_pool=1136144, nonpaged_pool=243520, pagefile=4347408384, peak_pagefile=7221661696, private=4347408384)

p.memory_full_info: pfullmem(rss=3085524992, vms=4347408384, num_page_faults=2529304, peak_wset=5924450304, wset=3085524992, peak_paged_pool=4022904, paged_pool=4000824, peak_nonpaged_pool=1136144, nonpaged_pool=243520, pagefile=4347408384, peak_pagefile=7221661696, private=4347408384, uss=3023179776)

p.num_threads: 11
p.cpu_affinity: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

And here is the result retrieved at the end of the first iteration/epoch (the training can't reach the second iteration due to the memory issue stated above):

Number of logical CPUs:  24
Number of usable CPUs:  24

System-wide CPU utilization in percents
[18.8, 0.0, 18.7, 0.0, 18.8, 0.0, 9.2, 10.9, 9.4, 10.8, 0.0, 18.8, 18.8, 0.0, 18.8, 0.0, 18.8, 0.0, 17.2, 0.0, 17.2, 0.0, 0.0, 0.0]

scputimes(user=832.453125, system=296.140625, idle=144024.65625, interrupt=25.203125, dpc=10.765625)
scpustats(ctx_switches=26279544, interrupts=26383509, soft_interrupts=0, syscalls=101678318)

CPU frequency
[scpufreq(current=0.0, min=0.0, max=3793.0)]

Average system load (last 1,5,15 min): (0.0, 0.0, 0.0)

Virtual memory: svmem(total=68661460992, available=59290755072, percent=13.6, used=9370705920, free=59290755072)
Swap memory: sswap(total=131090264064, used=13011005440, free=118079258624, percent=9.9, sin=0, sout=0)

p.name: python.exe
p.cpu_times: pcputimes(user=186.6875, system=6.796875, children_user=0.0, children_system=0.0)
p.cpu_percent: 0.0
p.create_time: 1597084049.2833054
p.ppid: 4692
p.status: running

p.memory_info: pmem(rss=3218452480, vms=4490510336, num_page_faults=2613072, peak_wset=5924450304, wset=3218452480, peak_paged_pool=4022904, paged_pool=4000824, peak_nonpaged_pool=1136144, nonpaged_pool=248144, pagefile=4490510336, peak_pagefile=7221661696, private=4490510336)

p.memory_full_info: pfullmem(rss=3218452480, vms=4490510336, num_page_faults=2613072, peak_wset=5924450304, wset=3218452480, peak_paged_pool=4022904, paged_pool=4000824, peak_nonpaged_pool=1136144, nonpaged_pool=248144, pagefile=4490510336, peak_pagefile=7221661696, private=4490510336, uss=3154444288)

p.num_threads: 22
p.cpu_affinity: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

And here is the code for generating this info:

def dump_cpu_memory_info():
    print("Number of logical CPUs: ",psutil.cpu_count())
    print("Number of usable CPUs: ",len(psutil.Process().cpu_affinity()))
    print()
    print('System-wide CPU utilization in percents')
    print(psutil.cpu_percent(interval=1,percpu=True))
    print()
    print(psutil.cpu_times())
    print(psutil.cpu_stats())
    print()
    print("CPU frequency")
    print(psutil.cpu_freq(percpu=True))
    print()
    print("Average system load (last 1,5,15 min):",psutil.getloadavg())
    print()
    print('Virtual memory:',psutil.virtual_memory())
    print('Swap memory:',psutil.swap_memory())
    print()

    p = psutil.Process()
    with p.oneshot():
        print("p.name:",p.name())
        print("p.cpu_times:",p.cpu_times())
        print("p.cpu_percent:",p.cpu_percent())
        print("p.create_time:",p.create_time())
        print("p.ppid:",p.ppid())
        print("p.status:",p.status())
        print()
        print("p.memory_info:",p.memory_info())
        print()
        print("p.memory_full_info:",p.memory_full_info())
        print()
        print("p.num_threads:",p.num_threads())
        print("p.cpu_affinity:",p.cpu_affinity())

I’m unfortunately not familiar enough with Windows and don’t know what might cause this issue.
A quick search pointed me towards this long description, which might be helpful.


Thanks for the link! I learned something new from that.
I accidentally managed to make it work by reconnecting the kernel of the Jupyter notebook. It seems it was the kernel that got confused about what to do.

However, I encountered another issue. When I set num_workers to more than 0 for the loader variable, the training freezes and doesn't proceed. I have waited for quite a while already and seemingly nothing happens. Would you have any idea what it could be?

Thanks for the information. That sounds indeed like very unexpected behavior.

Are you using the if-clause protection as given in the Windows FAQ?
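
I.e., as a rough sketch of the pattern, keep all DataLoader iteration inside a function and guard its call:

def main():
    ...  # create the DataLoaders and run the training loop here

if __name__ == '__main__':
    main()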

I tried using the if-clause protection; however, the behaviour is the same: it just freezes within the first iteration and does not continue further. Here is the code:

import multiprocessing
...
def main():
    nn.init.kaiming_uniform_(list(densenet.parameters())[0], nonlinearity = 'relu')
    loader = DataLoader(train_data,batch_size=128,shuffle=False,num_workers=0)
    loader_dev = DataLoader(dev_data,batch_size=10,shuffle=None,num_workers=0)
    N_epochs = 10
    for epoch in range(N_epochs):
          optimizer = optim.Adam(densenet.parameters(), lr=.001, betas=(0.9, 0.999), eps=1e-08)
          for batch in loader:
                images = batch[0].float().to('cpu')
                labels = batch[1].float().to('cpu')
                preds = densenet(images)
                loss = Loss(preds,labels).Huber()

                with torch.no_grad():
                      loss_dev = 0
                      for batch_dev in loader_dev:
                          images_dev = batch_dev[0].float().to('cpu')
                          labels_dev = batch_dev[1].float().to('cpu')
                          preds_dev = densenet(images_dev)
                          loss_ = Loss(preds_dev,labels_dev).Huber()
                          loss_dev += loss_

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

if __name__ == '__main__':
    multiprocessing.freeze_support()
    main()

I am also using a Jupyter notebook if this gives any clue.

Since the previous issue seems to have been related to the Jupyter kernel, could you try to run the script in a terminal in order to rule out some potential Jupyter issues?

Thanks for the advice! I tried to run it in the Anaconda terminal and in Spyder, and after waiting for a very long time, I got the following error messages:

Traceback (most recent call last):

  File "D:\Jupiter_playground\fashion_mnist_tidied.py", line 1134, in <module>
    main()

  File "D:\Jupiter_playground\fashion_mnist_tidied.py", line 1064, in main
    for batch in loader: # Get Batch

  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)

  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 737, in __init__
    w.start()

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 326, in _Popen
    return Popen(process_obj)

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)

OSError: [Errno 22] Invalid argument
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated

I will wait much longer and see if a similar message will appear in Jupyter as well.
Would you have any ideas on what the issue is?

The UnpicklingError might be raised if pickle isn't able to load a specific file, e.g. if the binary file is corrupt (due to a failed download).
If that’s the case, disabling shuffle and rerunning the script should yield the error at the same index.

However, based on the previous error message (OSError: [Errno 22] Invalid argument) it seems as if a wrong path was used to try to load a file.

@ptrblck That's interesting, and I'm not sure how much sense it makes.

I tried with shuffle=False, but the behaviour is still the same.
Also, I would think that since it works with num_workers=0, the files are loaded successfully in that case.

I added some print statements around the for-loop as follows:

            print("Before batch")
            for batch in loader: # Get Batch
                print("After batch")

For num_workers > 0 the line "Before batch" is printed, while the line "After batch" is not.
Could it be related to some peculiarities in AMD processors (my case), compared to Intel?
Are there other tests I could perform to narrow down the issue?

It’s hard to tell if the issue is CPU- or OS-dependent (or is triggered by something else entirely).
What is concerning is that the previous issue was apparently solved after using the terminal instead of Jupyter, so your local environment might also be “broken”.

Could you create a new virtual environment, reinstall PyTorch, and rerun the script?
If that doesn’t help, could you try to execute your script on Colab? It would still use a notebook, but should run on a Linux-based OS, if I’m not mistaken.

Sure, thanks for the ideas! I will keep you posted on how it goes.

@ptrblck, I have created a virtual environment in Anaconda, reinstalled PyTorch, and rerun the script in Jupyter with this new environment. I would say the effect is the same.

However, while playing around I noticed the following. I am iterating over the parameter num_workers in a for-loop; before, I started from num_workers=0 and increased it on each iteration. So it worked for num_workers=0, and then the training froze on the following iterations.

This time I iterated num_workers in descending order, from 4 down to 0 (specifically 4, 2, 1, 0). It works for the first iteration, i.e. for num_workers=4, and then it freezes at the next iteration.

Then I tried Spyder, and there iterating through num_workers seems to work; however, plenty of similar warning messages are thrown into the console:

C:\Users\Admin\.conda\envs\pytorch_env\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)

I also tried Colab and everything works flawlessly (in terms of different values of num_workers being iterated in a for-loop).

Is there anything else I can do to get it working on my machine?

It seems Jupyter is again failing, while Spyder seems to work?
If that’s the case, your best option would be to come up with a small multiprocessing example and file an issue against Jupyter. I’m not sure if there is much more you can do, as the issue sounds like a bad interaction between the notebook and multiprocessing.
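
Something as small as this rough sketch (a made-up dataset, so just an assumption about what a minimal repro could look like) might already be enough to demonstrate the hang, since it drops the model and keeps only a DataLoader with workers:

import torch
from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, i):
        return torch.randn(1, 128, 128)

def main():
    loader = DataLoader(DummyDataset(), batch_size=128, num_workers=4)
    for batch in loader:
        print(batch.shape)   # should print torch.Size([128, 1, 128, 128]) once and stop
        break

if __name__ == '__main__':
    main()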

The numpy warning points towards a nested array with variable lengths. Could you check where this warning is created and try to fix it? I don’t think it’s related to the issue you are seeing, but it might be worth a try.
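
For illustration, the warning is typically triggered by something like the first line below (a made-up example, not your data) and can be avoided by passing an explicit dtype:

import numpy as np

a = np.array([[1, 2, 3], [4, 5]])                # ragged nested sequence -> VisibleDeprecationWarning
b = np.array([[1, 2, 3], [4, 5]], dtype=object)  # explicit dtype, no warning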

@ptrblck Thank you very much for your support! It really helped!

Did you figure out the issue? If so, could you post a quick update? It would be really interesting to see what the problem was.

I will get back to it a bit later and post here once there is some progress. For now it seems that I should fix the warning posted above, and that whenever I want to change the number of workers, I should restart Python/Windows. Another solution would be to install Linux, since Google Colab worked well.