Torch.save() error on larger images

Hello! I keep getting an error when attempting to save my CNN model. I do not have this problem with smaller images like the 32x32 CIFAR dataset; however, my images are 448x672 (a multiple of 224 in each dimension). I am using the model for a regression task. Any help would be much appreciated!
Python 3.7.5
PyTorch 1.6.0
Anaconda

Here is my model:

import torch.nn as nn

class Network_CNN_batchNorm(nn.Module):
    def __init__(self):
        super(Network_CNN_batchNorm, self).__init__()
        # 3x448x672 input image (RGB)
        self.layer1 = nn.Sequential(
            # input is 3 channels (RGB) - first parameter
            # 64 filters of 3x3 kernels; padding = (kernel_size - 1) / 2 preserves the spatial size
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
            # max pooling with stride=2 makes output image 224x336
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.BatchNorm2d(64))
        self.layer2 = nn.Sequential(
            # 2nd layer uses 128 channels (filters) of 3x3
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), 
            nn.BatchNorm2d(128))
        # 3rd layer uses 128 channels (filters) of 3x3
        # output feature map is still 224x336
        self.layer3 = nn.Sequential(
            nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(128))
        # Average Pooling Layer, 112x168 output
        self.avgP1 = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        # Fully connected layers
        self.fc1 = nn.Linear(112 * 168 * 128, 1000) 
        self.fc2 = nn.Linear(1000, 10) # 10 outputs
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.avgP1(out)
        out = out.reshape(out.size(0), -1) # flatten
        out = self.fc1(out)
        out = self.fc2(out)
        return out
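
For scale, here is a quick sanity check of the forward shapes and parameter count (a sketch; note that just instantiating the model allocates roughly 10 GB of RAM, almost all of it for fc1):

import torch

model = Network_CNN_batchNorm()
x = torch.randn(1, 3, 448, 672)  # one RGB image at the stated resolution
print(model(x).shape)  # torch.Size([1, 10])
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters")  # ~2.41 billion, ~2.4 billion of them in fc1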

Note that my training and validation losses decrease over multiple epochs, so training itself appears to be fine. However, when training on my CPU, memory usage is around 30-40 GB, which seems excessive.

The code for saving the model is shown below; I can confirm that the path is OK since it works with smaller image sizes.

torch.save(model.state_dict(), os.path.join(Model_Path, 'epoch-{}.pth'.format(epoch)))

The error I am getting is as follows:

  File "C:\my.py", line 526, in <module>
    model_trained, t_loss, v_loss = train_model(model, criterion, optimizer, trainloader, testloader, num_epochs)

  File "C:\my.py", line 356, in train_model
    torch.save(model.state_dict(), os.path.join(Model_Path, 'epoch-{}.pth'.format(epoch)))

  File "C:\Users\...\anaconda3\envs\TF2.0\lib\site-packages\torch\serialization.py", line 364, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)

  File "C:\Users\...\anaconda3\envs\TF2.0\lib\site-packages\torch\serialization.py", line 477, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)

TypeError: write_record(): incompatible function arguments. The following argument types are supported:
    1. (self: torch._C.PyTorchFileWriter, arg0: str, arg1: str, arg2: int) -> None
    2. (self: torch._C.PyTorchFileWriter, arg0: str, arg1: int, arg2: int) -> None

Invoked with: <torch._C.PyTorchFileWriter object at 0x0000026AD2154D30>, 'data/2657683100064', 2657910136960, -7546077184

Hi,

We made some fixes for this recently. Does it still happen if you use the nightly build?

Thank you for responding. I installed pytorch-nightly (1.8.0.dev20201113) and still have the same error when trying to save the model.

Ok, thanks!
Can you check the sizes of all the tensors in your model's state dict and report them here? I suspect one of them is going to be huge.
Note that if you have a case where you can reproduce it with just

t = torch.rand(your_tensor_size)
torch.save(t, "my_path.pth")

it would be super helpful, and we should open an issue on GitHub.

Looking at it from afar, the issue seems to be an integer overflow because one of your objects is too big. But with a simple repro, we can verify that!

Thank you. Is this what you are asking for?

for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

Model's state_dict:
layer1.0.weight 	 torch.Size([64, 3, 3, 3])
layer1.0.bias 	 torch.Size([64])
layer1.2.weight 	 torch.Size([64])
layer1.2.bias 	 torch.Size([64])
layer1.2.running_mean 	 torch.Size([64])
layer1.2.running_var 	 torch.Size([64])
layer1.2.num_batches_tracked 	 torch.Size([])
layer2.0.weight 	 torch.Size([128, 64, 3, 3])
layer2.0.bias 	 torch.Size([128])
layer2.2.weight 	 torch.Size([128])
layer2.2.bias 	 torch.Size([128])
layer2.2.running_mean 	 torch.Size([128])
layer2.2.running_var 	 torch.Size([128])
layer2.2.num_batches_tracked 	 torch.Size([])
layer3.0.weight 	 torch.Size([128, 128, 3, 3])
layer3.0.bias 	 torch.Size([128])
layer3.2.weight 	 torch.Size([128])
layer3.2.bias 	 torch.Size([128])
layer3.2.running_mean 	 torch.Size([128])
layer3.2.running_var 	 torch.Size([128])
layer3.2.num_batches_tracked 	 torch.Size([])
fc1.weight 	 torch.Size([1000, 2408448])
fc1.bias 	 torch.Size([1000])
fc2.weight 	 torch.Size([10, 1000])
fc2.bias 	 torch.Size([10])
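
That fc1.weight entry is the culprit: 1000 x 2408448 float32 values comes to about 9.6 GB, more than a signed 32-bit integer can represent, which would be consistent with the negative num_bytes in the traceback above. Checking the arithmetic:

numel = 1000 * 2408448      # fc1.weight shape from the state dict above
nbytes = numel * 4          # float32 is 4 bytes per element
print(f"{nbytes:,} bytes")  # 9,633,792,000
print(nbytes > 2**31 - 1)   # True: too big for a signed 32-bit int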

Note: I reduced my image size by half in both dimensions:

transforms.Resize((224, 336), interpolation=Image.NEAREST)

and was able to save the model. Even at this reduced resolution, the saved checkpoint is 2.3 GB!

I am also able to increase the image size to 400x600 and save the model, which comes out at 7.5 GB.
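
Those sizes line up with the fc1 weight matrix (a rough sketch, assuming fc1 is re-dimensioned to match each input resolution; the 2x2 max pool and the stride-2 average pool each halve the spatial dimensions):

def fc1_weight_bytes(h, w, channels=128, out_features=1000):
    # spatial dims shrink 4x overall (2x2 max pool, then stride-2 avg pool)
    return (h // 4) * (w // 4) * channels * out_features * 4  # float32

print(fc1_weight_bytes(224, 336) / 1e9)  # ~2.4 GB, matching the 2.3 GB checkpoint
print(fc1_weight_bytes(400, 600) / 1e9)  # ~7.7 GB, matching the 7.5 GB checkpoint
print(fc1_weight_bytes(448, 672) / 1e9)  # ~9.6 GB, past the signed 32-bit limit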

So running on my CPU gives me the flexibility of 64 GB of RAM, but each of my four GPUs (2070S) has only 8 GB of memory. I have a fairly simple three-layer CNN, yet the memory requirements with 400x600 images are so large that I cannot train it on my GPUs. I am working on a regression problem and would prefer to maintain the resolution of my images. What is done in practice for larger image sizes under GPU memory limitations? With my current CPU, I would have to wait several days to train a model :crazy_face:

I think your first fully connected layer is a bit big, no? The weight size is `torch.Size([1000, 2408448])`, meaning the input feature size is more than 2 million!
Reducing the size of that layer will help drastically with memory usage.
You can add extra pooling or striding in the last convs to reduce this size; see the sketch below.
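
For example (a sketch, not code from this thread): an adaptive average pool collapses the feature map to a fixed, resolution-independent size before the classifier, shrinking fc1 from ~2.4 billion weights to ~6 million:

import torch
import torch.nn as nn

# Hypothetical replacement for avgP1 + fc1: pool to a fixed 7x7 grid so the
# flattened feature size is 7 * 7 * 128 = 6,272 instead of 2,408,448.
avg_pool = nn.AdaptiveAvgPool2d((7, 7))
fc1 = nn.Linear(7 * 7 * 128, 1000)

features = torch.randn(1, 128, 224, 336)  # layer3 output for a 448x672 input
out = fc1(avg_pool(features).flatten(1))
print(out.shape)  # torch.Size([1, 1000])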