Saving Numpy Arrays in the Most Compressed Way

Hello everyone,

I have some large NumPy arrays of shape (4000, 200, 200, 20) and I was looking for the most compressed way to save them. I've tried the .npz format, but each file still takes about 200 MB. These arrays are the inputs to my CNN, and due to their large size I am struggling with a "CUDA out of memory" error.
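For the on-disk side of the question, `np.savez_compressed` is NumPy's built-in zlib-compressed format (plain `np.savez` does not compress). A minimal sketch with a small placeholder array standing in for the real data:

```python
import os
import tempfile

import numpy as np

# Small placeholder array standing in for the (4000, 200, 200, 20) input.
arr = np.zeros((40, 20, 20, 2), dtype=np.float16)

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "arr.npy")
zip_path = os.path.join(tmpdir, "arr.npz")

np.save(raw_path, arr)                   # uncompressed .npy
np.savez_compressed(zip_path, data=arr)  # zlib-compressed .npz

# Repetitive data compresses far below its raw size on disk.
print(os.path.getsize(raw_path), os.path.getsize(zip_path))

# Loading restores the full dense array either way.
restored = np.load(zip_path)["data"]
assert np.array_equal(restored, arr)
```

Note that compression only helps on disk: `np.load` (and later the GPU) still needs the full dense array in memory.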

The "CUDA out of memory" error is raised if your GPU is running out of memory, and is unrelated to how the numpy arrays are stored on disk.
Try to reduce the batch size of your input tensors to lower the memory usage, or alternatively use torch.utils.checkpoint to trade compute for memory.
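A minimal sketch of the checkpointing suggestion (the block and shapes below are made up for illustration, not the poster's model): `torch.utils.checkpoint.checkpoint` discards the wrapped segment's intermediate activations and recomputes them during the backward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical small block standing in for a segment of the real model.
block = torch.nn.Sequential(
    torch.nn.Linear(32, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 32),
)

x = torch.randn(4, 32, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed in backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

assert x.grad is not None  # gradients still flow through the checkpointed segment
```

The saving grows with how much of the network you wrap; the cost is one extra forward pass through each checkpointed segment during backward.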

The batch size is set to 1. A large part of the memory usage occurs when I call my network (Model), before even reaching the forward pass.
In the debugging window:

y = Model(x)
-> class Model(Module):
       def __init__(self):
           # some initializations

       def forward(self, input):
           # A large part of the memory has already been used before this line runs
           tensor, _ = torch.max(input, dim=2, keepdim=False)  # torch.max with dim returns (values, indices)

If the OOM error is raised before the forward pass is executed, this would mean that either your model or the current batch is already too large for your GPU.
Are you pushing the complete dataset to the device?

No, in every epoch only one of these arrays (4000, 200, 200, 20) is fed to the network. There are 200 of these arrays in total in the dataset. The network itself is not a large one; it has 10 convolutional layers in total.

I don’t quite understand the use case, as you’ve mentioned the batch size is 1.
Which dimension of your input would correspond to the batch size and which shape are you feeding to the model?
If the mentioned shape is fed in each epoch to the model, when does the OOM error occur?

The dimension of the input is (1, 4000, 200, 200, 20), in which the first one is the batch size. Here is my watchdog window for each command during debugging:

y = Model(x)  ######## Watchdog GPU memory usage: 1085 MiB
-> class Model(Module):
       def __init__(self):
           ...

       def forward(self, input):
           tensor = input  ######## Watchdog: 8279 MiB

It seems the OOM happens in an intermediate layer, when the GPU memory usage reaches 8469 MiB, but the largest part of the memory is consumed before even entering the forward pass, which I guess is the root of the problem.

Since you’ve mentioned you are using a CNN, I assume you are using nn.Conv3d layers and the input has a channel dimension of 4000?
Could you post your model architecture and your GPU specs (including the memory)?

The input is actually like a video, in which 4000 is the number of frames, the spatial window size is 200 in both the horizontal and vertical dimensions, and 20 is the number of channels. I am using two Tesla V100 PCIe 16GB GPUs.
The model architecture is a stack of two of these layers:

I also tried to test the network without training and the same problem happens during testing.

Is it possible that the size of the input array in each epoch (200 MB on disk) causes the OOM problem? It seems that Variable(input.cuda()) triggers it before we even get into the forward pass.

We would still need to see the model to debug it further.
Could you post it with all necessary arguments and shapes to reproduce the OOM?

4000 * 200 * 200 * 20 * 16 bits ≈ 6.4 GB, so no mystery there. The 200 MB is the zipped .npz size on disk; you can't train a network on that compressed representation.
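The arithmetic above can be checked directly (the shape is from the thread; the dtypes are assumptions, with float32 being PyTorch's default):

```python
# Size of one dense (4000, 200, 200, 20) input once it is expanded in memory.
shape = (4000, 200, 200, 20)
n_elements = 1
for d in shape:
    n_elements *= d

bytes_fp16 = n_elements * 2  # float16: 2 bytes per element
bytes_fp32 = n_elements * 4  # float32 (PyTorch default): 4 bytes per element

print(bytes_fp16 / 1e9)  # 6.4 GB
print(bytes_fp32 / 1e9)  # 12.8 GB -- nearly a whole 16 GB V100 on its own
```

And that is just the input tensor: intermediate activations of each conv layer come on top of it, which matches the watchdog jumping past 8 GiB before the first layer even runs.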