PyTorch with RTX 4000, 48 GB VRAM

sigma_x · October 25, 2019, 9:12pm

I work with a very large video dataset (1.5 h of video ≈ 5000 frames, 512x256 each), and no GPU currently available to me can handle that amount of data.

We are planning to purchase the above-mentioned card. If anyone has experience running PyTorch models and DataLoader objects on it, I’d like to hear about it. Most importantly, what is the largest amount of data that can realistically be loaded onto it?

The models we will run are ConvNet+LSTM or GRU.

ptrblck · October 26, 2019, 11:16am

You could compute the data size simply with 5000*512*256*nb_channels*bytes_per_element.
Besides that your model will of course use memory on the device (parameters + intermediate tensors for backpropagation).
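As a rough worked example of that formula, here is a minimal sketch. The frame count and the 512x256 frame size come from the first post; the choice of 3 channels and the two dtypes (uint8 as typically stored on disk, float32 as typically fed to a model) are assumptions for illustration, and `dataset_bytes` is just a hypothetical helper name.

```python
def dataset_bytes(n_frames, height, width, nb_channels, bytes_per_element):
    """Raw tensor size in bytes: one element per pixel per channel."""
    return n_frames * height * width * nb_channels * bytes_per_element

# Assumed: RGB frames (3 channels)
as_uint8 = dataset_bytes(5000, 512, 256, nb_channels=3, bytes_per_element=1)
as_float32 = dataset_bytes(5000, 512, 256, nb_channels=3, bytes_per_element=4)

print(f"uint8:   {as_uint8 / 1e9:.2f} GB")    # 1.97 GB
print(f"float32: {as_float32 / 1e9:.2f} GB")  # 7.86 GB
```

Here an "element" is a single channel value of a single pixel, so an RGB pixel stored as float32 takes 3 * 4 = 12 bytes.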

That being said, usually you would use mini-batches instead of loading the complete dataset onto the GPU, to save the memory for a larger model.
What’s your use case that you want to load the complete data?

sigma_x · October 26, 2019, 2:53pm

ResNet is ~44M parameters, LSTM is ~6M parameters. How much VRAM is that?

Assuming nb_channels=3 and bytes per element (does element=pixel?), the data volume comes to ~16 GB.

The video classifiers I’ve seen were trained on datasets like UCF101 with very short videos. My videos each contain a few very different scenes, and I can’t extract 1 frame/minute, because the LSTM won’t be able to learn the context, so it must be roughly 1 FPS, which in total comes to 5K frames, and they have to be large (e.g. 512x512).

1 video=1 data point.

One more thing I don’t quite understand about DataLoader: if I have a minibatch_size=1, once that one datapoint (video) is fed forward, can I remove it from the GPU to load a new one, so the amount of used VRAM remains the same? This didn’t seem to work that way, but maybe I did something wrong?

sigma_x · October 26, 2019, 3:05pm

Also, when the data is fed through the networks, how does this impact VRAM? Is every layer loaded on the GPU on top of the dataloader with the minibatch and the models?

ptrblck · October 26, 2019, 5:09pm

By default the model parameters will be stored as FP32, so each value will take 4 Bytes.
Note that the activations, which are created during the forward pass, will also use some memory.
The activation shape would depend on the input shape and the layer setup.
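For the parameter counts mentioned above (~44M and ~6M), a back-of-the-envelope estimate could look like the sketch below. The factor of 4 for training is an assumption: it corresponds to keeping weights, gradients, and two optimizer buffers per parameter, as an Adam-style optimizer would. Activations are not included, since they depend on the input shape and the layers.

```python
BYTES_FP32 = 4  # each FP32 value takes 4 bytes

resnet_params = 44_000_000  # approximate count from the question
lstm_params = 6_000_000     # approximate count from the question
total_params = resnet_params + lstm_params

weights_bytes = total_params * BYTES_FP32
# Assumption: weights + gradients + two optimizer state buffers (Adam-style)
training_bytes = weights_bytes * 4

print(f"weights only:        {weights_bytes / 1e9:.1f} GB")   # 0.2 GB
print(f"weights+grads+state: {training_bytes / 1e9:.1f} GB")  # 0.8 GB
```

Under these assumptions the parameters themselves are a small part of the budget; the activations stored for backpropagation are usually what dominates with long video sequences.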

After training on one mini-batch you will be able to reuse the memory and feed the next mini-batch for training. Did you see a growing memory usage?

The model is usually loaded onto the GPU before training. Each batch given from the DataLoader will be pushed to the GPU in the training loop to keep the memory usage low.
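A minimal sketch of that pattern is shown below. The toy dataset, the `TinyVideoNet` model, and all sizes are made up for illustration and are not the poster's actual setup. The dataset stays on the CPU, and only the current mini-batch is moved to the device.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy data kept on the CPU: 8 "videos", 4 frames each, 3x32x32
videos = torch.randn(8, 4, 3, 32, 32)
labels = torch.randint(0, 2, (8,))
loader = DataLoader(TensorDataset(videos, labels), batch_size=2)

class TinyVideoNet(nn.Module):
    """Illustrative ConvNet + LSTM classifier (not a real architecture)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        b, t, c, h, w = x.shape
        # Run the ConvNet on every frame, then restore the time dimension
        feats = self.conv(x.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out[:, -1])  # classify from the last time step

model = TinyVideoNet().to(device)  # the model is moved to the device once
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

losses = []
for x, y in loader:
    x, y = x.to(device), y.to(device)  # only this mini-batch goes to the device
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())  # store a Python float, not the loss tensor
```

One common cause of memory that grows across iterations is keeping tensors that are still attached to the computation graph (for example appending `loss` itself instead of `loss.item()`), since that keeps the graph and its activations alive.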

sigma_x · October 29, 2019, 10:26am

The largest I could load on the GPU was 20 frames from 10 videos, 128x128 each. Any increase (in size or number of frames) led to a CUDA out of memory error.

I tried first batch_size=1 (here 1 data sample is a vector of N frames from 1 video) with a larger number of frames and batch_size=10 with 20 frames. The former always led to CUDA out of memory error, so I decided that somehow the batch stays on GPU (I can’t find another explanation).

ptrblck · October 29, 2019, 12:49pm

How many frames did you use in the former approach?
The memory usage depends also on your model architecture.
E.g. if you are using an RNN, trading the sequence length for the batch size might not result in the same memory usage.
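One way to check this on your own hardware is to measure the peak allocation for both configurations. The sketch below is only illustrative: `peak_cuda_bytes` is a hypothetical helper, the bare LSTM and its sizes are placeholders for the real model, and the function returns `None` when no CUDA device is available.

```python
import torch
from torch import nn

def peak_cuda_bytes(model, batch_size, seq_len, feat=32):
    """Peak CUDA memory for one forward/backward pass, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    device = torch.device("cuda")
    model = model.to(device)
    model.zero_grad()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(batch_size, seq_len, feat, device=device)
    out, _ = model(x)
    out.sum().backward()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device)

# Placeholder model; swap in the real ConvNet + LSTM to get meaningful numbers
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

# Same total number of time steps (200), split differently
long_seq = peak_cuda_bytes(lstm, batch_size=1, seq_len=200)
short_seq = peak_cuda_bytes(lstm, batch_size=10, seq_len=20)
print("1 x 200 frames:", long_seq)
print("10 x 20 frames:", short_seq)
```

Comparing the two numbers for the actual model shows whether one long sequence really costs the same as several shorter ones in a given setup.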