Does the transform operate on all of the data at once, or does it run each time a batch of data is loaded?

For example, I have a set of 10,000 1080p pictures, I resize them to 224 in the transform, and batch size = 20. While the program is running, are all 10,000 pictures resized to 224 at once, or only 20 images at a time?

    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean, std)])

I'm asking because I have 200,000 1080p pictures. If I use 200 of them to train the network, there is no problem, but if I train on all 200,000 pictures, the GPU runs out of memory.

I want to know how the transform works.

The images will typically be transformed one by one in your dataset class (inside __getitem__). It’s also possible to do it per batch in the dataloader’s collate function.

All 200,000 images will not be resized at the same time.
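
For illustration, here is a minimal sketch of a custom dataset (the class name and the image_paths / labels arguments are made up for the example). The transform runs inside __getitem__, so only the single sample being loaded is transformed, and the dataloader then collects batch_size of those samples into one batch:

    from PIL import Image
    from torch.utils.data import Dataset, DataLoader

    class MyImageDataset(Dataset):  # hypothetical example dataset
        def __init__(self, image_paths, labels, transform=None):
            self.image_paths = image_paths
            self.labels = labels
            self.transform = transform

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            # Only this one image is loaded and transformed here,
            # never the whole dataset.
            img = Image.open(self.image_paths[idx]).convert('RGB')
            if self.transform is not None:
                img = self.transform(img)
            return img, self.labels[idx]

    # The DataLoader calls __getitem__ batch_size times per batch,
    # so with batch_size=20 at most 20 images are cropped at once:
    # train_loader = DataLoader(dataset, batch_size=20, shuffle=True)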

Thank you. Can you tell me why the network can run with 200 images, but not with 10,000 images?

I always thought that increasing the dataset size would only increase the training time. If the batch size stays the same, the GPU memory usage should also stay the same, right?

Yes, you are correct in that the GPU memory usage should remain the same. If it’s increasing with every batch, you most likely have a bug in the program.
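
If you want to check that, here is a rough sketch (reusing your existing net, optimizer and train_loader, and PyTorch’s torch.cuda.memory_allocated) that prints the allocated GPU memory after every batch; without a leak, the number should plateau after the first few batches:

    import torch
    import torch.nn.functional as F

    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        output = net(data)
        loss = F.cross_entropy(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Memory currently allocated by tensors on the default GPU, in MB.
        print(batch_idx, torch.cuda.memory_allocated() / 1024 ** 2)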

Do you have such a problem?

Yes, I have this problem. If I use 200 images for training, the GPU goes from 1800MB to 4848MB in about 10 seconds, but if I use 10,000 images for training, the GPU goes from 1800MB to out of memory in about an hour.

Do you know how to solve it?

Yes, I think I can help you :slight_smile:

  1. Make sure you do the backpropagation: loss.backward()
  2. Don’t hold on to the images, labels, loss or any other GPU tensor in another list. If you want to save e.g. the loss, save loss.item() instead so the GPU can free the memory (see the short sketch after this list).
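
For point 2, a minimal self-contained sketch of the difference (the tensors here are just stand-ins, not taken from your code; on the GPU, the first pattern keeps device memory allocated for the whole run):

    import torch

    kept_tensors = []   # pattern that can leak GPU memory
    kept_floats = []    # pattern that lets the GPU free memory

    for step in range(3):
        # Stand-in for a real forward pass; requires_grad means `loss`
        # carries autograd history like a real training loss.
        x = torch.randn(10, requires_grad=True)
        loss = (x ** 2).mean()

        kept_tensors.append(loss)        # keeps the tensor and its history alive
        kept_floats.append(loss.item())  # a plain Python float, nothing is kept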

If this doesn’t solve your problem, please post the code of your training loop :slight_smile:

Thank you so much! Yes, I have loss.backward(), but I don’t quite understand your second point. Below is my training loop code:

    def train():
        net.train()
        loss_avg = 0.0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = torch.autograd.Variable(data.cuda()), torch.autograd.Variable(target.cuda())

            # forward
            output = net(data)

            # backward
            optimizer.zero_grad()
            loss = F.cross_entropy(output, target)
            loss.backward()
            optimizer.step()

            # exponential moving average
            loss_avg = loss_avg * 0.2 + float(loss) * 0.8

        state['train_loss'] = loss_avg

Try this code instead. It drops the deprecated Variable wrapper and uses loss.item() to store the loss as a plain Python number:

    def train():
        net.train()
        loss_avg = 0.0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.cuda(), target.cuda()

            # forward
            output = net(data)

            # backward
            optimizer.zero_grad()
            loss = F.cross_entropy(output, target)
            loss.backward()
            optimizer.step()

            # exponential moving average
            loss_avg = loss_avg * 0.2 + loss.item() * 0.8

        state['train_loss'] = loss_avg

Thanks for the reply. I tested it on two GPUs. After changing the train loop, the run with 200 images still went from 1686MB to 4848MB in about 10 seconds, while the run with 10,000 images has not increased so far.


I also noticed something with batch_size=2 after increasing the training data from 200 to 550 images: when I start the training loop, print(batch_idx) outputs 0, 1, 2, ..., 274 first, and only then does the GPU memory go from 1686MB to 4848MB. With batch_size=4 the output is 0, 1, 2, 3, ..., 137 before the jump.
I think the GPU memory only increases after all of the data has been iterated over.

This seems abnormal, right? I tested other programs and found that the GPU is occupied as soon as the first batch runs, instead of only being used after all batches have been loaded.

It’s a bit hard for me to understand you. What do you think is abnormal?
The batches are supposed to be processed one by one, not all at the same time.

I mean that the GPU should process one batch of data at a time, right? But it seems that in this program the GPU processes all of the input at once, so it is fine with 200 pictures, but with 10,000 pictures the memory is exceeded.

You are correct that it should process one batch at a time. If it doesn’t, that’s a problem. Maybe you have a bug in your dataset/dataloader then? It’s hard to say with the information I have.
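
One quick check (just a sketch, using your existing train_loader) is to pull a single batch and inspect it before anything is moved to the GPU; the first dimension should be the batch size, not the whole dataset:

    # Grab exactly one batch from the loader and inspect it.
    data, target = next(iter(train_loader))
    print(data.shape)   # expected: something like [batch_size, 3, 224, 224], not [10000, ...]
    print(data.device)  # expected: cpu, until you call .cuda() yourself

If that looks right, the loader is handing out one batch at a time and the memory growth is coming from somewhere else.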