Training issue: the first epoch takes too long


As you can see in the image, I have problems with training the model. The first epoch takes far too long.
My second problem is that I only use 3.7 GB of the available 40 GB of GPU memory.
[screenshot of the training run]

I have tried using more and fewer workers and setting a higher and lower batch size, but it doesn't help me use more GPU power.

A larger batch size should increase GPU memory usage, so it is odd that changing it does not affect the GPU at all.
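As a quick way to check that on your side, here is a minimal sketch that compares peak GPU memory for two batch sizes (here `model` stands in for your ResNet18 already moved to the GPU, and the batch sizes are arbitrary):

import torch

# Compare peak allocated GPU memory after a forward pass at two batch sizes.
# Measures the forward pass only (no_grad), so it is a rough lower bound.
for bs in (128, 1024):
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(bs, 3, 32, 32, device="cuda")
    with torch.no_grad():
        _ = model(x)
    print(bs, torch.cuda.max_memory_allocated() / 1024**2, "MB peak")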

Can you share more information, such as a code snippet, the dataset, and your configuration parameters, so that we can help you?

The dataset I use is the open-source dataset CIFAR-10. I use the ResNet18 architecture for training.

Which code snippets do you need?


Your pipeline for reading data and training the model :stuck_out_tongue:

import torch
import torch.nn.functional as F


def get_loader(args, train_dataset, test_dataset):
    # Build train/test loaders; ImageNet-100 is shuffled, every other dataset
    # (including CIFAR-10) is loaded without shuffling here.
    if args.dataset == "imagenet100":
        train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=args.bs, shuffle=True, num_workers=4, drop_last=True
        )
        test_loader = torch.utils.data.DataLoader(
            test_dataset, batch_size=args.bs, shuffle=True, num_workers=4
        )
    else:
        train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=args.bs, shuffle=False, num_workers=4, drop_last=True
        )
        test_loader = torch.utils.data.DataLoader(
            test_dataset, batch_size=args.bs, shuffle=False, num_workers=4
        )
    return train_loader, test_loader


def loss_mix(y, logits):
    # Mixup-style loss: y holds the two target labels and the mixing coefficient.
    criterion = F.cross_entropy
    loss_a = criterion(logits, y[:, 0].long(), reduction="none")
    loss_b = criterion(logits, y[:, 1].long(), reduction="none")
    return ((1 - y[:, 2]) * loss_a + y[:, 2] * loss_b).mean()


def acc_mix(y, logits):
    # Per-sample accuracy weighted by the same mixing coefficient.
    pred = torch.argmax(logits, dim=1).to(y.device)
    return (1 - y[:, 2]) * pred.eq(y[:, 0]).float() + y[:, 2] * pred.eq(y[:, 1]).float()

The reason for the first epoch taking so long is probably that you are redownloading the dataset each time. Is this training session on your own computer or cloud hosted?
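If it is being re-downloaded, here is a minimal sketch of pointing torchvision's CIFAR-10 at a persistent folder so the download only happens once (the Drive path is an assumption; any directory that survives between runs works):

import torchvision
import torchvision.transforms as T

# Assumed persistent location, e.g. a mounted Google Drive folder in Colab.
data_root = "/content/drive/MyDrive/cifar10"
transform = T.ToTensor()

# download=True skips the download if the files are already present in data_root.
train_dataset = torchvision.datasets.CIFAR10(
    root=data_root, train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.CIFAR10(
    root=data_root, train=False, download=True, transform=transform
)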

If you want to see the memory footprint increase, you’ll need a much larger batch size, something like 2k+ images per batch. You might even be able to fit the entire dataset on the GPU and still have room to spare. CIFAR images are only 32x32 with 3 channels, which is around 3kB per image.
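To put numbers on that, a rough back-of-the-envelope estimate (assuming the images end up as float32 tensors on the GPU):

# Rough memory estimate for CIFAR-10 batches (sizes assumed, not measured).
bytes_per_image_uint8 = 32 * 32 * 3          # 3,072 bytes (~3 kB) as stored on disk
bytes_per_image_float32 = 32 * 32 * 3 * 4    # 12,288 bytes (~12 kB) as float32

batch_size = 2048
batch_bytes = batch_size * bytes_per_image_float32
print(f"~{batch_bytes / 1024**2:.1f} MB per batch of {batch_size} float32 images")   # ~24 MB

full_dataset_bytes = 50_000 * bytes_per_image_float32
print(f"~{full_dataset_bytes / 1024**3:.2f} GB for all 50k training images")         # ~0.57 GB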


It is cloud hosted; I am using Google Colab, and it has consumed too many resources, so I need to optimize the code so that it trains faster. I even have the premium version, and it takes between 4 and 5 hours to train a model. It only uses 3-4 GB of the GPU, but I haven't found a way to optimize it yet… :frowning:

Can you recommend a batch size?

How many epochs are you letting it run? Your picture earlier shows ~21 seconds per epoch.

Regarding the batch size, you’d have to experiment to see what the max is given the model, optimizer, data size, etc. But that’s not necessarily going to speed anything up for the following reasons:

  1. You may have a bottleneck elsewhere, such as the num_workers of the dataloader (see the sketch after this list);
  2. A GPU has processing cores and RAM. The number of cores often determines how quickly the calculations are performed. You can think of all the calculations on a given batch like a pile of dirt, and the number of cores as the size of your shovel. Increasing the size of the pile doesn't mean you're shoveling any faster. For that, you need a bigger shovel.
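If the loader is the bottleneck, a minimal sketch of the usual throughput knobs (the exact values are assumptions to experiment with, not recommendations; train_dataset is the dataset from your snippet):

import torch

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=512,            # assumed starting point; increase until speed or memory stops improving
    shuffle=True,
    num_workers=8,             # assumed; tune to the number of CPU cores available
    pin_memory=True,           # faster host-to-GPU copies
    persistent_workers=True,   # keep worker processes alive between epochs
    drop_last=True,
)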

One other way you might speed things up is to put your data and model into bfloat16. That will cut the size in half and speed up the calculations at a very insignificant loss in precision (which isn't important for ML, anyway).
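As one way to do that, a sketch of a training step using torch.autocast with bfloat16 for the forward pass (model, optimizer, and train_loader are placeholders for the objects in your own training loop; loss_mix is the function from your snippet):

import torch

device = torch.device("cuda")
model = model.to(device)

for x, y in train_loader:
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    optimizer.zero_grad()
    # Run the forward pass and loss in bfloat16; no gradient scaler is needed for bf16.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
        loss = loss_mix(y, logits)
    loss.backward()
    optimizer.step()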


I let it run for 200 epochs. The duration of one epoch depends on the attack I use. Short info: it's a big project where I compare some ML attacks and defenses on the CIFAR-10 dataset. It's an existing Git repo. The problem is that I need it for my project at university. Besides me, four other people have tried to optimize it, but without success.
I really don't know how I can optimize it, so that's the reason I started this thread. I only started with PyTorch 1.5 months ago.

If you have Colab Pro, I would recommend running nvidia-smi in the console while letting the code run in the notebook. Then you can see whether your model utilizes all of the GPU's computational capacity and decide where the bottleneck is.
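If a terminal isn't handy, a small sketch that polls the same information from inside a notebook cell (assuming nvidia-smi is on the path; the interval and number of samples are arbitrary):

import subprocess
import time

# Print GPU utilization and memory usage a few times while training runs in another cell.
for _ in range(5):
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(out.stdout.strip())
    time.sleep(1)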
