Getting low GPU utilization with high GPU memory usage - how to correctly use DataLoader?

Hi, a few beginner questions:
Using a single 1080 Ti GPU
PyTorch 0.4
My model is a simple feedforward net with 5 hidden layers of 100 ReLU units each.
I have data sets of roughly 50-450 KB; each data set is stored on my regular HD as a .mat or .pt file, with the x’s and y’s stored as PyTorch tensors.
I’m currently getting roughly 3-5% GPU utilization (according to the Windows Task Manager), while dedicated GPU memory usage is around 5 GB out of 11 GB.

Now, here is my code:

    import torch
    import torch.utils.data as data_utils

    # load the stored tensors and shuffle them
    # (rand_idx is assumed to be a random permutation of the sample indices)
    x_train_org, y_train_org = torch.load('data.pt')
    rand_idx = torch.randperm(x_train_org.size(0))
    x_train_org = x_train_org[rand_idx, :]
    y_train_org = y_train_org[rand_idx, :]

    training_data = data_utils.TensorDataset(x_train_org, y_train_org)
    train_data_loader = data_utils.DataLoader(training_data, batch_size=batch_size,
                                              shuffle=True, pin_memory=True)

    for i in range(0, ep_num):
        estimator.train()
        for x, y in train_data_loader:
            x, y = x.to(device), y.to(device)
            pred_log_probs = estimator.forward(x)
            model_optimizer.zero_grad()
            loss1 = cost_func(pred_log_probs.permute([0, 2, 1]), y)
            loss1.backward()
            model_optimizer.step()

        # evaluate on the full training set once per epoch
        estimator.eval()
        pred_log_probs = estimator.forward(x_train_org.to(device))
        train_loss[i + 1] = cost_func(pred_log_probs.permute([0, 2, 1]), y_train_org.to(device)).detach().item()

So my questions are:
Is there something wrong with the flow of my code?

Should I call “.to(device)” on the training data directly, before feeding it to the DataLoader?
Is there any reason for me to use the DataLoader at all with a single GPU, when all the data is already stored as PyTorch tensors?
Is there anything I can do to speed things up?

Try setting num_workers=1 (or more) in your DataLoader.
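For example, reusing the DataLoader call from your snippet:

    train_data_loader = data_utils.DataLoader(training_data, batch_size=batch_size,
                                              shuffle=True, pin_memory=True,
                                              num_workers=2)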

This gives the following error:

An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

Probably because of the way I’m training the model.
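From the error message I guess it’s the usual Windows multiprocessing issue: the worker processes re-import the script, so the entry point has to be guarded by if __name__ == '__main__':. A minimal sketch of what I think the structure would need to look like (main() is just a placeholder name for wherever the training loop lives):

    def main():
        # build the dataset / DataLoader (with num_workers > 0) and run the training loop here
        ...

    if __name__ == '__main__':
        main()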

Since your dataset is tiny, I don’t think multiple workers will help you much.
It seems you are currently just slicing the tensors without applying any transformations.
You could try to load all the data, push it to the GPU beforehand, and just slice the batches manually in your training loop (see the sketch below). Maybe this will speed up your model a bit.
However, your data and model might simply be too small to reach high GPU utilization.
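A rough sketch of that idea, reusing the names from your snippet and assuming the whole dataset fits into GPU memory:

    # move the full dataset to the GPU once, outside the training loop
    x_train_gpu = x_train_org.to(device)
    y_train_gpu = y_train_org.to(device)
    n_samples = x_train_gpu.size(0)

    for epoch in range(ep_num):
        estimator.train()
        perm = torch.randperm(n_samples).to(device)  # reshuffle the indices each epoch
        for start in range(0, n_samples, batch_size):
            idx = perm[start:start + batch_size]
            x, y = x_train_gpu[idx], y_train_gpu[idx]  # GPU-side slicing, no host-to-device copy
            model_optimizer.zero_grad()
            pred_log_probs = estimator(x)  # call the model instance rather than .forward()
            loss = cost_func(pred_log_probs.permute([0, 2, 1]), y)
            loss.backward()
            model_optimizer.step()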

As a small side note: you shouldn’t call the forward method of your model, but the model instance itself (estimator(x)), which will make sure to properly register all hooks etc.


Hi, thanks for the response!
What do you mean by “push it to the GPU beforehand…”? Do you mean to simply call .cuda() on the loaded training data and labels and then go over these tensors in batches?

Yes, exactly. This would skip e.g. the default collate function in the DataLoader, which might add some overhead in such a small use case.
