Dataset size and limited shared memory

I’m trying to train a network on Colab, but I’m running into a memory problem.

The training cannot start because I get the following error message:
RuntimeError: DataLoader worker (pid 12945) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

I’m new to PyTorch and Colab, and I’m not sure whether the problem is really the size of the data or something else in the code.

I use a dataset of 47721 images, about 3.25 GB.

I create three dataloaders:

  • training 60%
  • validation 20%
  • test 20%

For training I use mini-batches of size 32.
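For reference, a 60/20/20 split like this can be done with torch.utils.data.random_split; here is a minimal sketch (fullDataset and the seed are placeholders, not my exact code):

import torch
from torch.utils.data import random_split

# Sketch of the 60/20/20 split (placeholder names, not my exact code)
nTotal = len(fullDataset)
nTrain = int(0.6 * nTotal)
nValid = int(0.2 * nTotal)
nTest = nTotal - nTrain - nValid  # remainder goes to the test split

trainDs, validDs, testDs = random_split(
    fullDataset,
    [nTrain, nValid, nTest],
    generator=torch.Generator().manual_seed(42))  # reproducible split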

I use the free version of Colab, which has about 12 GB of RAM. When I start the runtime, about 5 GB are already occupied, leaving roughly 7 GB free.

As the model, I use a pretrained GoogLeNet.
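For completeness, loading a pretrained GoogLeNet from torchvision and adapting its classifier usually looks roughly like this (a sketch, with num_classes as a placeholder for my own number of classes):

import torch.nn as nn
import torchvision.models as models

# Sketch: pretrained GoogLeNet with the final fully connected layer
# replaced to match my own number of classes (placeholder: num_classes)
model = models.googlenet(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)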

I’m not sure whether I’m doing something wrong when I create the dataloaders; you can find the code below:

from torch.utils.data import DataLoader

def getDataLoader(dataset, batchSize=BATCH_SIZE, shuffle=False, dropLast=False):
  print('Splitting dataset into train and validation datasets...')
  trainDs, validDs = randomSplitDataset(dataset)

  # batchSize=None means "load the whole dataset as a single batch"
  size = len(dataset) if batchSize is None else batchSize

  validDataLoader = DataLoader(validDs,
                               batch_size=size,
                               shuffle=shuffle,
                               num_workers=WORKERS,
                               drop_last=dropLast)

  trainDataLoader = DataLoader(trainDs,
                               batch_size=size,
                               shuffle=shuffle,
                               num_workers=WORKERS,
                               drop_last=dropLast)

  return trainDataLoader, validDataLoader

Please let me know if there is anything else I can share to help figure out whether the problem is my code or really the size of the data.

Try to decrease num_workers, as each worker uses shared memory, which might be causing this issue.
If this doesn’t help, the crash might be unrelated to shared memory usage.
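As a quick check, you could temporarily build the loaders with the workers disabled, e.g. something like:

from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, so no shared memory
# is needed for inter-process communication between worker processes
trainDataLoader = DataLoader(trainDs, batch_size=32, shuffle=True, num_workers=0)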

I tried with num_workers equal to 0, but nothing changed.

What do you mean?
What else could be the cause, and how can I identify it?

I found the problem!

I don’t remember why, but at the beginning of the training function I did the following:

def train(model, source_loader, target_loader, optimizer, epoch, cuda=False):
    print("Training Started")
    model.train()
    results = [] # append loss values at each epoch

    source = list(enumerate(source_loader))
    target = list(enumerate(target_loader))
    train_steps = min(len(source), len(target))

    # start batch training
    for step in tnrange(train_steps):
        _, (source_data, source_label) = source[step]
        _, (target_data, _) = target[step] # unsupervised learning
        if cuda:
            # move to device
            source_data = source_data.cuda()
            source_label = source_label.cuda()
            target_data = target_data.cuda()

        ...

These two lines filled up the memory, because list(enumerate(loader)) iterates the whole DataLoader up front and keeps every batch tensor in RAM at the same time:

    source = list(enumerate(source_loader))
    target = list(enumerate(target_loader))
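If you need to step through two loaders in lockstep, a lazy alternative (just a sketch, not what I ended up using) is to zip them, so only the current pair of batches is held in memory:

# zip stops at the shorter loader and never materializes all batches
for (source_data, source_label), (target_data, _) in zip(source_loader, target_loader):
    ...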

I rewrote the code in this way:

def train(model, trainDataloader, optimizer, epoch, cuda=False):
  print('Training started')
  model.train()
  results = [] # append loss values at each epoch

  step = 0

  # start batch training
  # go over dataloader batches, labels
  for sourceData, sourceLabel in trainDataloader:
    if cuda:
        # move to device
        sourceData = sourceData.cuda()
        sourceLabel = sourceLabel.cuda()

    ...

Now everything works as it should.