How to load a big dataset file for an NMT task

Dear @all,

I have an NMT dataset of 199 MB for training and 22.3 MB for the dev set; the batch size is 256 and the max length of each sentence is 50 words. The data loads into GPU RAM without any problems, but when I start training I get an out-of-memory error.

from torchtext.data import Field, TabularDataset, BucketIterator  # torchtext.legacy.data in newer torchtext versions

SRC = Field(tokenize=normalizeString, init_token='<sos>', eos_token='<eos>', fix_length=50, batch_first=True)
TRG = Field(tokenize=normalizeString, init_token='<sos>', eos_token='<eos>', fix_length=50, batch_first=True)

train_data, valid_data = TabularDataset.splits(path='./data/', train='SCUT_train.csv',
    validation='SCUT_.csv', format='csv',
    fields=[('src', SRC), ('trg', TRG)], skip_header=True)

SRC.build_vocab(train_data, min_freq = 2) 
TRG.build_vocab(train_data, min_freq = 2)

BATCH_SIZE = 128

train_iterator, vali_iterator = BucketIterator.splits((train_data, valid_data), sort_key=lambda x: len(x.src),
     batch_size = BATCH_SIZE, device = device)

The data is not that huge: 221.3 MB in total.

I have found many techniques online for loading huge datasets (GBs), but they are almost all for image processing tasks.

Kindly, any suggestion to fix this issue? And what is the optimal way to slice and load a big dataset file for a machine translation task?

Regards,

Does the memory error occur right when training starts, or only after some time? If the latter, there might be a memory leak in the training loop itself.

Did you try with (much) smaller batch sizes?
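Regarding the memory-leak possibility: one common cause in PyTorch training loops (not necessarily yours) is accumulating the loss tensor itself rather than a plain Python number, which keeps every batch's computation graph alive:

# leaks: total_loss keeps a reference to each batch's graph
total_loss += loss

# fine: .item() detaches to a plain Python float
total_loss += loss.item()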


Yes sir @vdw, it works well when I reduce the batch size and the sentence length. I believe it is a dataset size issue.

Currently my dataset is small (212.3 MB) and I already can't use the whole training set; what if I use 1 GB of data or more?

If it works fine with smaller batches, then the dataset size is not the issue. I usually work with smaller batches such as 32 or 64 for text processing tasks, including NMT. There are several posts discussing the effect of batch size on GPU memory consumption (e.g., this thread).
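With the BucketIterator setup from your first post, that only means lowering BATCH_SIZE; a sketch reusing your own variable names:

BATCH_SIZE = 32  # try 32 or 64 instead of 128/256

train_iterator, vali_iterator = BucketIterator.splits(
    (train_data, valid_data),
    sort_key=lambda x: len(x.src),
    batch_size=BATCH_SIZE,
    device=device)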

If the dataset size really becomes an issue (212 MB shouldn't cause any), you may also consider splitting the dataset. Say your current training loads the dataset from a CSV file like in this pseudocode:

data_loader = DataLoader('dataset.csv')  # stands in for however you currently load and batch the CSV

for epoch in range(1, 100):
    for batch in data_loader:
        inputs, targets = batch[0], batch[1]
        outputs = model(inputs)
        loss = calc_loss(outputs, targets)
        ...

You can split the original CSV file into, say, 3 equally sized chunks and do the following:

    for file in ["dataset01.csv", "dataset02.csv", "dataset03.csv"]:
        data_loader = DataLoader(file)
        for batch in data_loader:
            inputs, targets = batch[0], batch[1]
            outputs = model(inputs)
            loss = calc_loss(outputs, targets)
            ....
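A minimal runnable sketch of that idea with plain PyTorch and pandas. The chunk file names, the 'src'/'trg' column names, and the numericalize() helper are placeholders for whatever preprocessing you already have; model and calc_loss are the same stand-ins as in the pseudocode above.

import pandas as pd
from torch.utils.data import Dataset, DataLoader

# One-off: split the original CSV into 3 roughly equal chunk files.
df = pd.read_csv('dataset.csv')
n_chunks = 3
chunk_size = (len(df) + n_chunks - 1) // n_chunks  # ceil division
for i in range(n_chunks):
    df.iloc[i * chunk_size:(i + 1) * chunk_size].to_csv(f'dataset{i + 1:02d}.csv', index=False)

class ChunkDataset(Dataset):
    """Holds only one chunk in memory at a time."""
    def __init__(self, csv_file):
        chunk = pd.read_csv(csv_file)
        # numericalize() is a placeholder: tokenize, map tokens to vocab ids,
        # and pad/truncate to a fixed length (cf. fix_length=50 in your Fields),
        # returning a LongTensor so the default collate can stack batches.
        self.src = [numericalize(s) for s in chunk['src']]
        self.trg = [numericalize(t) for t in chunk['trg']]

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        return self.src[idx], self.trg[idx]

for epoch in range(1, 100):
    for file in ["dataset01.csv", "dataset02.csv", "dataset03.csv"]:
        data_loader = DataLoader(ChunkDataset(file), batch_size=32, shuffle=True)
        for inputs, targets in data_loader:
            outputs = model(inputs)
            loss = calc_loss(outputs, targets)
            ...

Shuffling within each chunk (and, if you like, shuffling the order of the chunk files each epoch) gives a reasonable approximation of shuffling the full dataset.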

This is a huge help sir, thank you :slight_smile: