How to load a big dataset file for an NMT task

If it works fine with smaller batch sizes, then the dataset size itself is not the issue. I usually work with smaller batch sizes such as 32 or 64 for text processing tasks, including NMT. There are several posts discussing the effect of batch size on GPU memory consumption (e.g., this thread).

If the dataset size really becomes an issue – although 212MB shouldn't cause any problems – you may also consider splitting the dataset. Say your current training loads the dataset from a CSV file like in this pseudocode:

data_loader = DataLoader('dataset.csv')

for epoch in range(1, 100):
    for batch in data_loader:
        inputs, targets = batch[0], batch[1]
        outputs = model(inputs)
        loss = calc_loss(outputs, targets)
        ....
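For reference, here is a minimal sketch of what such a loader could look like in PyTorch. The two-column CSV layout (source sentence, target sentence) and the class name TranslationCsvDataset are just assumptions for illustration, and the raw strings would still need to be tokenized and numericalized before being fed to the model:

import csv
from torch.utils.data import Dataset, DataLoader

class TranslationCsvDataset(Dataset):
    """Reads the whole CSV into memory (assumes columns: source, target; no header)."""
    def __init__(self, csv_path):
        with open(csv_path, newline='', encoding='utf-8') as f:
            self.pairs = [(row[0], row[1]) for row in csv.reader(f)]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # Returns raw strings; tokenization/numericalization is left out here
        return self.pairs[idx]

data_loader = DataLoader(TranslationCsvDataset('dataset.csv'),
                         batch_size=32, shuffle=True)

With the default collate function, each batch then comes out as a list of source strings and a list of target strings, which matches the batch[0], batch[1] access in the pseudocode above.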

You can split the original CSV file into, say, 3 equally sized chunks and then iterate over them within each epoch:

for epoch in range(1, 100):
    # Go through the chunks one after another so only one is loaded at a time
    for file in ["dataset01.csv", "dataset02.csv", "dataset03.csv"]:
        data_loader = DataLoader(file)
        for batch in data_loader:
            inputs, targets = batch[0], batch[1]
            outputs = model(inputs)
            loss = calc_loss(outputs, targets)
            ....
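In case it is useful, one way to produce those chunk files (the chunk filenames above are just examples) would be a short splitting script along these lines; it assumes the CSV has no header row:

import csv

def split_csv(path, num_chunks=3):
    """Split one CSV into roughly equal chunks named dataset01.csv, dataset02.csv, ..."""
    with open(path, newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f))
    chunk_size = -(-len(rows) // num_chunks)  # ceiling division
    for i in range(num_chunks):
        chunk = rows[i * chunk_size:(i + 1) * chunk_size]
        with open(f"dataset{i + 1:02d}.csv", 'w', newline='', encoding='utf-8') as out:
            csv.writer(out).writerows(chunk)

split_csv('dataset.csv', num_chunks=3)

Splitting by rows rather than by bytes keeps each sentence pair intact, so every chunk remains a valid CSV on its own.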