Behavior of the dataloader when resuming training from an existing checkpoint

Hello,

As the title states, I have a question about the behavior of the torch DataLoader when I resume the training process from an existing checkpoint.

My code for loading the dataset and building the dataloader looks like the following. (I’m using Hugging Face’s datasets library.)

import torch
from datasets import load_dataset

train_dataset = load_dataset(args.dataset_loading_script_path,
                             data_files=args.dataset_txt,
                             split='train')
train_dataset = train_dataset.map(
    lambda examples: tokenizer(examples['sent_1'], max_length=args.max_length,
                               padding='max_length', truncation=True),
    batched=True)
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=total_train_batch_size, shuffle=True)

My question is: when I load an existing checkpoint (model and optimizer state) and resume training, is there a way to avoid training on batches (or training examples) that I’ve already trained on?

I’m assuming the torch DataLoader randomly reshuffles the data and feeds batches to the model each time training starts, so simply skipping the number of already-trained steps when iterating through the train dataset with the dataloader would not skip the same examples I trained on before.

If this is the case, will I be able to achieve what I’m trying to do by shuffling the dataset once before feeding it into the dataloader (and saving the dataset so I can reuse it), and dropping the shuffle=True option?
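
Roughly, what I have in mind for resuming is something like the following rough sketch, where shuffled_train_dataset is a placeholder path, trained_steps is a placeholder for the step count I would store in the checkpoint, and I assume the dataset was already shuffled once with a fixed seed and saved:

import itertools
import torch
from datasets import load_from_disk

# Assumption: the dataset was shuffled once (fixed seed) and saved beforehand,
# so every run iterates over the examples in exactly the same order.
train_dataset = load_from_disk('shuffled_train_dataset')
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=total_train_batch_size, shuffle=False)

# Skip the batches covered before the checkpoint was saved.
# Note: islice still reads and collates the skipped batches; it just doesn't yield them.
for batch in itertools.islice(train_dataloader, trained_steps, None):
    ...  # training step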


Are you trying to train your model for only 1 epoch because you have so much data that more would take too long, or are you possibly trying to do 1 epoch because your machine won’t let training finish before everything shuts off, so you’d like to save intermediate progress? (Epoch = a single pass through your entire dataset.) Just asking out of curiosity, no worries if there’s no particular reason.

As for your question, I’d do one of the following:

  1. Drop shuffle=True and, as you train, keep track of an id (either the step number, which represents which batch you are on, or the raw id of the current sample). If you’re using a Hugging Face Trainer instance for your model training, you can use callbacks to do this (add an on_step_end or on_step_begin callback that writes the current step number out to a file; see the callbacks section of the docs). When continuing training, you can slice the examples starting from the id you left off on and ending with the last id of the dataset, then append all the samples you’ve already trained with at the end of this slice (essentially shifting the samples you trained with to the end). If you don’t care about re-using the samples at the end, you can just use PyTorch’s Subset dataset class. A sketch of this option follows the list below.

  2. Keep shuffle=True, but add a small function call when you fetch a sample that writes out the id that’s getting fetched/processed. When continuing training, follow a similar process as above (option 1), but rather than working with a single contiguous slice as you would with shuffle=False, slice out a subset of your dataset using the ids you’ve saved.
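
For option 1, a minimal sketch might look like the following, assuming a Trainer-based setup with no gradient accumulation (so step number * batch size equals the number of samples already seen); SaveStepCallback, last_step.txt, and total_train_batch_size are placeholder names:

import torch
from transformers import TrainerCallback

class SaveStepCallback(TrainerCallback):
    # Write the current global step to a file after every optimizer step.
    def __init__(self, path='last_step.txt'):
        self.path = path

    def on_step_end(self, args, state, control, **kwargs):
        with open(self.path, 'w') as f:
            f.write(str(state.global_step))

# trainer = Trainer(..., callbacks=[SaveStepCallback()])

# When resuming: read the last recorded step and keep only the examples
# that come after the step_count * batch_size samples already seen.
with open('last_step.txt') as f:
    last_step = int(f.read())

remaining = torch.utils.data.Subset(
    train_dataset, range(last_step * total_train_batch_size, len(train_dataset)))
train_dataloader = torch.utils.data.DataLoader(
    remaining, batch_size=total_train_batch_size, shuffle=False)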


Thanks for the super helpful response!

Just to be clear, for the first option you’ve suggested, I should randomly shuffle the order of the training examples when creating the dataset (but drop shuffle=True), right?

My training data is over a million pairs of sentences extracted from Wikipedia, and if I don’t keep my dataset randomly shuffled, batches will consist of adjacent sentences from the same Wiki document (which I’m trying to avoid). And yes, the reason I’m trying to train my model for only 1 epoch is that I have so much data, which leads to a super long training time with the resources currently available to me.


Just to be clear, for the first option you’ve suggested, I should randomly shuffle the order of the training examples when creating the dataset (but drop shuffle=True), right?

That’s a nice idea too! Just make sure when you continue training that the pre-shuffled dataset is in the same order!
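
One way to guarantee the same order (just a sketch, with a placeholder path and seed) is to shuffle once with a fixed seed, save the shuffled dataset to disk, and reload that saved copy in every run instead of re-shuffling:

from datasets import load_dataset, load_from_disk

# One-time preparation: shuffle with a fixed seed and persist the result.
dataset = load_dataset(args.dataset_loading_script_path,
                       data_files=args.dataset_txt, split='train')
dataset = dataset.shuffle(seed=42)
dataset.save_to_disk('shuffled_train_dataset')

# Every run (initial or resumed) reloads the same saved copy,
# so the example order is identical across runs.
dataset = load_from_disk('shuffled_train_dataset')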

My training data is over a million pairs of sentences extracted from Wikipedia

Woah 0.0

Yea I see the reasoning for this now. Neat problem! :slight_smile:

Gotcha! I’ll combine randomly shuffling the order of my sentences with your suggested option 1.

Thanks again! :slight_smile:
