Hey everyone,
I am training my networks on my university's grid engine. Unfortunately, I have to restart my script every 30-45 min... The reason for this is that the grid automatically kills jobs after a certain amount of time, so other students can also use the grid engine.
Since one epoch sometimes takes longer than the time available to each student, I have to save the current index of the train loader batch and resume from that position after the restart.
At the moment I am doing it this way:
```python
found_restart = False
for epoch in range(start_epoch, config["num_epochs"]):
    for batch in train_loader:
        if restart_condition:  # time limit almost reached, restart
            save_checkpoint(checkpoint)
            restart_script()
            break
        # no restart, continue training
        idx, (images, labels) = batch
        # after a restart, skip batches until the saved index is reached
        if not found_restart and not torch.all(torch.eq(idx, old_index.cpu())):
            continue
        # last idx position was found, start training
        found_restart = True
        pred = net(images, old_pred)
        (...)
```
What I do is keep loading batches until the indices match up. This is inefficient, since I load a bunch of images without ever using them.
Is there a way to tell the train_loader to start loading at a certain index?
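For concreteness, here is roughly the behaviour I'm after, sketched with a custom `Sampler` that starts at a given offset. All names here (`ResumableSampler`, `start_index`, the toy dataset) are made up for illustration; in my setup `start_index` would come from the checkpoint:

```python
import torch
from torch.utils.data import Dataset, DataLoader, Sampler

class ResumableSampler(Sampler):
    """Yields dataset indices starting at `start_index`, so the
    DataLoader never touches the batches that were already trained on."""
    def __init__(self, data_source, start_index=0):
        self.data_source = data_source
        self.start_index = start_index

    def __iter__(self):
        return iter(range(self.start_index, len(self.data_source)))

    def __len__(self):
        return len(self.data_source) - self.start_index

class ToyDataset(Dataset):
    """Stand-in dataset that just returns its own index."""
    def __len__(self):
        return 10
    def __getitem__(self, i):
        return i

# Resume from sample 6 of 10: only the remaining batches are produced.
loader = DataLoader(ToyDataset(), batch_size=2,
                    sampler=ResumableSampler(ToyDataset(), start_index=6))
print([b.tolist() for b in loader])  # → [[6, 7], [8, 9]]
```

But I'd be happy with any built-in way to make `train_loader` itself start at a given position instead.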
Thanks for any help!