I am training my networks on my university's "grid engine" cluster. Unfortunately, I have to restart my script every 30-45 min, because the grid automatically kills jobs after a certain amount of time so that other students can also use the grid engine.
Since one epoch sometimes takes longer than the time available to each student, I have to save the current index from the train loader batch and restart from that position.
At the moment I am doing it this way:
```python
found_restart = False
for epoch in range(start_epoch, config["num_epochs"]):
    for batch in train_loader:
        if restart_condition:
            # restart if condition is met
            save_checkpoint(checkpoint)
            restart_script()
            break
        # no restart, continue training
        idx, (images, labels) = batch
        # in case of a restart, skip batches until the position of old_index
        if not torch.all(torch.eq(idx, old_index.cpu())) and not found_restart:
            # load the next batch until the old index position is reached
            continue
        # start training once the last idx position was found
        found_restart = True
        pred = net(images, old_pred)
        (...)
```
What I do is keep loading batches until the indices match up. This is inefficient, since I load a bunch of images without ever using them.
Is there a way to tell the train_loader to start loading at a certain index?
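One approach I have been considering is passing a custom `Sampler` to the `DataLoader` so it only yields indices from the resume point onward, instead of skipping already-loaded batches by hand. A minimal sketch, assuming sequential (non-shuffled) iteration; `ResumableSampler` and `start_index` are names I made up for illustration:

```python
import torch
from torch.utils.data import Sampler

class ResumableSampler(Sampler):
    """Sequential sampler that can resume from a saved dataset index.

    Assumes the dataset is iterated in order (no shuffling); for shuffled
    training you would also need to restore the same permutation,
    e.g. by re-seeding the RNG before each epoch.
    """

    def __init__(self, data_source, start_index=0):
        self.data_source = data_source
        self.start_index = start_index

    def __iter__(self):
        # Yield only the indices from the resume point to the end.
        return iter(range(self.start_index, len(self.data_source)))

    def __len__(self):
        return len(self.data_source) - self.start_index
```

The idea would be to save the current index in the checkpoint, then rebuild the loader after a restart with something like `DataLoader(dataset, batch_size=..., sampler=ResumableSampler(dataset, start_index=saved_index))`, so no images are decoded just to be thrown away. (Note that `shuffle=True` and a custom `sampler` are mutually exclusive in `DataLoader`.)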
Thanks for any help!