Hey everyone,
I am training my networks on my university's grid engine. Unfortunately, I have to restart my script every 30-45 min... The reason for this is that the grid automatically kills jobs after a certain amount of time, so other students can also use the grid engine.
Since one epoch sometimes takes longer than the time available to each student, I have to save the current index of the train loader batch and resume from that position after the restart.
At the moment I am doing it this way:
```python
found_restart = False
for epoch in range(start_epoch, config["num_epochs"]):
    for batch in train_loader:
        if restart_condition:  # time limit almost reached, restart
            save_checkpoint(checkpoint)
            restart_script()
            break
        # no restart, continue training
        idx, (images, labels) = batch
        # after a restart, skip batches until the saved index is reached
        if not found_restart and not torch.all(torch.eq(idx, old_index.cpu())):
            continue
        # last idx position was found, start training
        found_restart = True
        pred = net(images, old_pred)
        (...)
```
What I do is keep loading batches until the indices match up. This is inefficient, since I load a bunch of images without ever using them.
Is there a way to tell the train_loader to start loading at a certain index?
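For concreteness, here is roughly the behaviour I'm after, sketched with a custom `Sampler` that starts at a given offset. All names here (`ResumableSampler`, `start_index`, the toy dataset) are made up for illustration; in my setup `start_index` would come from the checkpoint:

```python
import torch
from torch.utils.data import Dataset, DataLoader, Sampler

class ResumableSampler(Sampler):
    """Yields dataset indices starting at `start_index`, so the
    DataLoader never touches the batches that were already trained on."""
    def __init__(self, data_source, start_index=0):
        self.data_source = data_source
        self.start_index = start_index

    def __iter__(self):
        return iter(range(self.start_index, len(self.data_source)))

    def __len__(self):
        return len(self.data_source) - self.start_index

class ToyDataset(Dataset):
    """Stand-in dataset that just returns its own index."""
    def __len__(self):
        return 10
    def __getitem__(self, i):
        return i

# Resume from sample 6 of 10: only the remaining batches are produced.
loader = DataLoader(ToyDataset(), batch_size=2,
                    sampler=ResumableSampler(ToyDataset(), start_index=6))
print([b.tolist() for b in loader])  # → [[6, 7], [8, 9]]
```

But I'd be happy with any built-in way to make `train_loader` itself start at a given position instead.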
Thanks for any help!