A minimal and executable code snippet would be great.
Could you try to remove unnecessary functions and use some random inputs, so that we can reproduce this issue locally?
You can see that the first call to scheduler.step() must have been fault-free, because the following print statement was executed and the model evaluation also ran. The error must have occurred on the second call.
Thanks for the code so far.
Could you also post the code you are using to initialize the model, optimizer, and scheduler?
Also, could you try to run the code on the CPU only and check if you see the same error?
If not, could you rerun the GPU code using CUDA_LAUNCH_BLOCKING=1 python script.py args and post the stack trace again?
Sorry for replying a few days late; I am a sophomore in university and have been very busy recently, so I couldn't look into this problem for a few days. As you requested, I added that line of code. It returned the value successfully without any problems during the first epoch, but in the second epoch it reported an error.
Thank you for answering my question so patiently. My code should not have this problem.
If I use the following code, the error does not appear (it does appear when I enable the commented-out line instead):
for epoch in range(epochs):
    for batch in train_loader:
        # scheduler.step()  # error appears when stepping here, per batch
        ...
    scheduler.step()  # no error when stepping once per epoch
Here is my complete training code (the error appears when you move scheduler.step() into each batch iteration):
for epoch in range(epochs):
    model.train()
    epoch_loss = 0
    epoch_mask_loss = 0
    epoch_label_loss = 0
    for image, mask, label in train_dl:
        optimizer.zero_grad()
        r = np.random.rand(1)
        # cutmix transform
        if r > threshold:
            lam = np.random.beta(50, 50)
            image, mask, cutmix_label = make_cutmix(image, mask, lam)
            image = image.to(device).float()
            mask_prediction, label_prediction = model(image)
            label_prediction = label_prediction.to('cpu')
            label_loss = lam * label_criterion(label_prediction, label) \
                + (1 - lam) * label_criterion(label_prediction, cutmix_label)
        else:
            image = image.to(device).float()
            mask_prediction, label_prediction = model(image)
            label_prediction = label_prediction.to('cpu')
            label_loss = label_criterion(label_prediction, label)
        mask_prediction = torch.sigmoid(mask_prediction)
        mask_prediction = mask_prediction.to('cpu')
        mask_loss = mask_criterion(mask_prediction, mask)
        epoch_mask_loss += mask_loss
        epoch_label_loss += label_loss
        loss = label_loss + mask_loss
        epoch_loss += loss
        loss.backward()
        optimizer.step()
    epoch_loss = epoch_loss / total_steps
    epoch_label_loss = epoch_label_loss / total_steps
    epoch_mask_loss = epoch_mask_loss / total_steps
    print('epoch:{},epoch_loss:{},epoch_label_loss:{},epoch_mask_loss:{}'.format(
        epoch, epoch_loss, epoch_label_loss, epoch_mask_loss))
    model.eval()
    metric, metric2, valid_loss = evalue(model, valid_dl)
    if metric2 > best_score:
        state = {'state': model.state_dict(), 'best_score': metric2}
        torch.save(state, checkpoint_path)
        best_score = metric2
    logging.warning('epoch_loss:{},metric1:{},metric2:{}'.format(epoch_loss, metric, metric2))
    scheduler.step()
In the end it seems like the number of epochs you had mentioned in your scheduler was less than the number of epochs you tried training for.
I went into %debug in the notebook and tried calling self.get_lr() as suggested.
I got this message:
*** ValueError: Tried to step 3752 times. The specified number of total steps is 3750
Then, with some basic math and a lot of code searching, I realised that I had specified 5 epochs in my scheduler but trained for 10 epochs in my fit function.
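That mismatch can be reproduced with a minimal mock of the step-count check. The class, the 750 batches per epoch, and the exact error count are illustrative assumptions, not PyTorch's actual implementation (whose counter can differ by one from this sketch):

```python
class MockOneCycle:
    """Minimal sketch of the step-count guard a OneCycleLR-style scheduler performs."""

    def __init__(self, total_steps):
        self.total_steps = total_steps
        self.step_count = 0

    def step(self):
        self.step_count += 1
        if self.step_count > self.total_steps:
            raise ValueError(
                'Tried to step {} times. The specified number of total steps is {}'
                .format(self.step_count, self.total_steps))


# scheduler configured for 5 epochs * 750 batches = 3750 steps...
sched = MockOneCycle(total_steps=5 * 750)
try:
    # ...but the fit loop runs 10 epochs, i.e. 7500 scheduler.step() calls
    for _ in range(10 * 750):
        sched.step()
except ValueError as e:
    print(e)  # Tried to step 3751 times. The specified number of total steps is 3750
```

The fix is simply to make the scheduler's step budget match what the training loop will actually call.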
If get_lr() throws an error, PyTorch suppresses it but will later run into this unbound "values" bug. Fix the get_lr() error and this bug will go away.
Python has lexical scoping by default, which means that although an enclosed scope can read values from its enclosing scope, it cannot rebind them (unless they are declared with the global keyword, or nonlocal for an enclosing function scope). A closure binds values in the enclosing environment to names in the local environment. The local environment can then use the bound value, and even reassign that name to something else, but it can't modify the binding in the enclosing environment.

The UnboundLocalError happens because when Python sees an assignment to a name anywhere inside a function, it treats that name as local for the entire function body and will not fetch its value from the enclosing or global scope when the function runs. To rebind a global variable inside a function, you must declare it with the global keyword.
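As a small self-contained illustration of that rule (the counter variable here is just an example, unrelated to the scheduler code above):

```python
counter = 0

def broken_increment():
    # Python sees the assignment below, so 'counter' is treated as local
    # for the whole function; reading it before assignment then fails.
    counter = counter + 1   # raises UnboundLocalError
    return counter

def working_increment():
    global counter          # explicitly bind 'counter' to the module-level name
    counter = counter + 1
    return counter

try:
    broken_increment()
except UnboundLocalError as e:
    print('broken:', e)

print('working:', working_increment())  # working: 1
```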
Hello, I ran into the same error and can't solve it.
I want to ask you some questions about it. Thank you.
What does 'The specified number of total steps is 3750' mean?
How to change the number of steps?
Thank you.
Hi, make sure that your dataloader and the scheduler have the same number of iterations. If I remember correctly, I got this error when using the OneCycleLR scheduler, which needs you to specify the maximum number of steps as an init parameter. Hope this helps! If this isn't the error you have, then please provide code and check what your scheduler.get_lr() method returns.
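To make that concrete, here is a minimal sketch with a toy model (the Linear layer, learning rate, and 2 epochs * 3 batches are assumptions for illustration). OneCycleLR is stepped once per batch, so its step budget must equal epochs * batches per epoch:

```python
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# epochs * steps_per_epoch = 2 * 3 = 6 total scheduler steps
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=2, steps_per_epoch=3)

for epoch in range(2):
    for batch in range(3):
        optimizer.step()    # (forward/backward omitted in this sketch)
        scheduler.step()    # one step per batch, 6 in total

# a seventh call exceeds the budget and raises the ValueError from this thread
try:
    scheduler.step()
except ValueError as e:
    print(e)
```

If the training loop runs for more epochs than the scheduler was configured for, the extra scheduler.step() calls trigger exactly this error.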