Don’t use the .data attribute, as it might yield these unwanted side effects.
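As a minimal sketch of one such side effect (the tensors below are made up for illustration and are not from the original code): an in-place change made through .data is invisible to autograd and silently produces wrong gradients, while the same change made through .detach() is detected during backward.

```python
import torch

# Using .data: the in-place change is not tracked by autograd.
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()
out.data.zero_()                 # overwrites values that sigmoid's backward needs
out.sum().backward()
grad_with_data = a.grad.clone()  # all zeros: wrong, but no error is raised

# Using .detach(): the same in-place change is detected by autograd.
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = b.sigmoid()
out.detach().zero_()
try:
    out.sum().backward()
    detach_raised = False
except RuntimeError:
    detach_raised = True         # autograd reports the in-place modification

print(grad_with_data, detach_raised)
```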
Is your code working now or are you still stuck?
In the second run,
y = autoencoder(x_t,judge=True)
y becomes NaN:
y:tensor([[[nan, nan, nan, …, nan, nan, nan]]], grad_fn=)
Check if the normalization inside the model (in particular the torch.std(x) output) yields valid values and no Infs or NaNs.
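A rough sketch of such a check, assuming a normalization of the form (x - mean) / std (the actual model code is not shown in this thread, so the input, the Conv1d layer, and the helper name all_finite are hypothetical):

```python
import torch
import torch.nn as nn

def all_finite(t):
    # True only if the tensor contains no NaN and no +/-Inf entries
    return bool(torch.isfinite(t).all())

# A constant input has zero standard deviation, so the division is 0 / 0.
x = torch.ones(1, 1, 8)
normalized = (x - x.mean()) / torch.std(x)
print(all_finite(x), all_finite(normalized))

# One common guard (an assumption, not the original model's code): add a small epsilon.
eps = 1e-8
safe = (x - x.mean()) / (torch.std(x) + eps)
print(all_finite(safe))

# The same helper can be used to scan a model's parameters after each step.
layer = nn.Conv1d(1, 4, kernel_size=3)  # stand-in for the real layers
bad_params = [name for name, p in layer.named_parameters() if not all_finite(p)]
print(bad_params)
```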
I don’t think that is the problem. In the second run, I debugged the code and found that the weights of all the nn.Conv1d and nn.ConvTranspose1d layers had become NaN.
ipdb> self.enc1.weight.data
tensor([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan]],
But in the first run, all the weights are valid.
Correction: “run” here actually means epoch, i.e. “in the second epoch” and “in the first epoch”.
Hi ptrblck,
Sorry, I need to use a p-Sigmoid function.
I wrote it below. Could you please help me improve it? I think it would be good to remove the for loops so that it takes less time.
I appreciate your help.
def forward(self,input):
    Sigma=self.Sigma
    Beta=self.Beta
    Alpha=self.Alpha
    Out=torch.zeros(input.shape)
    for ii in range(input.shape[0]):
        for ii1 in range(input.shape[1]):
            for ii2 in range(input.shape[2]):
                for ii3 in range(input.shape[3]):
                    vv=input[ii,ii1,ii2,ii3]
                    Out[ii,ii1,ii2,ii3]=Sigma*(1/(1+torch.exp((-1*(vv*Alpha))+Beta)))
    return Out
I don’t know the shapes of the used tensors, but assuming Sigma, Beta, and Alpha are all scalar tensors, you can directly use the mentioned formula without the loops:
input = torch.randn(10, 10, 10, 10)
Sigma = torch.randn(1)
Beta = torch.randn(1)
Alpha = torch.randn(1)
Out = torch.zeros(input.shape)
for ii in range(input.shape[0]):
    for ii1 in range(input.shape[1]):
        for ii2 in range(input.shape[2]):
            for ii3 in range(input.shape[3]):
                vv=input[ii,ii1,ii2,ii3]
                Out[ii,ii1,ii2,ii3]=Sigma*(1/(1+torch.exp((-1*(vv*Alpha))+Beta)))
out_fast = Sigma*(1/(1+torch.exp((-1*(input*Alpha))+Beta)))
print((out_fast == Out).all())
> True
Many thanks. Indeed, I want to customize the sigmoid, so I wrote my own class.
I would appreciate it if you could take a look; I need to be sure about the code.
class SigmoiidLearn(nn.Module):
    def __init__(self):
        super(SigmoiidLearn, self).__init__()
        self.Alpha = nn.Parameter(torch.ones(1), requires_grad=True)
        self.Beta = nn.Parameter(torch.zeros(1),requires_grad=True)
        self.Sigma = nn.Parameter(torch.ones(1),requires_grad=True)
    def forward(self,input):
        Sigma=self.Sigma
        Beta=self.Beta
        Alpha=self.Alpha
        Out=torch.zeros(input.shape)
        for ii in range(input.shape[0]):
            for ii1 in range(input.shape[1]):
                for ii2 in range(input.shape[2]):
                    for ii3 in range(input.shape[3]):
                        vv=input[ii,ii1,ii2,ii3]
                        Out[ii,ii1,ii2,ii3]=Sigma*(1/(1+torch.exp((-1*(vv*Alpha))+Beta)))
        return Out
class Generator(nn.Module):
    def __init__(self,ngpu,nz,ngf):
        super(Generator, self).__init__()
        self.ngpu=ngpu
        self.nz=nz
        self.ngf=ngf
        self.l1= nn.Sequential( nn.ConvTranspose2d(self.nz, self.ngf * 8, 3, 1, 0, bias=False),
                                nn.BatchNorm2d(self.ngf * 8),
                                nn.ReLU(True))
        self.l2=nn.Sequential(nn.ConvTranspose2d(self.ngf * 8, self.ngf * 4, 3, 1, 0, bias=False),
                              nn.BatchNorm2d(self.ngf * 4),
                              nn.ReLU(True)) # state size. (ngf*4) x 8 x 8
        self.l3=nn.Sequential( nn.ConvTranspose2d( self.ngf * 4, self.ngf * 2, 3, 1, 0, bias=False),
                               nn.BatchNorm2d(self.ngf * 2),
                               nn.ReLU(True))
        self.l4= nn.Sequential( nn.ConvTranspose2d( self.ngf*2, 1, 3, 1, 0, bias=False), nn.BatchNorm2d(1))
        self.l5=nn.Sequential(SigmoiidLearn())
    # forward method
    def forward(self, input1):
        x = self.l1(input1)
        x = self.l2(x)
        x = self.l3(x)
        x = self.l4(x)
        x = self.l5(x)
        return x
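For reference, the loop-free formula from the earlier reply could be dropped into such a module roughly as follows. This is only a sketch: the class name LearnableSigmoid and the lowercase attribute names are hypothetical, and it relies on the identity Sigma / (1 + exp(-Alpha * x + Beta)) = Sigma * sigmoid(Alpha * x - Beta).

```python
import torch
import torch.nn as nn

class LearnableSigmoid(nn.Module):  # hypothetical name, not from the thread
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))
        self.sigma = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # Broadcasting applies the scalar parameters to every element at once.
        return self.sigma * torch.sigmoid(self.alpha * x - self.beta)

act = LearnableSigmoid()
x = torch.randn(2, 1, 9, 9)
out = act(x)
out.sum().backward()  # gradients reach alpha, beta, and sigma
print(out.shape, act.alpha.grad, act.beta.grad, act.sigma.grad)
```

Besides removing the loops, this version never allocates a buffer with torch.zeros(input.shape), which would be created on the default device rather than on the device of the input.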
This is great, @ptrblck. Could you share a similar dummy function for binary cross entropy (BCELoss)? Thanks a lot.
Sure, here is the raw implementation rewritten directly from the docs as well as the stable internal implementation, which doesn’t overflow for large values:
import torch
import torch.nn as nn
import torch.nn.functional as F

def my_bce_with_logits_loss(x, y):
    # raw formula from the docs; overflows for large x
    loss = -1.0 * (y * F.logsigmoid(x) + (1 - y) * torch.log(1 - torch.sigmoid(x)))
    loss = loss.mean()
    return loss

def my_bce_with_logits_loss_stable(x, y):
    # subtract max_val before exponentiating so exp() cannot overflow
    max_val = (-x).clamp_min_(0)
    loss = (1 - y) * x + max_val + torch.log(torch.exp(-max_val) + torch.exp(-x - max_val))
    loss = loss.mean()
    return loss

criterion = nn.BCEWithLogitsLoss()
batch_size = 5
nb_classes = 1

# small values
x = torch.randn(batch_size, nb_classes, requires_grad=True)
y = torch.empty(batch_size, nb_classes).uniform_(0, 1)
loss_reference = criterion(x, y)
loss = my_bce_with_logits_loss(x, y)
loss_stable = my_bce_with_logits_loss_stable(x, y)
print(loss_reference)
> tensor(1.0072, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
print(loss_reference - loss)
> tensor(0., grad_fn=<SubBackward0>)
print(loss_reference - loss_stable)
> tensor(0., grad_fn=<SubBackward0>)

# large values
x = torch.randn(batch_size, nb_classes, requires_grad=True) * 100
y = torch.empty(batch_size, nb_classes).uniform_(0, 1)
loss_reference = criterion(x, y)
loss = my_bce_with_logits_loss(x, y)
loss_stable = my_bce_with_logits_loss_stable(x, y)
print(loss_reference)
> tensor(12.1431, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
print(loss_reference - loss)
> tensor(-inf, grad_fn=<SubBackward0>)
print(loss_reference - loss_stable)
> tensor(0., grad_fn=<SubBackward0>)
Thanks a lot, thank you so much.
I have a question on this, if you don’t mind: the stable version uses the logsumexp trick. Is that like using softmax with 2 classes?
My other question is about which part of the stable version corresponds to (1 - y) * torch.log(1 - torch.sigmoid(x)). In other words, is (1 - y) * torch.log(1 - torch.sigmoid(x)) stable on its own, or is there a better way to implement it?
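A small sketch that may help with both questions, based only on standard identities and not on the internal PyTorch code: sigmoid(x) equals the first entry of a softmax over the two logits [x, 0], and log(1 - sigmoid(x)) equals logsigmoid(-x). The naive form of that term is not stable on its own, because sigmoid(x) rounds to 1 for large x. The test values below are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-100.0, 0.0, 100.0])
y = torch.tensor([0.2, 0.5, 0.7])

# The term from the question: -inf at x = 100, since sigmoid(100) rounds to 1.
naive_term = torch.log(1 - torch.sigmoid(x))
# Same quantity, written so that it stays finite.
stable_term = F.logsigmoid(-x)

# Sigmoid as a two-class softmax over the logits [x, 0].
two_class = F.log_softmax(torch.stack([x, torch.zeros_like(x)], dim=-1), dim=-1)[..., 0]

# The stable loss is softplus(-x) + (1 - y) * x, where softplus(-x) = logsumexp([0, -x]).
loss_softplus = (F.softplus(-x) + (1 - y) * x).mean()
loss_reference = F.binary_cross_entropy_with_logits(x, y)

print(naive_term, stable_term)
print(loss_softplus, loss_reference)
```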
Hi @ptrblck, I have two questions:
- When I multiply the standard MSE loss by a negative value and then call backward, will it enlarge the distance between pred and label, or will it still minimize that distance?
Dummy code:
loss = torch.nn.MSELoss(reduction="none")
loss_v = -10 * loss(pred,label).mean()
loss_v.backward()
I mean, will the .backward() function automatically ignore the -10 outside the loss during backpropagation?
- I have two different models whose outputs are pred1 and pred2, with labels label1 and label2. When I apply a loss function (such as torch.nn.CrossEntropyLoss()) to them and call backward on each loss separately, do I need to instantiate the loss class twice, or can they share the same loss instance without affecting each other?
Dummy code:
pred1 = model1(input1)
pred2 = model2(input2)
# option1
metric = torch.nn.CrossEntropyLoss()
loss1 = metric(pred1,label1)
loss2 = metric(pred2,label2)
# option2
metric1 = torch.nn.CrossEntropyLoss()
metric2 = torch.nn.CrossEntropyLoss()
loss1 = metric1(pred1,label1)
loss2 = metric2(pred2,label2)
loss1.backward()
loss2.backward()
Should I use option1 or option2? If the two models share the same metric function, will they affect each other’s gradients in the backward pass?
I look forward to your reply, thanks.
- The training will try to lower the loss, so if you negate the loss value, the model will diverge and will try to make the loss “more negative”. Here is a simple example showing that the trained parameter just keeps getting more negative, since that decreases the loss:
import torch
import torch.nn as nn

torch.manual_seed(1234)

# standard approach
x = torch.ones(1, 1)
y = torch.ones(1, 1) * 2
lin = nn.Linear(1, 1, bias=False)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(lin.parameters(), lr=1e-3)
for epoch in range(6000):
    optimizer.zero_grad()
    out = lin(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))
print(out, y)

# negative loss
x = torch.ones(1, 1)
y = torch.ones(1, 1) * 2
lin = nn.Linear(1, 1, bias=False)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(lin.parameters(), lr=1e-3)
for epoch in range(6000):
    optimizer.zero_grad()
    out = lin(x)
    loss = criterion(out, y)
    loss = -1. * loss  # negated: the optimizer now pushes out away from y
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))
print(out, y)
- If the criterion doesn’t contain any internal states (such as buffers etc.), then you could just reuse it (as is the case for nn.CrossEntropyLoss).
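To illustrate the second point, here is a small sketch with two stand-in nn.Linear models (the shapes and data are made up): sharing one nn.CrossEntropyLoss instance yields the same gradients for model1 as using a dedicated instance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model1 = nn.Linear(4, 3)   # stand-ins for the two real models
model2 = nn.Linear(4, 3)
input1, input2 = torch.randn(5, 4), torch.randn(5, 4)
label1, label2 = torch.randint(0, 3, (5,)), torch.randint(0, 3, (5,))

# option1: one criterion instance shared by both models
metric = nn.CrossEntropyLoss()
metric(model1(input1), label1).backward()
metric(model2(input2), label2).backward()
shared_grad = model1.weight.grad.clone()

# option2: a dedicated criterion instance for model1
model1.zero_grad()
nn.CrossEntropyLoss()(model1(input1), label1).backward()
separate_grad = model1.weight.grad.clone()

print(torch.allclose(shared_grad, separate_grad))
```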
Hi @ptrblck, thanks for your clear and quick reply.
In my training, I did what you described in answer 1. Actually, I have 3 losses, and the formula is loss1+loss2-loss3.
Dummy code:
loss1_ = AverageMeter()
loss2_ = AverageMeter()
loss3_ = AverageMeter()
criterion1 = torch.nn.CrossEntropyLoss()
criterion2 = torch.nn.MSELoss(reduction="none")
model = ResNet18() # backbone
fc = nn.Linear() # classification layer connected to the backbone
for loop:
    feature1 = model(input1) # a pair of images goes through the same backbone; input1 and input2 are two images of the same category
    feature2 = model(input2)
    pred1 = fc(feature1) # classification layer is also shared
    pred2 = fc(feature2)
    loss1_v = criterion1(pred1,label)
    loss2_v = criterion1(pred2,label)
    loss1_v.backward(retain_graph=True) # I can't find a way to avoid retain_graph=True here, because I need to reuse the graph later; do you have any suggestions?
    loss2_v.backward(retain_graph=True)
    loss3_v = -10 * criterion2(feature1,feature2).mean() # maximize the difference between the extracted features
    loss3_v.backward()
    loss1_.update(loss1_v, 1)
    loss2_.update(loss2_v, 1)
    loss3_.update(loss3_v, 1)
However, during training, loss1 and loss2 get smaller and smaller, but loss3 gets bigger and bigger (“less negative”). I don’t know why, because you also mentioned that the model will try to make the loss “more negative”.
Here is a snippet of the printed losses:
Training: 2021-05-10 19:47:32,997-Speed 1291.78 samples/sec Loss1 38.0934 Loss2 38.0780 Loss3 -0.0298 Epoch: 0 Global Step: 500 Required: 5 hours
Training: 2021-05-10 19:56:11,611-Speed 918.99 samples/sec Loss1 26.1710 Loss2 26.1691 Loss3 -0.0248 Epoch: 1 Global Step: 1000 Required: 5 hours
Training: 2021-05-10 20:04:47,075-Speed 1287.83 samples/sec Loss1 20.0595 Loss2 20.0388 Loss3 -0.0190 Epoch: 2 Global Step: 1500 Required: 5 hours
Training: 2021-05-10 20:13:04,078-Speed 1288.24 samples/sec Loss1 18.3962 Loss2 18.3840 Loss3 -0.0158 Epoch: 2 Global Step: 2000 Required: 5 hours
Training: 2021-05-10 20:21:21,709-Speed 1290.39 samples/sec Loss1 17.2669 Loss2 17.2503 Loss3 -0.0138 Epoch: 3 Global Step: 2500 Required: 5 hours
Training: 2021-05-10 20:22:11,298-Speed 1290.64 samples/sec Loss1 17.1906 Loss2 17.1985 Loss3 -0.0137 Epoch: 3 Global Step: 2550 Required: 5 hours
Training: 2021-05-10 20:30:17,556-Speed 1286.20 samples/sec Loss1 16.4898 Loss2 16.4976 Loss3 -0.0125 Epoch: 4 Global Step: 3000 Required: 5 hours
Training: 2021-05-10 20:38:35,820-Speed 1287.99 samples/sec Loss1 15.7010 Loss2 15.6975 Loss3 -0.0115 Epoch: 5 Global Step: 3500 Required: 5 hours
Training: 2021-05-10 20:46:52,820-Speed 1289.09 samples/sec Loss1 15.9737 Loss2 15.9601 Loss3 -0.0107 Epoch: 5 Global Step: 4000 Required: 5 hours
Training: 2021-05-10 20:55:48,491-Speed 1290.52 samples/sec Loss1 15.7084 Loss2 15.6894 Loss3 -0.0102 Epoch: 6 Global Step: 4500 Required: 5 hours
So my questions are:
- When the model is shared and the optimization should enlarge the feature difference, why does loss3 keep getting bigger?
- When I want to use a third loss to constrain the feature difference rather than the prediction, is there a way to avoid retain_graph=True for more efficient computation?
Thank you!
- I can’t comment on the use case or on whether the general usage of the loss functions is appropriate, but reducing one loss (and updating the parameters with these gradients) might increase another loss.
- You would need to use retain_graph=True if you are calling backward multiple times, in order to avoid deleting the intermediate tensors that are needed to compute the gradients. If you could create a single loss (e.g. by adding all separate losses) and could thus call backward once, you wouldn’t need to use retain_graph=True.
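A minimal sketch of the single-loss approach, using small nn.Linear layers as stand-ins for the backbone and the classifier (all shapes and data are made up, and nn.MSELoss with the default mean reduction replaces reduction="none" followed by .mean()). Autograd routes each term of the sum only to the layers it depends on, so losses attached to different layers can be summed, and the resulting gradients match those accumulated by separate backward calls:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(8, 4)   # stand-in for the shared backbone
fc = nn.Linear(4, 3)         # stand-in for the shared classification layer
criterion1 = nn.CrossEntropyLoss()
criterion2 = nn.MSELoss()

input1, input2 = torch.randn(6, 8), torch.randn(6, 8)
label = torch.randint(0, 3, (6,))

# Single summed loss: one backward call, no retain_graph needed.
feature1, feature2 = backbone(input1), backbone(input2)
loss1 = criterion1(fc(feature1), label)
loss2 = criterion1(fc(feature2), label)
loss3 = -10 * criterion2(feature1, feature2)
(loss1 + loss2 + loss3).backward()
summed_grad = backbone.weight.grad.clone()

# Reference: separate backward calls accumulate the same gradient.
backbone.zero_grad()
fc.zero_grad()
feature1, feature2 = backbone(input1), backbone(input2)
criterion1(fc(feature1), label).backward(retain_graph=True)
criterion1(fc(feature2), label).backward(retain_graph=True)
(-10 * criterion2(feature1, feature2)).backward()
separate_grad = backbone.weight.grad.clone()

print(torch.allclose(summed_grad, separate_grad, atol=1e-6))
```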
Thank you @ptrblck! But regarding your answer 2, in my case
I can’t directly sum losses 1, 2, and 3, because as the dummy code shows, loss1 and loss2 constrain the fc layer’s output, while loss3 constrains the backbone’s output.
Do you mean that I can sum the different losses and then call backward once, even if they constrain the outputs of different layers?
Hi @ptrblck, I want to multiply the CE loss by a learnable parameter. How can I do this?
You could create an nn.Parameter and directly multiply it with the loss output.
However, note that scaling the loss with a trainable parameter might simply drive the parameter negative to “reduce the loss” while the actual loss value could be increasing, so you have to double check your use case and make sure the training cannot “cheat” using this parameter.
Thank you @ptrblck, that is a very kind reminder.
About the nn.Parameter: it is not clear to me how to register this new parameter with the optimizer.
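One possible way to do this, sketched under the assumption of a plain SGD optimizer and a stand-in nn.Linear model (the name loss_weight is hypothetical): pass the new parameter to the optimizer together with the model parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                        # stand-in for the real model
loss_weight = nn.Parameter(torch.tensor(1.0))  # hypothetical learnable loss scale
criterion = nn.CrossEntropyLoss()

# Pass the extra parameter alongside the model parameters.
optimizer = torch.optim.SGD(list(model.parameters()) + [loss_weight], lr=0.1)
# For an already created optimizer, add_param_group would work as well:
# optimizer.add_param_group({"params": [loss_weight]})

x, target = torch.randn(5, 4), torch.randint(0, 3, (5,))
optimizer.zero_grad()
loss = loss_weight * criterion(model(x), target)
loss.backward()
optimizer.step()

# The weight has already moved below its initial value of 1.0.
print(loss_weight.item())
```

Since the cross-entropy value is positive, the gradient with respect to loss_weight is positive too, and the very first step already shrinks the weight. This is the “cheating” behavior mentioned in the previous reply, so some constraint on the parameter would likely be needed in a real setup.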