Evaluating a model after weight errors

Hello everyone,
Hope you are doing well.

So I have modified the weights of a given pre-trained model.
When I pass my evaluation dataset through it, I cannot detect any change between before and after the weights were modified with the induced error, even in extreme cases where some of the weights end up as NaN.

I have stored the corrupted weights back into a specific layer of the model as follows:

# set manipulated weight to conv layer
with torch.no_grad():
    conv.weight.copy_(weight_float32)
print(conv.weight)
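
To double-check that the copy actually took effect, one option is to keep a clone of the original weight before the copy and compare it afterwards (a minimal sketch, reusing conv and weight_float32 from above):

original_weight = conv.weight.detach().clone()  # snapshot before the copy

with torch.no_grad():
    conv.weight.copy_(weight_float32)

# prints False as soon as the corruption changed anything
print(torch.equal(original_weight, conv.weight))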

My question is:

  1. Do I need to re-train the model with the corrupted weights and then pass the evaluation dataset through it?
     The model’s weights are definitely corrupted - I double-checked the parameters before and after the change.

I’m not sure I understand the issue completely, so please correct me if I’m missing something.
Based on your description you are manually manipulating the weights (setting them to e.g. invalid values), but the change is not reflected in the output, so you suspect the weights were never updated or are never used.
I don’t know whether these particular parameters are ever used in your model, but your copy code should work, as seen here:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 3, 3)
x = torch.randn(1, 1, 24, 24)
out = conv(x)
print(out.abs().sum())
#tensor(664.0641, grad_fn=<SumBackward0>)

with torch.no_grad():
    conv.weight.copy_(float("nan"))
print(conv.weight)
# Parameter containing:
# tensor([[[[nan, nan, nan],
#           [nan, nan, nan],
#           [nan, nan, nan]]],


#         [[[nan, nan, nan],
#           [nan, nan, nan],
#           [nan, nan, nan]]],


#         [[[nan, nan, nan],
#           [nan, nan, nan],
#           [nan, nan, nan]]]], requires_grad=True)

out = conv(x)
print(out.abs().sum())
# tensor(nan, grad_fn=<SumBackward0>)

@ptrblck

Thank you, really appreciate your help.
Correct, that is what I am doing, with the only difference that not all the values are NaN - I randomly pick values and change them to one.
My code follows the same logic as the one you provided, but I wanted to verify that I am actually copying the weights back into my model correctly - since, as far as I understand, a tensor is immutable, which is why I converted it to a NumPy array first.
If that is true, then:
Does it make sense to re-train the model and then pass the evaluation dataset through it (I imagine the weights would adapt but still end up different from the “original” ones, e.g. if I train for 1 epoch), so that I might detect some changes?
Or should I just pass it through after model.eval()?
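
For reference, the kind of corruption I apply looks roughly like this (a simplified sketch, not my exact code):

with torch.no_grad():
    # pick ~10% of the weight values at random and overwrite them with 1.0
    mask = torch.rand_like(conv.weight) < 0.1
    conv.weight[mask] = 1.0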

I think it depends on your actual use case whether re-training is desired or not.
The main issue I saw in your original post was that it seemed as if the weight changes were not used at all, which should not be the case, as my code snippet shows. Also, setting even a single value would already be reflected in the output without any re-training:

conv = nn.Conv2d(1, 3, 3)
x = torch.randn(1, 1, 24, 24)
out = conv(x)
print(out.abs().sum())
#tensor(664.0641, grad_fn=<SumBackward0>)

with torch.no_grad():
    conv.weight[0, 0, 0, 0].copy_(float("nan"))
print(conv.weight)
# Parameter containing:
# tensor([[[[    nan,  0.0670,  0.3143],
#           [-0.3256, -0.1418, -0.2889],
#           [ 0.2058, -0.0386, -0.0661]]],


#         [[[ 0.1033, -0.0968, -0.1645],
#           [-0.2478,  0.0022, -0.1859],
#           [ 0.1836, -0.2671,  0.1425]]],


#         [[[-0.1508, -0.3039, -0.3118],
#           [ 0.2929, -0.2692, -0.1859],
#           [ 0.1304, -0.0377,  0.0886]]]], requires_grad=True)

out = conv(x)
print(out.abs().sum())
# tensor(nan, grad_fn=<SumBackward0>)

I don’t know where the new values come from, but assuming you are trying to use other pre-trained weights, a re-training might not be necessary.
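
If the goal is only to measure how the corrupted weights affect the metrics, a plain evaluation pass without any re-training is enough, e.g. (a rough sketch; model, eval_loader, and device are placeholders for your own objects):

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data, target in eval_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        pred = output.argmax(dim=1)          # predicted class per sample
        correct += (pred == target).sum().item()
        total += target.size(0)
print(f"accuracy: {correct / total:.4f}")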

@ptrblck
Thank you for your reply.
Just to clarify, I am using the weights of a pre-trained model.
Here are some related pieces of my code.

# reference to the layer's weight parameter (not a copy)
conv_1 = test_model.conv.weight
print(conv_1)
print(conv_1.size())   # shape of the weight tensor
print(conv_1.dtype)    # data type of the weight tensor

Output:

Parameter containing:
tensor([[ 0.0013, -0.0214, -0.0093,  ...,  0.0155,  0.0124, -0.0157],
        [-0.0005,  0.0098,  0.0049,  ...,  0.0025, -0.0032, -0.0067],
        [ 0.0001, -0.0011,  0.0112,  ..., -0.0214, -0.0035, -0.0176],
        ...,
        [ 0.0195, -0.0015, -0.0102,  ...,  0.0002, -0.0010, -0.0098],
        [-0.0070,  0.0201,  0.0172,  ..., -0.0128, -0.0091, -0.0014],
        [-0.0203, -0.0179, -0.0067,  ..., -0.0168, -0.0150,  0.0099]],
       device='cuda:0', requires_grad=True)
torch.Size([100, 2048])
torch.float32
print(conv_1.abs().sum())
# tensor(2260.9946, device='cuda:0', grad_fn=<SumBackward0>)

Then I manipulated some of the values and printed the absolute sum again:

# tensor(nan, grad_fn=<SumBackward0>)
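
As a sanity check, one could also verify that the NaN values really ended up in the parameter and reach the output, e.g. (sample_input is a placeholder for any valid input batch):

print(torch.isnan(conv_1).any())      # True if any weight value is NaN
out = test_model(sample_input)        # forward pass with the corrupted weights
print(torch.isnan(out).any())         # True if the NaNs propagate to the output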

So I believe I do follow the same logic as your code snippet.
If retraining the model is not necessary, what other things can I consider?

Just to clarify: are you setting (some) values in the conv weights to NaN and the output is not showing any invalid values?

No, it does show NaN values.
I think I have now found my mistake.
For some reason I wasn’t comparing the same things - my mistake.
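
For anyone running into the same confusion: the key is to compare the same model on the same input before and after the corruption, roughly like this (a sketch; x is a fixed input batch):

baseline = test_model(x)         # output with the original weights
# ... corrupt the weights here ...
corrupted = test_model(x)        # output with the corrupted weights
print(torch.isnan(corrupted).any())         # the NaNs now show up
print(torch.allclose(baseline, corrupted))  # False once the weights differ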