Hi!
The pseudocode of the network looks like:

input: x
x = x * att
y = mlp(x)

As for att: it is a learnable parameter initialized with all ones, all zeros, or random values, and transformed as:

att = mlp(relu(att))
att = softmax(att)
When I initialize att with torch.ones or torch.rand, the network trains properly. However, if I initialize it with all zeros, att stays the same until the end of training. I'm just curious: what is the difference between all ones and all zeros, given that they produce exactly the same output after the softmax at the very beginning (say, [0.25, 0.25, 0.25, 0.25])?
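For reference, here is a minimal runnable sketch of the setup that reproduces what I'm seeing. The dimension (4) and the choice of a single nn.Linear for each mlp are my own simplifications, not necessarily the real architecture; the script prints the gradient that reaches att for the "ones" and "zeros" initializations after one backward pass:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

DIM = 4  # toy feature dimension (assumption for this sketch)

class Net(nn.Module):
    def __init__(self, init):
        super().__init__()
        init_fns = {"ones": torch.ones, "zeros": torch.zeros, "rand": torch.rand}
        # learnable attention vector, initialized as described above
        self.att = nn.Parameter(init_fns[init](DIM))
        # stand-ins for the two MLPs (assumed single Linear layers here)
        self.att_mlp = nn.Linear(DIM, DIM)
        self.out_mlp = nn.Linear(DIM, 1)

    def forward(self, x):
        # att = softmax(mlp(relu(att))), then x is scaled elementwise by it
        a = torch.softmax(self.att_mlp(torch.relu(self.att)), dim=-1)
        return self.out_mlp(x * a)

grads = {}
for init in ("ones", "zeros"):
    net = Net(init)
    net(torch.randn(8, DIM)).sum().backward()
    grads[init] = net.att.grad.clone()
    print(init, grads[init])
```

With this sketch, the gradient printed for the all-zeros initialization is an all-zeros tensor (note that torch.relu has zero gradient at input 0), while the all-ones initialization receives a nonzero gradient, matching the behavior described above.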
Thank you!