Since I am not very familiar with the clamp function, I am wondering whether there is a good way to avoid exploding values. Will the clamp affect backpropagation?

First, you should modify the structure of your overall computation so that
you can use log_softmax() for this step. Whether you write your own
version or use pytorch’s, log_softmax() will almost certainly lead to a
numerically more stable computation as it is much less likely to “explode”
or underflow to zero.
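
To illustrate the stability point, here is a minimal sketch comparing a hand-written log(softmax(x)) with the built-in log_softmax(). The logit values are an assumption, chosen only to force an overflow in float32; they are not taken from the original poster's model.

```python
import torch

# Illustrative logits (an assumption): large enough that exp() overflows in float32.
logits = torch.tensor([[1000.0, 0.0, -1000.0]])

# Naive version: exp(1000) overflows to inf, and inf / inf produces nan.
exp = torch.exp(logits)
naive = torch.log(exp / exp.sum(dim=1, keepdim=True))

# Built-in version: works in log space, so the result stays finite.
stable = torch.log_softmax(logits, dim=1)

naive_has_nan = bool(torch.isnan(naive).any())
stable_is_finite = bool(torch.isfinite(stable).all())
print(naive_has_nan, stable_is_finite)
```

The naive result contains nan and -inf entries, while the built-in result is finite for the same input.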

Also, you should figure out how to structure your computation to use
pytorch’s version (whether it be log_softmax() or softmax()), rather
than reinvent the wheel by writing your own. Why do you need to write
your own version? If worst were to come to worst, couldn’t you just
implement “your own version” by writing a wrapper around pytorch’s?

Hi, Frank,
I appreciate your kind replies.
I was trying to implement neighborhood attention (as in a Graph Attention Network), where softmax cannot be applied directly. You have to use scatter_add (segment sum in TensorFlow) and calculate the denominator manually.
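
For reference, one common way to keep such a manual softmax from exploding without clamp is to subtract each neighborhood's maximum score before calling exp(). The sketch below is only an illustration under assumptions: the function name segment_softmax, its arguments, and the example values are hypothetical, and it relies on Tensor.scatter_reduce with reduce="amax", which I believe needs a reasonably recent PyTorch (1.12 or later).

```python
import torch

def segment_softmax(scores, index, num_segments):
    """Softmax over the entries of `scores` that share the same value in `index`.

    Hypothetical helper: `scores` is a 1-D float tensor, `index` a 1-D long
    tensor of the same length giving the segment (neighborhood) of each entry.
    """
    # Per-segment maximum, computed on detached scores. Softmax is invariant
    # to this shift, so treating the maximum as a constant is safe.
    seg_max = torch.full((num_segments,), float("-inf"),
                         dtype=scores.dtype, device=scores.device)
    seg_max = seg_max.scatter_reduce(0, index, scores.detach(), reduce="amax")

    # After the shift every exponent is <= 0, so exp() cannot overflow.
    exp = torch.exp(scores - seg_max[index])

    # Denominator: per-segment sum of the exponentials via scatter_add.
    denom = torch.zeros(num_segments, dtype=scores.dtype, device=scores.device)
    denom = denom.scatter_add(0, index, exp)
    return exp / denom[index]

# Example values (assumed): two neighborhoods, the first with very large scores.
scores = torch.tensor([1000.0, 1001.0, -5.0, 2.0, 3.0], requires_grad=True)
index = torch.tensor([0, 0, 1, 1, 1])
alpha = segment_softmax(scores, index, num_segments=2)

# Each neighborhood's attention weights should sum to 1.
sums = torch.zeros(2).scatter_add(0, index, alpha.detach())
print(alpha, sums)
```

Since nothing is clamped here, gradients flow to every score. By contrast, torch.clamp passes a zero gradient to any element that falls outside the clamping range, so clamping the scores would affect backpropagation for those elements.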