The data I am trying to analyze is a set of graphs, represented as a large number of weighted adjacency matrices. Each value in an adjacency matrix is the "strength of connection" between two vertices, and the diagonal also holds large count values, larger than 200. I am trying different kinds of VAE to get a latent variable z for clustering those matrices, but the loss is NaN from the very beginning. With VGAE (variational graph autoencoder), training can't even finish one epoch before NaNs appear in the parameters; with a convolutional VAE, everything is just NaNs (the code for the CVAE is here: https://github.com/sshen82/cvae/blob/main/cvae.py). The values in the adjacency matrices are Poisson counts with actual meaning, so I don't want to simply normalize them to 0-1. I know there are methods that work directly on counts, for example scvi for scRNA-seq (https://github.com/YosefLab/scvi-tools), where each value is a gene expression count and so definitely goes above 1, but I don't understand what they do to overcome the problem. Is it that only graph VAEs and convolutional VAEs are sensitive to count values, or is there something specific those methods do that avoids the problem?
Try replacing the final ReLU activation (in the decoder) with softplus.
Place a breakpoint at the F.poisson_nll_loss line and check whether the network's output range is sensible.
Use torch.autograd.set_detect_anomaly(True) to find where NaNs appear.
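To make the softplus suggestion concrete, here is a minimal pure-Python sketch of one way a ReLU output can break a Poisson loss. It is an illustration only: the helper names (relu, softplus, poisson_nll) and the example numbers are made up for this post, it is not how PyTorch implements F.poisson_nll_loss, and the thread does not confirm that this is the exact cause of the NaNs in the linked model.

```python
import math

def relu(x):
    # Can return exactly 0.0, which is not a valid Poisson rate for a nonzero count.
    return max(0.0, x)

def softplus(x):
    # Numerically stable log(1 + exp(x)); positive unless exp(-|x|) underflows
    # for extremely negative x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def poisson_nll(rate, count):
    # Poisson negative log-likelihood without the log(count!) term,
    # which does not depend on the rate.
    if rate == 0.0:
        # log(0) is -infinity, so any nonzero count has infinite loss.
        return math.inf if count > 0 else 0.0
    return rate - count * math.log(rate)

raw_output = -3.0   # hypothetical pre-activation value from the decoder
count = 250         # hypothetical observed count, e.g. a diagonal entry

print("ReLU rate:", relu(raw_output), "loss:", poisson_nll(relu(raw_output), count))
print("softplus rate:", softplus(raw_output), "loss:", poisson_nll(softplus(raw_output), count))
```

With ReLU the rate collapses to zero for any negative pre-activation, so the log term blows up for a nonzero count; softplus keeps the rate positive and the loss finite. In PyTorch the corresponding activation is torch.nn.functional.softplus.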
Using softplus actually worked, thank you!