Check attention_maps and raw_features as well. If they are not scalars and they do not participate in training (that is, they do not contribute to the loss and you use them purely as outputs), consider calling detach() on them too.
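A minimal sketch of what that could look like, assuming a PyTorch model whose forward pass returns attention_maps and raw_features alongside the value used for the loss. TinyModel, its layer sizes, and the saved_* names are hypothetical and only there to make the example run; your model will differ.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a model that returns extra outputs."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 16)
        self.attn = nn.Linear(16, 16)
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        raw_features = self.encoder(x)
        attention_maps = torch.softmax(self.attn(raw_features), dim=-1)
        logits = self.head(attention_maps * raw_features)
        return logits, attention_maps, raw_features


model = TinyModel()
x = torch.randn(4, 8)

logits, attention_maps, raw_features = model(x)
loss = logits.pow(2).mean()
loss.backward()

# The returned tensors are still attached to the autograd graph.
# detach() gives tensors that share the data but carry no graph,
# so storing them does not keep the graph alive.
saved_attention = attention_maps.detach()
saved_features = raw_features.detach()

# For a scalar such as the loss, item() returns a plain Python float.
loss_value = loss.item()
```

If the saved tensors are going to be kept for a long time, or the model runs on a GPU, calling .cpu() on the detached tensors as well may be worth considering.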