Mapping latent space in an efficient way

I have a trained GAN model and I’m in the process of exploring the latent space. I want to keep track of the latent vectors I’ve visited along with some corresponding analyses of the generator’s output at that location (e.g. the output itself, one type of analysis of the output, another type of analysis, etc.).

The way I’ve done this is by hashing the tensor and using the hash as the index of a Pandas dataframe, with columns for the generator’s output for that tensor and for each analysis. Whenever I calculate a new latent vector, I check whether its hash is already in the dataframe. If it is, I simply read the data stored at that index. If it is not, I feed the tensor to the generator, then add the output and analyses to the dataframe under that tensor’s hash.

The general idea is I don’t want to waste time re-generating and re-analyzing output from the model if I’ve already done so for a given tensor. Am I going about this in a smart way or is there a better method that I’m not aware of?

Also, a significant number of the latent vectors I’m calculating come from various types of interpolation (e.g. Tensor A to Tensor B and the n interpolated tensors between the two). In addition to maintaining a record of each latent vector I’ve visited, I’d like to keep track of paths, so that if I’ve already calculated the interpolation steps between two latent vectors, I can access the tensors that compose that path (in the correct order) without having to recalculate them. Would it make sense to maintain another dataframe that just holds lists of the hash values for sequences of tensors I’ve already visited? It feels like I should probably have some sort of graph-like data structure to keep track of these paths.
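
For illustration, here is a minimal sketch of the path bookkeeping described above. It assumes latent vectors are NumPy arrays (e.g. obtained via `tensor.detach().cpu().numpy()`), that interpolation is linear, and that a path is identified by its two endpoint hashes plus the number of interior steps. The names `latent_key`, `remember`, and `get_path` are hypothetical, not part of any library.

```python
import hashlib
import numpy as np

def latent_key(z):
    """Hash dtype, shape and raw bytes so identical vectors share a key."""
    z = np.ascontiguousarray(z)
    header = (str(z.dtype) + str(z.shape)).encode()
    return hashlib.sha1(header + z.tobytes()).hexdigest()

vectors = {}  # key -> latent vector
paths = {}    # (key_a, key_b, n_steps) -> ordered list of keys along the path

def remember(z):
    """Store a vector under its hash (if new) and return the hash."""
    key = latent_key(z)
    vectors.setdefault(key, z)
    return key

def get_path(a, b, n_steps):
    """Return the ordered vectors from a to b, interpolating only on a miss."""
    key_a, key_b = remember(a), remember(b)
    path_id = (key_a, key_b, n_steps)
    if path_id not in paths:
        # n_steps interior points, endpoints excluded (assumed linear interpolation)
        ts = np.linspace(0.0, 1.0, n_steps + 2)[1:-1].tolist()
        interior = [remember((1.0 - t) * a + t * b) for t in ts]
        paths[path_id] = [key_a, *interior, key_b]
    return [vectors[k] for k in paths[path_id]]

a = np.zeros(3, dtype=np.float32)
b = np.ones(3, dtype=np.float32)
first = get_path(a, b, 3)   # computed
second = get_path(a, b, 3)  # served from the stored path
```

Because each path is just an ordered list of keys into the same vector store, the `paths` dictionary already behaves like an edge list of a graph whose nodes are the visited latent vectors.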

What have other people done?

Thanks!

Some quick thoughts based on my understanding of your setup. These may be totally off, but hoping some of this helps:

  • I think hashing is going to prevent you from retaining any sense of the geometry of your latent space, specifically “closeness” between tensors. Retaining that information may be important if you’re trying to get meaningful efficiency gains. (As a total side note, pandas might not be the fastest tool for lookups, but that’s unlikely to matter for you: computing the output is likely orders of magnitude more costly, so that’s the main thing to optimize.)

  • Perhaps stating the obvious, but (I think, correct me if I’m wrong!) hashing is only going to help you if you’re likely to end up with the exact same tensor repeatedly. Depending on your specific problem, this may or may not be a good assumption. For example, if the seed to the GAN is a random scalar in some range, it’s unlikely you’ll end up with the exact same tensor twice; however, if there are categorical inputs or outputs in the network, you may hit the same outputs often enough.

  • While you can retain “paths” as you described, the above concern is compounded: you have to land on both the same Tensor A and the same Tensor B for the path to line up with whatever you’ve already calculated.

  • If the output / analysis you’re looking to do is chaotic and has no “local smoothness”, I can’t think of much improvement over your proposed method. The more chaotic the output function is (as a function of the tensor), the less information your pre-calculated values give you about a new, novel tensor.

  • However, to the extent that you can represent the output / analyses you’re looking to do as something that has local smoothness, you can think about an approach like the following.

    • Every time you run an analysis for a given tensor, save the pair of tensor and result to a list of “anchor” tensors.
    • Every time you’re looking to analyze a new tensor, check whether it’s closer than some distance threshold to any of the previously analyzed tensors. Here it will be important to pick a good distance measure: you can start with something like the L2 norm of the tensor difference, but that may not be right for your particular use case.
    • For a smooth enough function you can pick a distance threshold such that the output difference between your novel tensor and the anchor tensor would be negligible (in which case either use the cached output for the anchor tensor, or compute the change in output from the anchor, which presumably is faster since you already know the neighborhood you’re in).
    • If that would end up accumulating too many anchor tensors, such that the time spent sifting through the anchors would negate the pre-computing benefit, you can consider projecting onto a lower dimension and computing closeness there. I bet there are some clever algorithms to do this.
    • One complaint here might be that in a high-dimensional space it may become unlikely to land in the same “distance ball” twice. I think this depends on the true dimensionality of the space: the less orthogonal the dimensions really are (i.e. the more the tensors you actually visit are confined to a lower-dimensional region), the more likely it is that you will land in the same place.
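
The anchor idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: latent vectors are NumPy arrays, closeness is the L2 norm of the difference, and `compute` is a hypothetical stand-in for running the generator plus analyses. The class name `AnchorCache` and the threshold value are made up for the example.

```python
import numpy as np

class AnchorCache:
    """Reuse a stored result when a new vector is within `threshold` of an anchor."""

    def __init__(self, compute, threshold):
        self.compute = compute      # hypothetical: generator + analyses
        self.threshold = threshold  # assumed L2 distance cutoff
        self.anchors = []           # flattened latent vectors
        self.results = []           # result stored for each anchor
        self.misses = 0

    def get(self, z):
        flat = np.asarray(z, dtype=np.float64).ravel()
        if self.anchors:
            # Brute-force scan: L2 distance from z to every stored anchor
            dists = np.linalg.norm(np.stack(self.anchors) - flat, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= self.threshold:
                return self.results[nearest]
        # No anchor close enough: compute and register a new anchor
        self.misses += 1
        result = self.compute(z)
        self.anchors.append(flat)
        self.results.append(result)
        return result

cache = AnchorCache(compute=lambda z: float(np.sum(z)), threshold=0.1)
z = np.array([1.0, 2.0, 3.0])
cache.get(z)         # computed, becomes an anchor
cache.get(z + 0.01)  # within the threshold, reuses the anchor's result
cache.get(z + 1.0)   # too far away, computed as a new anchor
```

The scan here is linear in the number of anchors, which is the cost the bullet about "too many anchor tensors" refers to; a lower-dimensional projection or a spatial index would be the place to swap in something faster.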

Thanks for the thoughtful response.

I probably should have said that part of the initial process is mapping all the dataset items to their corresponding tensors, so it’s very likely that the user will be re-selecting and interpolating between some of the exact same tensors multiple times. Basically, I’m trying to reduce repeated calls to both the model generator and the subsequent analysis functions, on the assumption that retrieving that data from storage is faster than re-computing it, since I know users will first be selecting from known tensors. I may be mistaken about how efficient this storage method is, though.

In that case do you think maybe I should use the tensors for the dataset items as the “anchor” tensors you’re suggesting?

Also, what would you suggest instead of Pandas for storing the data?

In that case your suggested approach seems like it might do the trick!

Yep! But I don’t think I’m adding anything to your original suggestion.

With the caveat that I’m not a pandas expert, whenever possible I prefer to use Python’s built-in types, since an extra library can add overhead (though it may not). In this particular case, it sounds like you should just use a dictionary with hashes as the keys and your outputs as the values.
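
A minimal sketch of that dictionary approach, assuming the latent vectors are available as NumPy arrays (e.g. via `tensor.detach().cpu().numpy()`) and that `compute` is a hypothetical function wrapping the generator and analyses. The names `latent_key` and `LatentCache` are illustrative only.

```python
import hashlib
import numpy as np

def latent_key(z):
    """Hash dtype, shape and raw bytes so identical vectors share a key."""
    z = np.ascontiguousarray(z)
    h = hashlib.sha1()
    h.update(str(z.dtype).encode())
    h.update(str(z.shape).encode())
    h.update(z.tobytes())
    return h.hexdigest()

class LatentCache:
    """Plain dict keyed by the vector's hash; values are whatever compute returns."""

    def __init__(self, compute):
        self.compute = compute  # hypothetical: generator + analyses
        self.store = {}
        self.misses = 0

    def get(self, z):
        key = latent_key(z)
        if key not in self.store:
            # First visit: run the expensive step once and keep the result
            self.misses += 1
            self.store[key] = self.compute(z)
        return self.store[key]

def fake_compute(z):
    # Stand-in for the real generator and analyses (assumption for the demo)
    return {"output": z * 2, "mean": float(z.mean())}

cache = LatentCache(fake_compute)
z = np.arange(4, dtype=np.float32)
cache.get(z)         # computed
cache.get(z.copy())  # same bytes, so served from the dict
```

Note that this only hits when the vectors are bit-for-bit identical, which matches the case of users re-selecting known dataset tensors; vectors that differ by floating-point noise will hash differently.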