I have a two-stage UNet pipeline. The first stage outputs two stacks of image rounds:
- 8 rounds of odd-numbered images
- 8 rounds of even-numbered images
I then stack them on top of each other: x = torch.cat((odd, even)).
Then I pass x into the second stage UNet: out = network(x).
Now my intuition is that I need to reshuffle “out” because its rounds are still grouped (8 odds, then 8 evens), not interleaved (odd, even, odd, even, odd, even, odd, even).
But my colleague said the entire network and architecture will do the reshuffling, and I don’t need to do reshuffling myself.
Is this true? Why is that? He also mentioned the idea of the network being permutation invariant/agnostic, but doesn’t that contradict what he suggested here?
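For reference, here is a minimal sketch of the explicit reshuffling described above. It assumes the two stacks are concatenated along the channel dimension (dim=1) and that each stack has shape (batch, 8, H, W); the tiny sizes and the constant fill values are made up purely so the ordering is easy to read off. Note that torch.cat without a dim argument concatenates along dim 0, which would be the batch dimension for tensors of this shape.

```python
import torch

# Assumed toy shapes (hypothetical): each stack is (batch, 8, H, W).
batch, rounds, h, w = 2, 8, 4, 4

# Fill each channel with its round number so the ordering is visible.
odd = torch.arange(1, 16, 2).float().view(1, rounds, 1, 1).expand(batch, rounds, h, w)
even = torch.arange(2, 17, 2).float().view(1, rounds, 1, 1).expand(batch, rounds, h, w)

# Concatenating along the channels gives 8 odds followed by 8 evens.
x = torch.cat((odd, even), dim=1)  # (batch, 16, H, W)

# Interleaving: stack on a new axis right after the rounds, then merge the two axes.
interleaved = torch.stack((odd, even), dim=2).flatten(1, 2)  # (batch, 16, H, W)

# The same reordering applied to an already concatenated tensor via an index.
index = torch.arange(2 * rounds).view(2, rounds).t().reshape(-1)  # 0, 8, 1, 9, ...
reordered = x[:, index]

print(x[0, :, 0, 0].tolist())
print(interleaved[0, :, 0, 0].tolist())
```

The index-based version can also be applied to the output of the second stage if that output keeps the (8 odds, 8 evens) channel layout.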
I don’t know what “the entire network and architecture will do the reshuffling” means, but I would claim that the importance of reshuffling the data depends on how the actual batch is laid out.
I.e. if you are concatenating the images in the batch dimension, I don’t think the order matters as the batch will be processed as one input. On the other hand, if the concatenation is done in another dimension, you should check which layers “use” this dimension. E.g. if it’s the channel dimension, I would expect that conv kernels “learn” the order.
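The difference between the two cases can be checked with a small experiment. This is only a sketch: a single randomly initialized nn.Conv2d stands in for the UNet, and the tensor sizes and permutations are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single conv layer as a stand-in for the network (assumption, not the real UNet).
conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(4, 16, 8, 8)  # (batch, channels, H, W)

batch_perm = torch.tensor([2, 0, 3, 1])   # reorder the samples
channel_perm = torch.arange(16).flip(0)   # reverse the channel order

with torch.no_grad():
    out = conv(x)
    out_batch = conv(x[batch_perm])
    out_channel = conv(x[:, channel_perm])

# Permuting the batch only permutes the outputs in the same way.
batch_equivariant = torch.allclose(out_batch, out[batch_perm], atol=1e-5)

# Permuting the channels changes the result, since each kernel slice
# is tied to a specific input channel position.
channel_unchanged = torch.allclose(out_channel, out, atol=1e-5)

print(batch_equivariant, channel_unchanged)
```

With these settings the first check should hold and the second should fail, which matches the expectation that conv kernels depend on the channel order.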
Thank you for your reply!
Sorry for the confusion, but I mean inserting an image sequence in randomized order (e.g. [9,0,3,6,5,2,1,8,7,4], there is a sequence of 10 images here), into the UNet, and we expect the network to learn and re-group the image sequence into [0,1,2,3,4,5,6,7,8,9].
Having read many more articles since, I realize that this is termed permutation invariance.
Some suggest adding a regularization term that penalizes an incorrect order,
while others suggest adding a “permutation invariance” feature in the hidden layers.
(machine learning - How could we build a neural network that is invariant to permutations of the inputs? - Artificial Intelligence Stack Exchange)
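To make the term concrete, below is a rough sketch of one common way to get permutation invariance: apply the same weights to every image in the set and then pool over the set with a symmetric operation such as the mean. The class name SetPoolEncoder and all sizes are hypothetical and not taken from the linked answer. This only illustrates what invariance means (the output ignores the input order); it does not by itself sort a shuffled sequence back into order.

```python
import torch
import torch.nn as nn


class SetPoolEncoder(nn.Module):
    """Hypothetical module: shared per-image conv, then mean over the set."""

    def __init__(self, features=8):
        super().__init__()
        # The same conv weights are applied to every image in the sequence.
        self.shared = nn.Conv2d(1, features, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, n_images, H, W), each image is single-channel.
        b, n, h, w = x.shape
        per_image = self.shared(x.reshape(b * n, 1, h, w))
        per_image = torch.relu(per_image).view(b, n, -1, h, w)
        # The mean over the set dimension is symmetric, so order is ignored.
        return per_image.mean(dim=1)  # (batch, features, H, W)


torch.manual_seed(0)
encoder = SetPoolEncoder()
x = torch.randn(2, 10, 8, 8)
perm = torch.tensor([9, 0, 3, 6, 5, 2, 1, 8, 7, 4])

with torch.no_grad():
    out = encoder(x)
    out_shuffled = encoder(x[:, perm])

is_invariant = torch.allclose(out, out_shuffled, atol=1e-5)
print(is_invariant)
```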
Regarding “E.g. if it’s the channel dimension, I would expect that conv kernels “learn” the order.”
Ah, I see. To be specific, I am pretty much using a traditional UNet. Does that mean we expect that UNet Conv kernels would “learn” the order?
I’m unsure which dimension the sequence is in. Based on your description it seems as if you are dealing with a sequence of images so are you using a 3D-Unet with the temporal dimension (sequence) as the “depth”?
Sorry for the confusion!
I am using a 2D UNet whose channel dimension holds the bits (a 16-round sequence of on and off signals). Each image is a grayscale image.
Hence, the input has shape (batch=32, channels=16, *img_size),
and the model is UNet(in_channels=16, out_channels=16).
Thanks for the clarification. I’m not sure if I’m the best person to answer this question, so let’s also wait for others to chime in, but based on the last description I would expect the conv layers to depend on the channel ordering. If you are not keeping it consistent, I would claim the use case is comparable to randomly shuffling the color channels of input images and expecting the model to “reshuffle” them somehow.