# Shuffle a tensor along a certain dimension

Dear all,

I have a 4D tensor [batch_size, temporal_dimension, data, data]; the 3D tensor [temporal_dimension, data, data] is the actual input data to my network. I would like to shuffle the tensor along the second dimension, which is my temporal dimension, to check whether the network is learning something from the temporal dimension or not. I would be glad if this shuffling were reproducible.

Is there a certain way to do that?

Thanks.

If I understand your use case correctly, you would like to be able to revert the shuffling?
If so, this should work:

``````
import torch

# setup
N, M, K = 2, 5, 2
x = torch.arange(N*M*K).view(N, M, K)
print(x)
> tensor([[[ 0,  1],
[ 2,  3],
[ 4,  5],
[ 6,  7],
[ 8,  9]],

[[10, 11],
[12, 13],
[14, 15],
[16, 17],
[18, 19]]])

# shuffle
idx = torch.randperm(x.size(1))
y = x[:, idx]
print(y)
> tensor([[[ 8,  9],
[ 2,  3],
[ 0,  1],
[ 4,  5],
[ 6,  7]],

[[18, 19],
[12, 13],
[10, 11],
[14, 15],
[16, 17]]])

# reverse the shuffling
idx_inv = torch.sort(idx).indices
print(y[:, idx_inv])
> tensor([[[ 0,  1],
[ 2,  3],
[ 4,  5],
[ 6,  7],
[ 8,  9]],

[[10, 11],
[12, 13],
[14, 15],
[16, 17],
[18, 19]]])
``````

Hi @ptrblck ,

Thanks a lot for your response. I do not actually need to revert the shuffling.

I have a tensor coming out of my training_loader. It is 4D, of size `[batch_size, num_steps, data_0, data_1]`. What I want to do before feeding the data to the model is to shuffle it along my temporal dimension, which is `num_steps`. So I want to shuffle this 4D tensor along the 2nd dimension, `num_steps`, and then forward it to the model, just to check whether my model is learning something from the temporal dimension or not.

In that case the indexing with `idx` created by `randperm` should work and you could skip the last part. This would shuffle the `x` tensor in `dim1`.
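
As a minimal sketch of the same indexing on a 4D tensor (the sizes below are hypothetical and chosen only for illustration), the following checks that each `[data_0, data_1]` block is moved as a whole when `dim1` is shuffled:

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch_size, num_steps, data_0, data_1 = 2, 5, 3, 4
x = torch.arange(batch_size * num_steps * data_0 * data_1, dtype=torch.float32)
x = x.view(batch_size, num_steps, data_0, data_1)

# One permutation of the time steps, shared by every sample in the batch.
idx = torch.randperm(num_steps)
y = x[:, idx]  # equivalent to torch.index_select(x, 1, idx)

# Step j of y is step idx[j] of x, with its [data_0, data_1] block unchanged.
for j in range(num_steps):
    assert torch.equal(y[:, j], x[:, int(idx[j])])

print(y.shape)  # torch.Size([2, 5, 3, 4])
```

Note that this applies the same permutation to every sample in the batch; a different permutation per sample would need a different indexing scheme.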

Thanks a lot, @ptrblck .

Well, I think what you are doing should be exactly the same as:

``````
tensor = torch.arange(N*M*K).view(N, M, K)
dim = 1
idx = torch.randperm(tensor.shape[dim])
t_shuffled = tensor[:,idx]
``````

Am I right? I tried both your approach and this one; however, the output is not the same. `t_shuffled` is not the same as `y`.

I just need to be sure that when the data is shuffled along the `num_steps` dimension, the `data_0` and `data_1` values belonging to each step are moved together with it, so that each 3D block of data stays consistent.

Yes, the two snippets should be equivalent, as your code just renames some variables.
Could you post an example of the input data and the desired output, please?

Now, I have another, more general question. I am setting `shuffle=True` for my training data loader. Either way, what comes out of the training loader is a 4D tensor of size `[batch_size, num_steps, data_0, data_1]`. I need to shuffle this tensor along the 2nd dimension, as mentioned before. Then I unsqueeze it to add an extra dimension, so the tensor becomes `[batch_size, num_steps, 1, data_0, data_1]`. I then send this tensor to my model, which is a CNN, and iterate over `num_steps` step by step in the forward pass, so that the input to the CNN has just 1 channel. I stack the outputs of the steps so that the output of the forward pass is `[num_steps, batch_size, number_of_classes]`. For the loss, I likewise compute the loss for each step, sum up the losses, and then backpropagate. So, from my point of view, since I am calculating the loss for each step, the final loss should be the same whether I shuffle or not, and the parameters should be updated in the same way.

My aim is to show that the CNN gives the same final accuracy with and without shuffling, as it should not be keeping any kind of memory. That is why I need to shuffle and check. But I have now tried one approach, and I found that when I shuffled along `num_steps`, the final accuracy changed as well. Does that make any sense to you?

I’m not sure I understand the concern. Since you are creating the `idx` in both cases randomly, a different result would be expected.
If you rerun the code, you should see different results unless you seed the code.
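
One way to get a reproducible shuffle is to draw the permutation from a dedicated, seeded `torch.Generator`. This is only a sketch; the helper name `shuffle_steps` and the seed value are made up for illustration:

```python
import torch

def shuffle_steps(x, seed):
    """Shuffle dim 1 of x with a permutation determined only by `seed`."""
    g = torch.Generator()
    g.manual_seed(seed)
    idx = torch.randperm(x.size(1), generator=g)
    return x[:, idx], idx

x = torch.arange(2 * 5 * 2).view(2, 5, 2)
y1, idx1 = shuffle_steps(x, seed=0)
y2, idx2 = shuffle_steps(x, seed=0)

# The same seed yields the same permutation, hence the same shuffled tensor.
print(torch.equal(y1, y2))
```

Using a separate generator keeps the permutation independent of the global random state, so other random operations (for example weight initialization) do not change which permutation is drawn.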

Assuming you don’t have any randomness in the model (e.g. dropout) or any layers, which are using the shuffled dimension in a sequential manner, then your assumption might be correct. To verify it you could create a single ordered and shuffled batch, calculate the loss as well as the gradients, and compare both approaches.
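
A rough sketch of such a check is shown below. The tiny CNN, the input sizes, and the reuse of the same label at every step are all assumptions made for illustration and are not taken from the actual model in this thread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for the CNN: no dropout, no batch norm, no state across steps.
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 3),
)

def summed_loss(batch, target):
    # Feed one time step at a time as a single-channel image and sum the losses.
    losses = [F.cross_entropy(model(batch[:, t].unsqueeze(1)), target)
              for t in range(batch.size(1))]
    return torch.stack(losses).sum()

def loss_and_grads(batch, target):
    model.zero_grad()
    loss = summed_loss(batch, target)
    loss.backward()
    return loss.detach(), [p.grad.clone() for p in model.parameters()]

x = torch.randn(2, 5, 8, 8)      # [batch_size, num_steps, data_0, data_1]
target = torch.tensor([0, 2])    # assumed: one label per sample, reused at every step
idx = torch.randperm(x.size(1))

loss_a, grads_a = loss_and_grads(x, target)           # ordered batch
loss_b, grads_b = loss_and_grads(x[:, idx], target)   # shuffled batch

# Compare with a tolerance, since the summation order differs between the two runs.
print(torch.allclose(loss_a, loss_b, atol=1e-5))
print(all(torch.allclose(a, b, atol=1e-5) for a, b in zip(grads_a, grads_b)))
```

The comparison uses `torch.allclose` rather than exact equality because the per-step losses are summed in a different order in the two runs.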

I have to admit it will not be easy to figure out what I am trying to say.

Anyhow, if you can just confirm that the way I am doing the shuffling above is correct when I add the batch dimension to make the tensor 4D, as it is in my case, I will be glad to try it out and come back if I have further questions.

Hi @ptrblck,

I have now tried shuffling the data, and I figured out why the final loss coming out of the CNN is different when I apply the shuffling shown above. I found that even though the very first few samples in my batch are not shuffled, the floats in the loss are not exactly the same, and that is why the final loss differs and the weights are updated differently.

So here is the loss when no shuffle is applied:

``````
[214.9400634765625, 205.25238037109375, 204.9016571044922, 207.81027221679688, 204.00399780273438, 203.37063598632812]
``````

And here is how it looks when I apply the shuffle:

``````
[214.94007873535156, 205.2523956298828, 204.90162658691406, 207.81028747558594, 204.0040740966797, 203.37075805664062]
``````

You can see that the trailing decimal digits differ. This causes only a small difference in the loss at first, but epoch after epoch the differences accumulate, and accordingly the final accuracy differs.

Do you have any clue why this occurs?

Also, I see weird behavior when I use `torch.round` on the losses. The losses in the case where I apply no shuffle should be:

``````
[214.9400634765625, 205.25238037109375, 204.9016571044922, 207.81027221679688, 204.00399780273438, 203.37063598632812]
``````

When I apply rounding, the output comes out as:

``````
[215, 208, 211, 216, 212, 216, 209, 225, 207, 212, 220]
``````

You can see they are completely different from their float representation. I do not know what is going on. However, in this case the results with and without shuffling will be the same.

Thanks a lot, and I look forward to your response.

These small errors are most likely caused by the limited floating point precision and a different order of operation as seen here:

``````
import torch

x = torch.randn(10, 10, 10)
sum1 = x.sum()
sum2 = x.sum(0).sum(0).sum(0)
print((sum1 - sum2).abs().max())
> tensor(4.7684e-07)
``````

I cannot reproduce the rounding issue:

``````
import torch

x = torch.tensor([214.9400634765625, 205.25238037109375, 204.9016571044922, 207.81027221679688, 204.00399780273438, 203.37063598632812])
y = torch.round(x)
print(y)
> tensor([215., 205., 205., 208., 204., 203.])
``````

and get the expected results. Could you check, if my code snippet reproduces the issue in your setup?

Thanks a lot @ptrblck ,

> These small errors are most likely caused by the limited floating-point precision and a different order of operation

So, as far as I understand, this is normal and there is nothing we can do about it.

If you need more numerical precision, you could use `float64`, but note that the performance would be worse.
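
As a small illustration (a sketch that reuses the summation example from earlier in this thread; the helper name `order_gap` is made up), the gap between two reduction orders can be compared in `float32` and `float64`:

```python
import torch

torch.manual_seed(0)
x32 = torch.randn(10, 10, 10)   # default dtype is float32
x64 = x32.double()              # same values, stored as float64

def order_gap(x):
    # Same mathematical sum, computed with two different reduction orders.
    return (x.sum() - x.sum(0).sum(0).sum(0)).abs().item()

print(order_gap(x32))
print(order_gap(x64))
```

The `float64` gap is typically several orders of magnitude smaller, although it is not guaranteed to be exactly zero.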

Based on the new code snippet you are accumulating the loss so I’m not sure which loss value you are printing.
In any case, if the rounding issue is reproducible for you, please post an executable code snippet so that we can debug it.

I understand that @ptrblck has solutions to your technical problem in this thread. I wanted to ask you though: why do you think shuffling the temporal dimension is a viable thing to do? Do you assume that the instances are independent along the time dimension? If not, then shuffling is absolutely not the thing to do!

If your model has time dependence, then you do not want your model learning the past from the future. You need to take care in only training the model on timesteps that occur before the test data (in time). You could think of this similarly to text encoding, wherein you mask later words in a sentence because you don’t want later words to affect which words to predict next.

Does this make sense? (although if your data/model are time independent, disregard this entire post)

I only used it as a test to check that a model without a temporal dimension is not affected by shuffling, in contrast to a model with a temporal dimension. Nothing more.