Iterating over dataloader twice

Hello
Let’s say I have a dataloader (loader) which returns data of the following shape: [BatchSize, 1024].
I have 10000 samples and I want to build a score matrix of shape (10000, 10000) such that entry [i, j] denotes the score between sample i and sample j. So far I did it in the following way (setting batch_size = 1):

   import numpy
   from tqdm import tqdm

   d = numpy.zeros((10000, 10000))
   # compare every sample against every other sample
   for i, sample_i in tqdm(enumerate(loader)):
       for j, sample_j in enumerate(loader):
           d[i][j] = calculate_some_score(sample_i, sample_j)

But this takes a lot of time and I’m sure there is a better way to do it which takes less time.
Thanks
Best

You could try to increase the batch size of both DataLoaders and index d as d[i*bs:(i+1)*bs].
Also, do you need to compute the full matrix, i.e. is

calculate_some_score(a, b) != calculate_some_score(b, a)

?

Hi,
I didn’t get your answer. How does that compare each sample with every other possible sample in the dataset?
Also, the score calculation function is not commutative.
@ptrblck

That is what I was asking about, so thanks for the clarification. :slight_smile:

In that case, you could potentially speed it up with the batched approach.

Can you elaborate more on the batched approach?
Thanks

If your score function accepts batches of inputs, you could assign the batch of scores to d with range indexing: [i*bs:(i+1)*bs].
This would avoid looping sample by sample in both loops and would calculate the scores for multiple sample pairs at once.
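
A minimal sketch of that idea, assuming a hypothetical batch size bs = 100 for both loaders and that calculate_some_score accepts two [bs, 1024] batches and returns one score per aligned pair of rows (that return shape is an assumption):

   import numpy
   from tqdm import tqdm

   bs = 100  # assumed batch size of both DataLoaders
   d = numpy.zeros((10000, 10000))
   for i, batch_i in tqdm(enumerate(loader)):
       for j, batch_j in enumerate(loader):
           # scores has shape [bs]: one score per aligned pair of rows
           scores = calculate_some_score(batch_i, batch_j)
           rows = numpy.arange(i * bs, (i + 1) * bs)
           cols = numpy.arange(j * bs, (j + 1) * bs)
           d[rows, cols] = scores

Note that each assignment here fills one diagonal of a bs x bs block rather than the whole block.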

Yes, but that does not compare each sample with all samples in the dataset; it only compares each sample with the sample at the same position in the other batch, i.e. shifted by the batch size.

You are right. The outer loop could use a single sample (and expand it to match the batch size), while the inner loop could use batches; see the sketch after the listing.
This would calculate (left is the outer loop sample, right the inner loop samples):

[0] - [0]
[0] - [1]
[0] - [2]
# next batch
[0] - [3]
[0] - [4]
[0] - [5]
# next outer loop iteration
[1] - [0]
[1] - [1]
[1] - [2]
...
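
A minimal sketch of that scheme, assuming a hypothetical single_loader with batch_size=1 alongside the batched loader, a batch size bs that divides 10000, and the same elementwise calculate_some_score as above:

   import numpy
   from tqdm import tqdm

   bs = 100  # assumed batch size of the inner loader
   d = numpy.zeros((10000, 10000))
   for i, sample_i in tqdm(enumerate(single_loader)):   # sample_i: [1, 1024]
       # expand the single sample to match the inner batch size (no copy)
       expanded = sample_i.expand(bs, -1)               # [bs, 1024]
       for j, batch_j in enumerate(loader):             # batch_j: [bs, 1024]
           # one score per pair (i, j*bs + k), filling row i block by block
           d[i, j * bs:(j + 1) * bs] = calculate_some_score(expanded, batch_j)

Each inner iteration now scores bs pairs at once, so the inner loop runs 10000 / bs times instead of 10000.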