Hello
Let’s say I have a dataloader (loader) that returns data of shape [BatchSize, 1024].
I have 10000 samples and I want to build a score matrix of shape (10000, 10000) such that entry [i, j] denotes the score between sample i and sample j. So far I have done it in the following way (setting batchsize = 1):
import numpy
from tqdm import tqdm

d = numpy.zeros((10000, 10000))
for i, sample_i in tqdm(enumerate(loader)):
    for j, sample_j in tqdm(enumerate(loader)):
        d[i, j] = calculate_some_score(sample_i, sample_j)
But this takes a lot of time, and I’m sure there is a better way to do it that takes less time.
Thanks
Best
Hi,
I didn’t get your answer. How does that compare each sample with every other possible sample in the dataset?
Also, the score calculation function is not commutative. @ptrblck
If your loss function accepts batches of inputs, you could assign the batch of scores to d with range indexing: [i*bs:(i+1)*bs].
This would avoid looping one by one in both loops and could calculate the loss for multiple sample pairs at once.
Yes, but that does not compare each sample with all samples in the dataset; it only compares each sample with every other possible sample shifted by batchsize.
You are right. The outer loop could use a single sample (and expand it to a matching batch size), while the inner loop could use batches.
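As a rough sketch of that idea (not tested against your actual setup): the snippet below uses plain NumPy arrays instead of your loader, and batched_score is a hypothetical stand-in for a score function that accepts two [B, D] batches and returns B scores. It is deliberately not symmetric, since your score is not commutative. With torch tensors, the broadcasting step would presumably be done with expand instead.

```python
import numpy as np

def batched_score(left, right):
    # Hypothetical batched score: left and right have shape [B, D],
    # the result has shape [B]. Not symmetric in its arguments on purpose.
    return (left * right).sum(axis=1) + left.sum(axis=1)

def score_matrix(data, bs):
    n = data.shape[0]
    d = np.zeros((n, n))
    for i in range(n):                   # outer loop: a single sample
        for start in range(0, n, bs):    # inner loop: batches of samples
            batch = data[start:start + bs]
            # Expand the single outer sample to the inner batch size.
            left = np.broadcast_to(data[i], batch.shape)
            # Fill row i, columns covered by this batch.
            d[i, start:start + batch.shape[0]] = batched_score(left, batch)
    return d

rng = np.random.default_rng(0)
data = rng.standard_normal((10, 4))      # 10 toy samples instead of 10000
d = score_matrix(data, bs=3)
```

Every (i, j) pair is still visited, in that order of arguments, but the score function is called roughly N * N / bs times instead of N * N times.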
This would calculate (left is outer loop sample, right inner loop samples):