I often come across tasks where we need to draw a bunch of distractors for each of the samples in a batch, e.g. word2vec and similar.

One way to do this would be to draw the distractors from the rest of the dataset, with the batch masked off.

Another way is simply to shuffle the entire dataset at the start of an epoch and draw batches from that as the distractors. If there are more than a few thousand examples, the frequency of collisions with the actual examples will be low.
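To put a rough number on "low", here is a small sketch that counts how often a shuffled index lands on its own position. The helper name `mean_collisions` and its parameters are made up for illustration, and it assumes each example is paired with the distractor at the same position in the shuffled order.

```python
import numpy as np

def mean_collisions(num_examples, num_trials=200, seed=0):
    """Average number of positions where a shuffled index equals its own position."""
    rng = np.random.default_rng(seed)
    idx = np.arange(num_examples)
    counts = [
        int((rng.permutation(num_examples) == idx).sum())
        for _ in range(num_trials)
    ]
    return float(np.mean(counts))

# Hypothetical dataset size, chosen only for illustration.
print(mean_collisions(5000))
```

The expected number of such collisions in a uniform shuffle is 1 regardless of dataset size, so the per-example collision rate is about 1/N.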

Both of these are approximations. Thinking this through, it seems like a ‘correct’ approach would be something like the following:

In a world without distractors, let’s say we want to shuffle our dataset and there are 3 elements. Then there are:

`3 * 2 * 1`

… ways of arranging the dataset.

In the case where we want to shuffle it as a set of distractors, no element may stay in its own position, so (at least for these 3 elements) we essentially reduce the number of available choices at each position by 1, and there will be:

`2 * 1 * (1)`

… ways of arranging the distractors.

Sanity check, with elements 0, 1, 2 (an arrangement is ruled out if any element stays in its own position):

```
# 0 * * CANNOT
# 1 0 2 CANNOT
# 1 2 0
# 2 0 1
# 2 1 * CANNOT
```

= 2 ways.
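As a cross-check of the hand enumeration, a brute-force count is easy to sketch. The function name `count_derangements` is just illustrative ("derangement" is the usual name for a permutation in which no element stays in its own position). The loop prints the brute-force count next to the product `(n-1) * (n-2) * … * 1` for a few small `n`.

```python
from itertools import permutations
from math import factorial

def count_derangements(n):
    """Count arrangements of range(n) in which no element stays in its own position."""
    return sum(
        all(p[i] != i for i in range(n))
        for p in permutations(range(n))
    )

for n in range(2, 7):
    # Columns: n, brute-force count, the per-position product (n-1)!
    print(n, count_derangements(n), factorial(n - 1))
```

For n = 3 both columns give 2, matching the table above. From n = 4 onward they differ (9 vs 6, then 44 vs 24), so the "one fewer choice at each position" product seems to be exact only for very small datasets and undercounts the valid arrangements after that.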

So, to shuffle a set of distractors, we can:

Instead of drawing e.g. `np.random.choice(3, 3, replace=False)`, draw `np.random.choice(2, 2, replace=False)`, and then do some magic to rearrange this into the distractor indices. However, it then becomes a bit non-obvious to me [edit: how to implement this in an efficient fashion], since the later possible choices depend on the earlier choices.
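For what it's worth, here are two possible ways to sketch this, not necessarily the standard answer. The first just reshuffles until no index sits in its own position (rejection sampling), which is uniform over all valid arrangements. The second is Sattolo's algorithm, a Fisher-Yates variant that never swaps an element with itself; it is uniform over the `(n-1)!` single-cycle permutations, which are all valid but are only a subset of the valid arrangements. The function names are made up for illustration.

```python
import numpy as np

def random_derangement(n, rng):
    """Rejection sampling: reshuffle until no index maps to itself."""
    if n < 2:
        raise ValueError("need at least 2 elements")
    idx = np.arange(n)
    while True:
        perm = rng.permutation(n)
        if not np.any(perm == idx):
            return perm

def random_cyclic_permutation(n, rng):
    """Sattolo's algorithm: a uniform single-cycle permutation (no fixed points)."""
    if n < 2:
        raise ValueError("need at least 2 elements")
    perm = np.arange(n)
    for i in range(n - 1, 0, -1):
        j = rng.integers(0, i)  # j is drawn from [0, i-1], never i itself
        perm[i], perm[j] = perm[j], perm[i]
    return perm

rng = np.random.default_rng(0)
distractor_idxes = random_derangement(8, rng)  # distractor for sample i is distractor_idxes[i]
```

The fraction of plain shuffles with no fixed point tends to 1/e, so the rejection loop needs about 2.7 shuffles on average however large the dataset is. Sattolo's version is a single pass but uses a Python loop, so which one is faster in practice probably depends on `n`.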

Thoughts? Standard approaches?