Why does the ChatBot tutorial reverse sort the pairs in the batch?

I was going through the chatbot tutorial and saw the following:

    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)

Why does the tutorial sort the batch in reverse order (longest input first)? Is there a good reason?


    # Returns all items for a given batch of pairs
    def batch2TrainData(voc, pair_batch):
        # sort pairs by the number of tokens in the input sentence, longest first
        pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
        #print( [len(pair[0].split(" ")) for pair in pair_batch] )
        input_batch, output_batch = [], []
        # separate the pairs (for the current batch) into X and Y values
        for pair in pair_batch:
            input_batch.append(pair[0])
            output_batch.append(pair[1])
        # returns a matrix of (max_length,batch_size) corresponding to the feature vectors X
        inp, lengths = inputVar(input_batch, voc)
        output, mask, max_target_len = outputVar(output_batch, voc)
        return inp, lengths, output, mask, max_target_len

I just found out that the packing documentation says:

For unsorted sequences, use enforce_sorted = False. If enforce_sorted is True, the sequences should be sorted by length in a decreasing order, i.e. input[:,0] should be the longest sequence, and input[:,B-1] the shortest one. enforce_sorted = True is only necessary for ONNX export.
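To make the two options in that passage concrete, here is a minimal sketch on a made-up toy batch (the tensors and variable names are my own, not from the tutorial). It packs the same unsorted batch once with enforce_sorted=False and once after sorting by hand, and checks that the packed data comes out the same:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy padded batch of shape (max_len, batch_size), deliberately NOT sorted by length.
# Column 0 has length 2, column 1 has length 4, column 2 has length 3.
lengths = torch.tensor([2, 4, 3])
padded = torch.tensor([
    [1.0, 4.0, 8.0],
    [2.0, 5.0, 9.0],
    [0.0, 6.0, 10.0],
    [0.0, 7.0, 0.0],
])

# Option A: let PyTorch sort internally.
packed_auto = pack_padded_sequence(padded, lengths, enforce_sorted=False)

# Option B: sort by decreasing length ourselves, then pack with enforce_sorted=True.
sorted_lengths, order = torch.sort(lengths, descending=True)
packed_manual = pack_padded_sequence(padded[:, order], sorted_lengths, enforce_sorted=True)

# Both routes yield the same packed data and batch sizes.
same_data = torch.equal(packed_auto.data, packed_manual.data)
same_sizes = torch.equal(packed_auto.batch_sizes, packed_manual.batch_sizes)

# Unpacking option A gives the batch back in its original (unsorted) order.
unpacked, unpacked_lengths = pad_packed_sequence(packed_auto)
```

One practical difference: with option A the unpacked output comes back in the original batch order, whereas with option B you have to keep track of the permutation yourself if the targets need to line up.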

So now the question is whether I should sort the batch myself or let PyTorch do the sorting internally. Which approach is the recommended or standard one?



So I decided to do the sorting during data processing, in the Dataset/DataLoader, because these allow num_workers > 0, which probably lets the work run in parallel across worker processes and speeds things up.
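A rough sketch of what that could look like, assuming the sentences have already been converted to token ids. ToyPairs and sort_collate are hypothetical names I made up for illustration, and padding with 0 is an assumption. The sort lives in a collate_fn, so the DataLoader runs it for each batch:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class ToyPairs(Dataset):
    """Hypothetical dataset of (input_ids, target_ids) pairs with varying lengths."""
    def __init__(self):
        self.pairs = [
            ([1, 2], [7, 8, 9]),
            ([3, 4, 5, 6], [7]),
            ([1, 2, 3], [8, 9]),
        ]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        return torch.tensor(src), torch.tensor(tgt)

def sort_collate(batch):
    # Sort the pairs by input length, longest first (same idea as the tutorial's sort).
    batch = sorted(batch, key=lambda pair: len(pair[0]), reverse=True)
    inputs, targets = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in inputs])
    # pad_sequence returns tensors of shape (max_len, batch_size) by default.
    padded_inputs = pad_sequence(list(inputs), padding_value=0)
    padded_targets = pad_sequence(list(targets), padding_value=0)
    return padded_inputs, lengths, padded_targets

# num_workers=0 keeps the sketch simple; a value > 0 would run collation in worker processes.
loader = DataLoader(ToyPairs(), batch_size=3, shuffle=False,
                    collate_fn=sort_collate, num_workers=0)
padded_inputs, lengths, padded_targets = next(iter(loader))
```

Because the targets are reordered together with the inputs inside sort_collate, the pairs stay aligned after sorting.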

Obviously I’d need to benchmark to check which approach is fastest, but that doesn’t sound like it’s worth my time. So I’ll just go with my gut on what seems like a fine solution.

Feel free to contribute a suggestion!

cross posted: https://qr.ae/TWnbg9

The only thing I’m not sure about is whether to sort as the ChatBot tutorial suggests (i.e. in plain Python) or whether there is some PyTorch way to do it that is better or more efficient.

For now I will sort in Python directly and not use PyTorch for it.