How do I match samples with their predictions when doing inference with DistributedSampler?

Hello,
I have trained a torch model for NLP tasks and would like to perform some inference using a multi-GPU machine (in this case with two GPUs).
Inside the processing code, I use this:

from torch.utils.data import TensorDataset, DataLoader, DistributedSampler

dataset = TensorDataset(encoded_dict['input_ids'], encoded_dict['attention_mask'])
# Each rank/GPU gets a disjoint, evenly sized slice of the dataset.
sampler = DistributedSampler(
    dataset, num_replicas=args.nodes * args.gpus, rank=args.node_rank * args.gpus + gpu_number, shuffle=False
)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

For those familiar with NLP, encoded_dict is the output from the tokenizer.batch_encode_plus function where the tokenizer is an instance of transformers.BertTokenizer.

The issue I’m having is that each GPU runs predictions (i.e. inference) on a subset of the full dataset and saves its predictions separately; for example, with a dataset of 1000 samples to predict, each GPU predicts 500 of them. As a result, I have no way of knowing which of the 1000 samples were predicted by which GPU, since their order is not preserved, so the model predictions are meaningless: I cannot trace each of them back to its input sample.

I have tried saving the dataloader instance (as a pickle) together with the predictions and then extracting the input_ids via dataloader.dataset.tensors; however, this requires a tokenizer decoding step which I’d rather avoid, as the tokenizer will have slightly changed the text (for example, double whitespaces are removed, words with dashes are split, and so on).
What is the cleanest way to save the input text samples together with their predictions when doing inference in distributed mode?


Thanks for pointing out this issue!

If you have some sort of identifier for your input (i.e. input_ids as you mentioned), can you build a mapping of {input_id: prediction} on each distributed rank? After inference on N inputs, this will be a dict of size N on each rank, and you can use all_gather_object (torch.distributed.distributed_c10d — PyTorch 1.9.0 documentation) to get the entire input → prediction mapping across the world. Would something like this satisfy your use case?
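
A minimal sketch of that approach, assuming the process group has already been initialised with torch.distributed.init_process_group and that predict_batch is a placeholder for your model's forward pass:

import torch
import torch.distributed as dist

# Per-rank mapping from an input identifier to its prediction.
local_results = {}
with torch.no_grad():
    for input_ids, attention_mask in dataloader:
        preds = predict_batch(input_ids, attention_mask)  # placeholder for your model call
        for ids, pred in zip(input_ids, preds):
            # Any hashable identifier works; here, the tuple of token ids.
            # Move predictions to CPU before gathering Python objects.
            local_results[tuple(ids.tolist())] = pred.cpu()

# Collect every rank's mapping onto all ranks.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, local_results)

# Merge into a single mapping covering the whole dataset.
all_results = {}
for partial in gathered:
    all_results.update(partial)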


Hi @rvarm1
Many thanks for your reply. I think your solution would work; the only complication is that input_ids is not really the best identifier, as it is the tokenised input text (each sentence, which is a Python string, is tokenised, i.e. converted into a list of integers), whereas I’d like to keep the original string as the id. That is complicated because the splitting of the data across the GPUs happens AFTER the tokenisation.
However, I’ve done some further debugging and I’ve noticed that the data are not actually randomly split across GPUs as I thought. If I set shuffle=False in the DistributedSampler then this happens:
in the case of two GPUs, GPU 0 and GPU 1, all the samples with even index (starting from 0) will be passed to GPU 0, and all those with odd index will be passed to GPU 1.
So for example, if you have 10 samples, whose indices are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], then samples 0, 2, 4, 6, 8 will go to GPU 0 and samples 1, 3, 5, 7, 9 will go to GPU 1. This allows me to map the predictions back to the original text samples just by using this ordering (see the sketch below). I’m not sure this is the neatest solution, as keeping the original text string next to its prediction would be ideal, but at least it works.
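
A minimal sketch of that reconstruction, assuming each rank saved its predictions as a Python list in the order its dataloader produced them (predictions_per_rank below is a hypothetical list of those lists, ordered by rank):

# With shuffle=False, DistributedSampler gives rank r the dataset indices
# r, r + world_size, r + 2 * world_size, ...
world_size = 2
total = sum(len(preds) for preds in predictions_per_rank)
ordered_predictions = [None] * total
for rank, preds in enumerate(predictions_per_rank):
    for position, pred in enumerate(preds):
        ordered_predictions[rank + position * world_size] = pred
# ordered_predictions[i] now lines up with the i-th input text in the original dataset order.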

N.B. Special case: as the two GPUs must be passed the SAME number of inputs, if the number of inputs is odd, for example 9 samples with indices [0, 1, 2, 3, 4, 5, 6, 7, 8], then GPU 0 will be passed samples 0, 2, 4, 6, 8 and GPU 1 will be passed samples 1, 3, 5, 7, 0 (in this exact order). In other words, the first sample (index 0) is repeated at the very end of the dataset so that each GPU receives the same number of samples, in which case we can write some code that drops the last prediction from GPU 1, as it is redundant.
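
Extending the sketch above to that special case: if the true number of samples is known, the padded duplicates can be dropped by skipping any reconstructed index that falls past it (num_samples below is a stand-in for the real dataset size, 9 in this example).

world_size = 2
ordered_predictions = [None] * num_samples
for rank, preds in enumerate(predictions_per_rank):
    for position, pred in enumerate(preds):
        original_index = rank + position * world_size
        # Anything past the true dataset size is a sample that DistributedSampler
        # repeated to even out the ranks; skip it.
        if original_index < num_samples:
            ordered_predictions[original_index] = pred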