Hi, I am training two models in a GAN-like fashion: one is trained while the other stays frozen. Model A's input is model B's output. Both A and B are auto-regressive models, so their forward method differs slightly from inference, because I use teacher forcing during forward.
Now, when I train model A, I need to run inference with B. I am trying to build an efficient data pipeline on top of Dataset and DataLoader. Should I use the approach below?
class MyDataset(Dataset):
    def __init__(self, model_B):
        # __getitem__ only receives an index, so model_B has to be stored here
        self.model_B = model_B
        ...
    def __len__(self):
        ...
    def __getitem__(self, idx):
        batch = ...  # load the raw sample for idx
        return self.model_B.inference(batch)
train_loader = DataLoader(MyDataset(model_B), ...)
Every couple of epochs I will update model_B, so the data cannot be prepared in advance. I measured the inference time: it takes about 2 seconds per batch, and I have about 10,000 batches for training. Is the approach above efficient? By efficient, I mean: does it allow parallel processing and prefetching, i.e. can model_B keep running inference while model_A is being trained?
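To make concrete the overlap I am hoping for, here is a minimal sketch in plain Python (no torch, and `model_b_inference` is a hypothetical stand-in for `model_B.inference`): a background thread keeps producing model_B outputs into a bounded queue while the main thread consumes them to train model_A.

```python
import queue
import threading

def model_b_inference(batch):
    # Hypothetical stand-in for model_B.inference; in the real pipeline this
    # would run the frozen auto-regressive model (e.g. on the GPU).
    return [x * 2 for x in batch]

def producer(batches, out_q):
    # Runs model_B inference ahead of the training loop; the bounded queue
    # blocks when full, so at most a few batches are kept in flight.
    for batch in batches:
        out_q.put(model_b_inference(batch))
    out_q.put(None)  # sentinel: no more batches

def train_with_prefetch(batches, max_prefetch=4):
    out_q = queue.Queue(maxsize=max_prefetch)
    t = threading.Thread(target=producer, args=(batches, out_q), daemon=True)
    t.start()
    consumed = []
    while True:
        item = out_q.get()
        if item is None:
            break
        # ... train model_A on `item` here ...
        consumed.append(item)
    t.join()
    return consumed
```

Is this kind of producer/consumer overlap what the DataLoader would give me with `num_workers > 0`, or would I have to build it myself like this (e.g. because model_B lives on the GPU and DataLoader workers are separate processes)?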
Thank you very much!