How to speed up inference?

Because the dataset is large, testing takes a lot of time. How can I speed this process up?
Main code:

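import os

import torch
import torch.backends.cudnn as cudnn
from torch.utils import data
from torchvision.models import resnet50  # assumed source; the post does not show this import

# Other helpers such as parse_args() and MyDataSet are omitted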

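def validate(valloader, model):
    scores = []
    model.eval()  # inference mode for dropout and batch-norm layers
    with torch.no_grad():
        for i_iter, batch in enumerate(valloader):
            images, labels = batch
            images = images.cuda(non_blocking=True)

            preds = model(images)
            # Probability of the last class for every sample in the batch
            outputs = torch.nn.functional.softmax(preds, dim=1)[:, -1]
            # Assumed: the original never filled `scores`, so collect the results on the CPU
            scores.extend(outputs.cpu().tolist())
    return scores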

if __name__ == '__main__':
    args = parse_args()
    print("Calling with args:")

    os.environ["CUDA_VISIBLE_DEVICES"] = args.gpus
    cudnn.enabled = True
    cudnn.benchmark = True
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.enabled = True

    # Define dataset
    rdataset = MyDataSet(dataset='test', size=224, mode='both')
    pdataset = MyDataSet(dataset='val', size=224, mode='both')
    rloader = data.DataLoader(rdataset, batch_size=args.batch_size, shuffle=False,
                                  num_workers=1)  # , pin_memory=True)
    ploader = data.DataLoader(pdataset, batch_size=args.batch_size, shuffle=False,
                                num_workers=1)  # , pin_memory=True)

    # Define model
    num_classes = 2
    model = resnet50(pretrained=False, num_classes=num_classes)

    prefix = args.model.split('/')
    pattern = prefix[-1]
    prefix = '/'.join(prefix[:-1])

    all_files = os.listdir(prefix)
    all_models = [os.path.join(prefix, f) for f in all_files if pattern in f]

    new_state_dict = model.state_dict().copy()

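    for i in range(len(all_models)):
        try:
            old_state_dict = torch.load(all_models[i])
            new_state_dict = dict()
            for k, v in old_state_dict.items():
                # Drop the first 7 characters of each key (presumably the
                # "module." prefix that nn.DataParallel adds)
                new_state_dict[k[7:]] = v
            model.load_state_dict(new_state_dict)
        except Exception:
            print("Fail when loading model %s" % all_models[i])
            continue
        model.cuda()

        rscores = validate(rloader, model)
        pscores = validate(ploader, model)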

        # Other subprocess

I measured the time spent in each part:

dataloader fetching the next batch:
about 20s

model forward pass:
about 0.7s

copy tensor to cpu and get results:
about 0.0001s

It seems that the dataloader takes too much time. How can I improve it?

Try increasing the value of num_workers in the DataLoader (e.g. num_workers=8). More subprocesses will then load data in parallel, which reduces data loading time.
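To check whether a larger num_workers actually helps, it may be worth timing how long the loop waits for each batch. The helper below is only a sketch: the name mean_batch_fetch_time and its max_batches parameter are made up for illustration, and it works on any iterable, so a DataLoader can be passed in directly.

```python
import time
from typing import Iterable


def mean_batch_fetch_time(loader: Iterable, max_batches: int = 20) -> float:
    """Return the average number of seconds spent waiting for a batch.

    Only the first `max_batches` batches are timed, so the check stays
    quick even on a large dataset.
    """
    fetch_times = []
    iterator = iter(loader)
    for _ in range(max_batches):
        start = time.perf_counter()
        try:
            next(iterator)  # blocks until the next batch is ready
        except StopIteration:
            break  # the loader had fewer than max_batches batches
        fetch_times.append(time.perf_counter() - start)
    if not fetch_times:
        raise ValueError("loader yielded no batches")
    return sum(fetch_times) / len(fetch_times)
```

With the code from the question, you could build the loader several times, for example with num_workers set to 1, 4 and 8, call mean_batch_fetch_time(loader) on each, and keep the setting with the lowest value. Note that the first batch also includes the worker start-up time, so averaging over several batches gives a fairer number.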

Thank you for your advice. How do I find a suitable value for num_workers? Is it related to the number of CPU cores?

Ideally, a suitable value for num_workers is the smallest one that gives batch loading time <= inference time. That way, while the model is running inference on the previous batch, the data loader can finish reading the next batch in the meantime.

However, the largest useful value of num_workers also depends on the available CPU resources, so we might not always be able to reach that ideal number.
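As a rough illustration of the rule above, the sketch below estimates a starting value from two measured times. It assumes that loading time shrinks in proportion to the number of workers, which is an idealization (disk and memory bandwidth usually get in the way), and the function name suggest_num_workers and its parameters are hypothetical.

```python
import math
import os
from typing import Optional


def suggest_num_workers(load_time_one_worker: float,
                        inference_time: float,
                        max_workers: Optional[int] = None) -> int:
    """Estimate the smallest num_workers with loading time <= inference time.

    Assumes loading time per batch scales as load_time_one_worker / num_workers.
    The result is capped at `max_workers` (default: the CPU count).
    """
    if load_time_one_worker <= 0 or inference_time <= 0:
        raise ValueError("both times must be positive")
    if max_workers is None:
        max_workers = os.cpu_count() or 1  # cpu_count() may return None
    needed = math.ceil(load_time_one_worker / inference_time)
    return max(1, min(needed, max_workers))
```

With the timings reported earlier in the thread (about 20s per batch with one worker, about 0.7s for the forward pass), the uncapped estimate is ceil(20 / 0.7) = 29 workers, so on most machines the CPU cap would be the binding limit. Treat the result as a starting point and confirm it by measuring.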


In the latter case, would it be good to just set it to the number of CPU cores?