Different results when preloading data to CUDA (GPU) in the data pipeline and when using cycle to make an infinite iterator

I’m a PyTorch newbie.

Here is my data loading function:

def get_data_loader(root, transform, batch_size):
    """Load data."""
    # (the signature and call arguments were cut off in the post; filled in minimally)
    data = torchvision.datasets.ImageFolder(root, transform=transform)

    data_loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)

    # tag 1
    # move every batch to the GPU up front
    data_loader = [[x[0].to(device), x[1].to(device)] for x in data_loader]
    # tag 2
    # wrap the batches into an infinite iterator
    data_loader = cycle(data_loader)

    print("n_sample: %d, n_class: %d" % (len(data), len(data.classes)))
    return data_loader, data.classes

And this is my training iteration code:

    for i in range(20000):
        # train iteration
        optimizer = lr_schedule.inv_lr_scheduler(param_lr, optimizer, i, **schedule_param)

        # tag 3
        if use_cycle:
            inputs_source, labels_source = next(iter_source)
        else:
            # not using cycle: re-create the iterator after every full pass
            if i % len_train_source == 0:
                iter_source = iter(data_loader_source)

            inputs_source, labels_source = next(iter_source)

        # tag 4
        inputs_source, inputs_target = Variable(inputs_source).cuda(), Variable(inputs_target).cuda()

I just use cycle to make the iteration infinite (tag 2, tag 3), but I get different results depending on use_cycle: it decreases my score.

Also, I preload the data to the GPU in get_data_loader (tag 1) instead of moving each batch to the GPU inside the training loop (tag 4). This boosts efficiency and reduces the running time, but again I get different results (a lower score).

I looked at some tutorial code: cycle is rarely used to make an infinite iterator, and nobody loads the data to the GPU outside the training iteration.

Waiting for some pointers, thanks.

Are you sure the difference in your score is due to the usage of cycle?
How much do the results differ and how often have you compared them?
You would have to set the seed and disable non-deterministic behavior to compare both approaches.
See the reproducibility notes for more information.
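
A minimal sketch of such a setup (set_seed is just an illustrative helper name, and the exact calls may vary with your PyTorch version) could look like this:

    import random
    import numpy as np
    import torch

    def set_seed(seed=0):
        # seed every RNG that can influence training
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # trade speed for deterministic cuDNN kernels
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False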

Usually you won’t use cycle as one whole DataLoader iteration equals one epoch in training.
This makes it easy to track some stats using epochs.
Also it’s pretty uncommon to push the data onto the GPU before training, as this will fill up the GPU leaving less space for your model and training. Usually you try to keep the memory usage as low as possible to be able to fit large models.
Also, when using a DataLoader with multiple workers, the next batches can be loaded and preprocessed while the GPU is busy with the training. If your GPU workload is large enough, this will mask the (CPU) loading times.
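
As a rough illustration of that workflow (dataset, model, criterion, optimizer, device, and num_epochs are placeholders), the usual pattern looks something like:

    loader = torch.utils.data.DataLoader(
        dataset, batch_size=32, shuffle=True,
        num_workers=4, pin_memory=True)  # workers prefetch batches on the CPU

    for epoch in range(num_epochs):
        for inputs, labels in loader:    # one full pass over the loader == one epoch
            # only the current batch is moved to the GPU
            inputs = inputs.to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)

            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()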

Thanks @ptrblck! Your answer is very instructive!

My work may be a little complex. I am doing a transfer learning task, and the results do differ when I use pre_cuda (preloading the data to the GPU in the load function) and use_cycle (making an infinite cycle over the batches). I also added some code from the reproducibility notes:

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

but this code does not help me.

I am doing a transfer learning task and use loss.DAN() to compute a distance between the source features and the target features as a transfer loss. This transfer loss is added to the training loss to form the total loss for backpropagation.
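
Roughly, the idea is something like the sketch below (a simplified linear-kernel MMD just for illustration, not the actual loss.DAN implementation; features_source, features_target, and classifier_loss are assumed to exist):

    def mmd_linear(features_source, features_target):
        # simplified MMD: squared distance between the two batch mean embeddings
        delta = features_source.mean(dim=0) - features_target.mean(dim=0)
        return (delta * delta).sum()

    transfer_loss = mmd_linear(features_source, features_target)
    total_loss = classifier_loss + transfer_loss   # classification loss + transfer loss
    total_loss.backward()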

I have not tested this problem on a plain deep training task, but my confusion is why the runs are different; maybe, as you answered, it is caused by the non-deterministic CUDA implementation.

And here is my test code:

def transfer_classification():
    # data config
    root_source = ARGS.path_data + ARGS.field_source + "/images"
    root_target = ARGS.path_data + ARGS.field_target + "/images"
    transform = transform224c()
    batch_size = ARGS.batch_size

    # load data
    # (call arguments assumed; they were cut off in the post)
    data_loader_source, classes_source = get_data_loader(root_source, transform, batch_size)
    data_loader_target, classes_target = get_data_loader(root_target, transform, batch_size)

    # net
    base_network, bottleneck_layer, classifier_layer = construct_net()

    # deep loss
    criterion = nn.CrossEntropyLoss()

    # training params
    parameter_list = [{"params": bottleneck_layer.parameters(), "lr": 10},
                      {"params": classifier_layer.parameters(), "lr": 10}]

    # optimizer
    optimizer = optim.SGD(parameter_list, lr=1.0, momentum=0.9, weight_decay=0.0005, nesterov=True)
    param_lr_list = []
    for param_group in optimizer.param_groups:
        # record the initial lr of each param group for the scheduler (loop body assumed)
        param_lr_list.append(param_group["lr"])

    ## train
    # for test data efficient
    if ARGS.pre_cuda:
        data_loader_source = [[x[0].to(device), x[1].to(device)] for x in data_loader_source]
        data_loader_target = [[x[0].to(device), x[1].to(device)] for x in data_loader_target]

    len_train_source = len(data_loader_source)
    len_train_target = len(data_loader_target)

    if ARGS.use_cycle:
        iter_source = cycle(data_loader_source)
        iter_target = cycle(data_loader_target)

    for i in range(ARGS.iteration):
        # optimizer
        optimizer = lr_schedule.inv_lr_scheduler(
            param_lr_list, optimizer, i, init_lr=0.0003, gamma=0.0003, power=0.75)

        if not ARGS.use_cycle:
            if i % len_train_source == 0:
                iter_source = iter(data_loader_source)
            if i % len_train_target == 0:
                iter_target = iter(data_loader_target)

        inputs_source, labels_source = next(iter_source)
        inputs_target, labels_target = next(iter_target)

        if not ARGS.pre_cuda:
            inputs_source, labels_source = inputs_source.to(device), labels_source.to(device)
            inputs_target, labels_target = inputs_target.to(device), labels_target.to(device)

        inputs = torch.cat((inputs_source, inputs_target), dim=0)

        features = base_network(inputs)
        features = bottleneck_layer(features)
        outputs = classifier_layer(features)

        classifier_loss = criterion(outputs.narrow(0, 0, batch_size), labels_source)
        classifier_loss_target = criterion(outputs.narrow(0, batch_size, batch_size), labels_target)

        transfer_loss = loss.DAN(features.narrow(0, 0, batch_size),
                                 features.narrow(0, batch_size, batch_size))

        total_loss = classifier_loss + transfer_loss

        # backward pass and parameter update (this step was cut off in the pasted code)
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

        loss_logger.update(classifier_loss, classifier_loss_target, transfer_loss, total_loss)

        if i % 100 == 0:
            pass  # evaluation code omitted here

        # log loss
        if i % 20 == 0:  # print every NUM mini-batches
            losses = loss_logger.get_avg()

            logger.info("[%6d] source_loss: %.3f, target_loss: %.3f, transfer_loss: %6.3f, total_loss: %6.3f" %
                        (i, losses[0], losses[1], losses[2], losses[3]))

            writer.add_scalars(
                "loss",  # tensorboardX writer and tag name assumed; the call was cut off in the post
                {'source': losses[0], 'target': losses[1], 'transfer': losses[2], 'total': losses[3]}, i)

I paste my results plot below; it only contains the training loss on the source data.

cu: pre_cuda, cy: use_cycle, rc: random seed NOT used

The pre_cuda and use_cycle runs are different from the others.

I have not pasted my TensorBoard results because they trigger an error and I have not found a solution:

W1111 23:44:09.107531 Reloader tf_logging.py:120] Detected out of order event.step likely caused by a TensorFlow restart. Purging 500 expired tensor events from Tensorboard display between the previous step: 9980 (timestamp: 1541875220.4483728) and current step: 0 (timestamp: 1541875229.00469).
W1111 23:44:09.149514 Reloader tf_logging.py:120] Detected out of order event.step likely caused by a TensorFlow restart. Purging 281 expired tensor events from Tensorboard display between the previous step: 5600 (timestamp: 1541876346.2499464) and current step: 0 (timestamp: 1541905892.754689). 

For now I will sidestep this problem by not using pre_cuda and use_cycle.

anyway, thank you for your help!

I solved the tensorboardX "detected out of order event.step" problem by just adding a RUN_TIME_FLAG suffix to the scalar keys, like:

            writer.add_scalars(
                "loss",  # tensorboardX writer and tag name assumed; the call was cut off in the post
                {'source' + RUN_TIME_FLAG: losses[0],
                 'target' + RUN_TIME_FLAG: losses[1],
                 'transfer' + RUN_TIME_FLAG: losses[2],
                 'total' + RUN_TIME_FLAG: losses[3],
                 'advs' + RUN_TIME_FLAG: losses[4]}, i)

I tried the suggestions from the issue "Detected out of order event.step when adding text after scalars with global_step #6", but neither solution worked.