Finished cifar10 and have some questions about mechanics and data loading


As always, as part of my first post I thank the developers for this amazing library that helps a lot of us in our deep learning escapades. I’ve finished running through the first tutorial involving the CIFAR10 dataset and have some questions.


In this code block,

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)

Could someone explain in detail what is going on here? These might be more Python questions than PyTorch questions, but I think it’s crucial to understand what is happening here. I understand what’s happening at an abstract level but not at the code level. In particular,

  1. net is an object, so what is net(inputs) calling? It’s not a constructor, so I’m not sure what’s happening here. Also, this returns the output, but where is this function (if it is a function at all) defined, and what does it do?

  2. Where do we call the forward function that was defined as part of the model class? I’m guessing this has something to do with the previous question.

  3. Similar to the first point: where is criterion(outputs, labels) defined? I checked the docs for CrossEntropyLoss and it’s a class that only takes weights and size_average in the constructor.

In the prediction code block,

    outputs = net(Variable(images))
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted == labels).sum()

This code (net(images)) is similar to the training stage, so I’m not sure how we are “testing” because we don’t have a testing mode. For example, in Keras we use model.fit for training and model.evaluate for testing, and I’m not seeing a similar distinction here.

EDIT-1: I got the answers to the above questions from the Learning PyTorch with Examples tutorial. It all happens through the __call__ method in Python.


  1. Can I get a small dataset from the dataloader for overfitting before I get the whole thing? I’m guessing I could just run the for loop till train_loader[:small_number], any thoughts?

  2. The dataloader only provides train and test, how would I get a validation set out of this?

  3. We print out loss.data[0]; does it contain the loss for the entire mini-batch? Could I get some pointers on how to keep track of the loss history for entire epochs (for plotting purposes)?

  4. If I want to use GPU, do I have to call the .cuda() function in every place where I have Variables and instantiation of my models? Or is there some global param I can set that automatically makes all the Variables and instantiated net into cuda compatible objects?

  5. Why is saving only the state_dict recommended over saving the whole model, since the latter can be used to save the entire model including architecture and params?

  6. The normalize method in transform takes a list of 2 tuples representing the desired mean and stddev for each of the color channels. Is that calculated within that particular set? How would I normalize the test set with the training set mean and stddev?

  7. Can I add to the post category list or is it strictly confined to the 4 that are defined?

I apologize for a whole lot of questions, most of them born out of ignorance and I’m sure I’ll have more as I start using pytorch for my problems. If I need to split them up into separate posts, please let me know and I’ll edit the post accordingly.

Thanks and I appreciate everyone’s help!


No, these are great questions.

questions 1 & 2
In Python, there are several special methods which user-defined classes can override to allow certain kinds of operations on the class or instances.

They’re all surrounded by double underscores (so they’re called “dunder” methods):

  • __init__ is one of them, which defines the constructor;
  • __str__ is another one – what you implement there defines what Python will do if you call str(obj).
  • The __call__ dunder method defines what Python will do if you call
    an instance of the class as if it were a function.

In PyTorch, the __call__ method of nn.Module instances sets up user-defined hooks if they exist, then calls the instance’s forward method. So calling net(var) is the same thing as calling net.__call__(var), which will itself call net.forward(var) to perform the actual forward pass.
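To make the dispatch concrete, here is a minimal pure-Python sketch. TinyModule and Doubler are hypothetical stand-ins, not PyTorch classes, and the real nn.Module.__call__ does more (the hooks mentioned above, for instance); the sketch only shows how calling an instance ends up in forward.

```python
class TinyModule:
    """Hypothetical stand-in for nn.Module: only the call-to-forward dispatch."""

    def __call__(self, *args, **kwargs):
        # Calling an instance like a function lands here, which delegates to forward.
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError("subclasses define the actual computation")


class Doubler(TinyModule):
    def forward(self, x):
        return 2 * x


net = Doubler()
result_call = net(21)             # goes through __call__, then forward
result_forward = net.forward(21)  # calls forward directly
```

Both calls give the same result here; with a real nn.Module you would still normally write net(inputs) so that any registered hooks run.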

question 3
The same thing is happening here.
nn.CrossEntropyLoss is a class whose constructor takes weights and size_average, but instances of nn.CrossEntropyLoss, including criterion, define a forward method which is called from nn.Module's __call__ method.

The distinction between training and testing modes is actually implemented similarly to Keras, but in PyTorch it’s done with a pair of methods that change the state of the model: model.train() sets the model to training mode while model.eval() sets it to test mode.

Other questions:

  1. Yes, that should work.
  2. I think that means CIFAR doesn’t natively have a validation set? If so you can always split the train set further.
  3. Yes, loss.data[0] has the average loss for the minibatch. You can just keep appending it to a list to keep track of the losses for the whole epoch; make sure you use loss.data[0] rather than just loss because the temporary buffers for the graph won’t be freed if you keep around a bunch of loss variables.
  4. Yes, you have to call .cuda() on your model and your input data. You shouldn’t have to call it anywhere else. If you’re creating Variables in your model’s forward pass, you should use expressions like Variable(...) built from an existing tensor, so that the new variable is created on the same device (CPU/GPU) as the existing variable.
  5. The latter saves the entire model using Python’s pickling, which is a very precarious way to save complicated custom classes. Basically, it doesn’t actually save the model’s structure, just the names of the classes that built it, so changing your model’s code can lead to weird and unpredictable behavior of loaded pickles while with load_state_dict you know you’re only saving and loading the params.
  6. Not a computer vision person; I don’t know.
  7. If the site lets you add to the list, I think you should go ahead.
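On question 6: transforms.Normalize only applies the mean and standard deviation it is given; it does not compute them from the data. One way to normalize the test set with training-set statistics is therefore to compute them once on the training images and reuse them. A minimal sketch on synthetic tensors follows; channel_stats and normalize are hypothetical helpers, and the (N, C, H, W) layout is an assumption.

```python
import torch


def channel_stats(images):
    """Per-channel mean and std of a float tensor shaped (N, C, H, W)."""
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3))
    return mean, std


def normalize(images, mean, std):
    # Broadcast the (C,) statistics over the (N, C, H, W) batch.
    return (images - mean[None, :, None, None]) / std[None, :, None, None]


torch.manual_seed(0)
train_images = torch.rand(64, 3, 8, 8)  # synthetic stand-in for training images
test_images = torch.rand(16, 3, 8, 8)   # synthetic stand-in for test images

mean, std = channel_stats(train_images)        # statistics from the training set only
train_norm = normalize(train_images, mean, std)
test_norm = normalize(test_images, mean, std)  # same statistics reused for the test set
```

With torchvision, the equivalent would be passing the same mean and std tuples to transforms.Normalize in both the training and the test transform.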


Thank you so much for your replies, I appreciate it. They cleared up a lot of stuff. I still have some questions, so maybe other people can pop into the conversation.

Validation set:

The following code is how we load the CIFAR10 dataset. For test, we just set train=False

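from os.path import expanduser
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
train_set = datasets.CIFAR10(root=expanduser(
    '~/learning/cifar10-data'), train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=4,
                          shuffle=True, num_workers=2)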

My initial intuition was just to set train_loader = train_loader[:small_number] but I got an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'DataLoader' object is not subscriptable
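Since a DataLoader is an iterable rather than a sequence, one way to take only the first few batches is itertools.islice. A minimal sketch, where first_batches is a hypothetical helper and a plain list stands in for the real loader:

```python
from itertools import islice


def first_batches(loader, n_batches):
    """Return an iterator over at most n_batches items of any iterable loader."""
    return islice(loader, n_batches)


# Stand-in for a DataLoader: any iterable of (inputs, labels) batches works.
fake_loader = [([i, i + 1], [0, 1]) for i in range(10)]

small = list(first_batches(fake_loader, 3))
```

With shuffle=True the first few batches differ from epoch to epoch, so to overfit a fixed handful of samples you would also need shuffle=False or a fixed subset of the dataset.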

Then I thought I could mess with the train_set directly but I got another error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sudarshan/anaconda3/envs/torch/lib/python3.6/site-packages/torchvision-0.1.7-py3.6.egg/torchvision/datasets/", line 89, in __getitem__
  File "/home/sudarshan/anaconda3/envs/torch/lib/python3.6/site-packages/numpy/core/", line 550, in transpose
    return _wrapfunc(a, 'transpose', axes)
  File "/home/sudarshan/anaconda3/envs/torch/lib/python3.6/site-packages/numpy/core/", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: axes don't match array

Both of these objects have a len function.


So I’m not sure how to get a validation set out of this.

Train/Test mode:

According to the docs, eval has an effect only on dropout and batch norm, which makes sense since their behaviour differs during testing as opposed to training. Further, we don’t explicitly call model.train() or model.eval() when the testing is happening in the prediction code block.

So where is this flag being set, and how do we know it’s not training again? I can think of two reasons why this works, but I’m not sure which one is right:

  1. While loading the test_set, the train flag is set to False. Since testing is done on the test_loader (which was instantiated using the test_set), the mode was already set to “test” and testing automatically happened.

  2. We just didn’t calculate the loss and the gradients, and didn’t update the weights through the optimization step. Take a look at the code blocks for training and testing:

Training block:

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

Test block:

    outputs = net(images)
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted == labels).sum()

During testing (aka prediction) we don’t compute the loss, run the backward pass, or call optimizer.step(), which would mean we are just getting the class labels. So by omitting those steps we do the prediction? This makes sense to me after thinking about it, but it would be helpful if I could get confirmation that this is in fact what is happening.

So the end-of-the-day question is: let’s say we have loaded our dataset using standard numpy techniques and converted it into torch Tensors, so we have (X_train, y_train, X_test, y_test). How do we specify that X_test, y_test are used for testing as opposed to training (which would just be calculating the loss and its gradients and updating the weights)?


Train/test mode is something like this:

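# train loop/function
net.train()  # set the training-mode flag before iterating
for (images, labels) in train_loader:
    ...  # train code

# test loop or function
net.eval()  # set the evaluation-mode flag before iterating
for (images, labels) in test_loader:
    outputs = net(images)  # test code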

So, you set the flags before you iterate over the corresponding data loader. You can wrap them in functions, which lets you measure your performance on the validation set after every n training iterations.

Thanks, but I don’t see that flag explicitly set in the examples shown in the tutorials here. Does that mean when prediction is happening in that example it is still training mode?

The eval function changes behaviour of dropout (no nodes are dropped) and batchnorm (use global statistics rather than batch statistics), during testing. This is different from how they behave during training.
For all other operations/layers, the train and test outputs are the same.
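A small sketch of the dropout part of this, using nn.Dropout on a vector of ones. The size and the seed are arbitrary choices for illustration; in training mode roughly half of the entries are zeroed and the survivors are scaled by 1 / (1 - p), while in eval mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()         # training mode: entries are zeroed at random
train_out = drop(x)

drop.eval()          # evaluation mode: dropout acts as the identity
eval_out = drop(x)

n_zeroed = int((train_out == 0).sum())
```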

The network in the example you linked does not have these layers, which is why I suspect they did not call the eval function. For that network, calling the train and eval functions won’t affect the model output. For models which have dropout/batchnorm layers, it’s imperative that you call the eval function before testing.

If the dataset is reasonably simple, you can split the dataset like so:
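One possible way to do such a split, not necessarily the one meant above, is torch.utils.data.random_split. This is a minimal sketch on a synthetic TensorDataset; the sizes, batch size and variable names are placeholders, and it assumes a PyTorch version that ships random_split.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)
# Synthetic stand-in for a training set: 100 samples, 10 classes.
full_train = TensorDataset(torch.randn(100, 3, 8, 8), torch.randint(0, 10, (100,)))

n_val = 20
# Randomly partition the indices into two non-overlapping subsets.
train_subset, val_subset = random_split(full_train, [len(full_train) - n_val, n_val])

train_loader = DataLoader(train_subset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_subset, batch_size=16, shuffle=False)
```

The same call should work on a torchvision dataset such as CIFAR10, since it only needs the dataset’s length and indexing.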

Best regards



You are right. I believe I’ve referenced this in my previous posts as well. Unfortunately, this still doesn’t answer my question of how the system knows whether I’m training or testing. Is it just that I don’t call loss.backward(), don’t propagate the gradients, and don’t update the weights?


Thank you for this! This is exactly what I’ve been looking for!
This looks great, but I have one question. It is my understanding that when we validate the model we just use the entire dataset instead of going mini-batch by mini-batch. If we decide to do that, do you just keep track of the running loss and running accuracy for each of the validation set’s mini-batches and average them over the number of mini-batches? Or just use directly for prediction?

Hello @shaun,

the idea of using a validation set is that whatever you plan to do to the test dataset would work for the validation set as well (really, you would use the test set’s DataLoader for the val_dataset).

For example, in the MNIST example’s test function, you can see how the function sums the loss and the correct guesses and then computes the average after the loop (for very large validation sets, you would need to watch for overflow etc.).

As such, my suggestion would be to feed it to a dataloader that works similarly to the test one (for the MNIST example, in fact, you could make test_loader a parameter to the test function and feed it the val_loader). That would also do the model.eval() call mentioned earlier.

For “mass testing” I would expect that - like most examples I have seen - you would use minibatches as well if you have a sizeable validation set and aggregate the accuracy. If your validation set happens to fit in memory, you could pass batch_size = len(val_ds) to the validation DataLoader constructor.
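A sketch of that aggregation, assuming PyTorch 0.4 or later (torch.no_grad and .item() are not in the Variable-era API used earlier in this thread). The evaluate helper, the toy linear model and the synthetic data are all made up for illustration. Each batch loss is weighted by the batch size, so the result does not depend on how the set is cut into mini-batches.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, criterion):
    """Return (average loss per sample, accuracy) over all batches in loader."""
    model.eval()                # dropout/batchnorm switch to inference behaviour
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():       # no graph is built, so nothing is kept for backward
        for inputs, labels in loader:
            outputs = model(inputs)
            # criterion returns the batch mean; weight it by the batch size
            total_loss += criterion(outputs, labels).item() * labels.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return total_loss / total, correct / total


torch.manual_seed(0)
val_ds = TensorDataset(torch.randn(50, 4), torch.randint(0, 3, (50,)))
model = nn.Linear(4, 3)         # toy stand-in for a real network
criterion = nn.CrossEntropyLoss()

loss_batched, acc_batched = evaluate(model, DataLoader(val_ds, batch_size=8), criterion)
loss_full, acc_full = evaluate(model, DataLoader(val_ds, batch_size=len(val_ds)), criterion)
```

Passing batch_size=len(val_ds) reproduces the "whole set at once" variant, and both calls should agree up to floating-point error.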

Personally, I would prefer to keep the workflow with Dataset->DataLoader->Validation as that is scalable and looks like an efficient use of my time, but you should certainly just do whatever works for you.

Regarding the distinction between test and train: You would call model.eval() for validation and testing, that’s how the model knows.
One thing that changes is that the forward pass’s info will not be kept around to save memory. Many things will not depend on it (so people might leave it out) in terms of numerical results of the forward pass, but for example for models using dropout in the usual fashion (as opposed to Yarin Gal and collaborators), the model needs to know whether you are training or testing in the forward pass before it knows whether you use backward.

Hope this helps, even if it’s just my very limited take on something that ultimately boils down to style preferences and I cannot vouch for my expertise in that.

Best regards



This works well. I believe something like this should be part of PyTorch, at least as an example.

Might be a bit late to the party, but to create a train/val/test split (of 45k/5k/10k) in CIFAR10, this should work :

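from copy import deepcopy

import torch
import torchvision

# Note: train_data/train_labels are the attribute names in older torchvision
# releases; newer releases expose them as data/targets instead.
n_train = 45000
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)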

# create a validation set
valset = deepcopy(trainset)
trainset.train_data = trainset.train_data[:n_train]
trainset.train_labels = trainset.train_labels[:n_train]

valset.train_data = valset.train_data[n_train:]
valset.train_labels = valset.train_labels[n_train:]

# create a test set
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)

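trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
valloader = torch.utils.data.DataLoader(valset, batch_size=128, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(testset, batch_size=100, shuffle=False, num_workers=2)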