DistributedSampler for validation set in ImageNet example

churchillmic · January 22, 2019, 4:41am

The ImageNet example uses a DistributedSampler for the training loader, but not for the validation loader. This appears to make every rank process the entire validation set. Is this necessary, or could a DistributedSampler also be used for the validation loader, so that the validation work is split across the nodes?

anil_batra · March 3, 2019, 7:04am

Hi @churchillmic,
I have the same question. Were you able to find an answer to this?

Thanks
Anil

churchillmic · March 5, 2019, 3:15am

I found a couple of examples where a DistributedSampler is used for the validation or test set. I’m still not sure why the official ImageNet example doesn’t use one; it still seems wasteful to me. Here are a few of the examples:

github.com

huggingface/pytorch-pretrained-BERT/blob/c9fd3505678d581388fb44ba1d79ac41e8fb28a4/examples/extract_features.py

github.com

Jongchan/Pytorch-Horovod-Examples/blob/master/examples/cifar100/main_horovod.py

pietern · March 5, 2019, 4:23pm

It is not necessary to have every rank process the entire validation set. You can use a distributed sampler and average the errors afterwards to achieve the same result.
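
As a rough sketch of the "combine afterwards" step: each rank evaluates only its own shard, then the ranks sum their raw counts with `torch.distributed.all_reduce`. The helper name `aggregate_accuracy` and its parameters are made up for illustration and are not part of the ImageNet example. Summing counts instead of averaging per-rank accuracies is a choice made here so that ranks with different shard sizes are weighted correctly.

```python
import torch
import torch.distributed as dist


def aggregate_accuracy(num_correct: int, num_seen: int, device: str = "cpu") -> float:
    """Combine per-rank counts into one global accuracy (hypothetical helper).

    num_correct / num_seen are the counts this rank collected on its shard.
    """
    counts = torch.tensor([num_correct, num_seen], dtype=torch.long, device=device)
    # Only communicate when a process group has been initialized; in a plain
    # single-process run the local counts are already the global counts.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    correct, seen = counts.tolist()
    return correct / seen
```

With the NCCL backend the tensor would need to live on the rank's GPU, which is why `device` is left as a parameter. The validation loader itself would be built with something like `DistributedSampler(val_dataset, shuffle=False)`.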

Hzzone · June 15, 2020, 5:49am

Actually, you cannot simply use a DistributedSampler for validation. If you look at the DistributedSampler source, you will see that it adds extra samples to the dataset to make it evenly divisible across ranks. Therefore, if your dataset is very small, the final result may be different. The official implementation is correct.
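
To make the padding effect concrete, here is a small pure-Python sketch that mimics how the sampler assigns indices when shuffling is off and the tail is not dropped. It is a re-implementation for illustration only, not the library code, and the function names `shard_indices` and `duplicated_samples` are invented for this example.

```python
import math
from collections import Counter


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would evaluate, with wrap-around padding."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by wrapping around to the start of the dataset so that every
    # rank receives exactly per_rank indices.
    padded = [i % dataset_len for i in range(total)]
    # Each rank takes every world_size-th index, starting at its own rank.
    return padded[rank:total:world_size]


def duplicated_samples(dataset_len, world_size):
    """Dataset indices that are evaluated by more than one rank."""
    seen = Counter()
    for rank in range(world_size):
        seen.update(shard_indices(dataset_len, world_size, rank))
    return sorted(i for i, n in seen.items() if n > 1)
```

For example, with 10 samples and 4 ranks, `duplicated_samples(10, 4)` returns `[0, 1]`: twelve evaluations are performed and samples 0 and 1 are counted twice, so a metric summed over all ranks is slightly skewed. When the dataset size is a multiple of the number of ranks, as in `duplicated_samples(12, 4)`, the list is empty and the sharded result matches the full evaluation.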