Using a BatchSampler in a multi-GPU scenario

Hello, I have a piece of code that uses a torch.utils.data.DataLoader with a custom BatchSampler to build batches containing the same number of samples from each class.
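For reference, here is a minimal sketch of the kind of class-balanced batch sampler I mean (the class and parameter names are only illustrative, not my exact code):

import random
from collections import defaultdict
from torch.utils.data import Sampler

class BalancedBatchSampler(Sampler):
    """Yields batches of dataset indices with the same number of samples per class."""

    def __init__(self, labels, classes_per_batch, samples_per_class):
        self.labels = labels
        self.classes_per_batch = classes_per_batch
        self.samples_per_class = samples_per_class
        # Group dataset indices by their class label.
        self.indices_by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            self.indices_by_class[label].append(idx)

    def __len__(self):
        return len(self.labels) // (self.classes_per_batch * self.samples_per_class)

    def __iter__(self):
        classes = list(self.indices_by_class.keys())
        for _ in range(len(self)):
            batch = []
            for cls in random.sample(classes, self.classes_per_batch):
                # Sample with replacement so small classes never raise an error.
                batch.extend(random.choices(self.indices_by_class[cls], k=self.samples_per_class))
            yield batch

It is passed to the DataLoader through the batch_sampler argument rather than sampler, e.g. DataLoader(dataset, batch_sampler=BalancedBatchSampler(labels, 4, 8), num_workers=num_workers), which is why a plain DistributedSampler does not cover my case.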

I’m trying to use it in a multi-GPU scenario with the NeMo framework. By default, when running in multi-GPU mode, the data loader setup should be something like this:

if self._placement == DeviceType.AllGpu:
    sampler = torch.utils.data.distributed.DistributedSampler(self._dataset)

self._dataloader = torch.utils.data.DataLoader(
    dataset=self._dataset,
    sampler=sampler,
    num_workers=num_workers,
)

I’ve found some tricks for implementing a custom distributed sampler, but none of them work for a custom distributed batch sampler. What can I do?
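To make it concrete what I mean by a distributed batch sampler, the kind of wrapper I have in mind looks roughly like this (only a sketch; the class name and the slicing scheme are my own, not a NeMo or PyTorch API):

import torch.distributed as dist

class DistributedBatchSamplerWrapper:
    """Wraps a batch sampler and keeps only this rank's slice of every batch."""

    def __init__(self, batch_sampler, num_replicas=None, rank=None):
        self.batch_sampler = batch_sampler
        self.num_replicas = num_replicas if num_replicas is not None else dist.get_world_size()
        self.rank = rank if rank is not None else dist.get_rank()

    def __iter__(self):
        for batch in self.batch_sampler:
            # Each rank takes a disjoint, interleaved slice of the global batch.
            yield batch[self.rank::self.num_replicas]

    def __len__(self):
        return len(self.batch_sampler)

As far as I can tell, the interleaved slice only keeps the per-class balance when samples_per_class is a multiple of the number of replicas, and I don’t see how to plug something like this into the NeMo data layer setup shown above.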

@VitalyFedyunin for data loader questions