Help with sampling training data for a multi-label classification task

mderakhshani · September 24, 2017, 8:38pm

Hi there!
I have a dataset in which each sample has multiple labels, and my goal is to train a multi-label classifier on it. There are 2 possible labels, A and B, but each sample can have 19 * 19 * 5 labels, assigned according to some conditions.
Within a sample the labels are not balanced, and over the whole dataset the ratio of label A to label B is 1000:1. It seems there are not enough instances of label B.
The accuracy is 99.97 %, but the recall is only around 18 %, which is an effect of this imbalance. One way to address it is a weighted loss function, but that is not a satisfactory solution. I have decided to use oversampling instead, that is, to select samples with a high frequency of label B more often. Could you please tell me how to implement this kind of sampling with a sampler in PyTorch, i.e. sampling based on the frequency of label B in each sample?

Thanks

devansh20la · October 16, 2017, 3:57am

You can build your own sampler that takes a permutation of the indices of the class with higher frequency while oversampling the indices of the class with lower frequency.
For example:
class_vector = [0,0,0,0,0,0,0,1,1,1]
class_count = {0: 7, 1: 3}
Build a permutation of the indices 0-6 with
np.random.permutation(7) and an oversampling vector for class 1 with
np.random.randint(7, 10, 7) # generates 7 integers drawn from 7, 8, 9 (the upper bound is exclusive)
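
The index construction in this example can be sketched as a small helper. This is only a minimal sketch, assuming NumPy and a flat vector with one class label per sample; the function name oversampled_indices is hypothetical and not part of any library. It derives the class counts from the vector itself, shuffles the indices of the most frequent class, and re-draws the indices of every other class with replacement until the classes are the same size.

```python
import numpy as np


def oversampled_indices(class_vector, seed=None):
    """Return shuffled dataset indices in which every class appears as
    often as the most frequent class (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(class_vector)
    classes, counts = np.unique(labels, return_counts=True)
    target = int(counts.max())  # size of the most frequent class

    parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)  # positions of this class
        if count == target:
            # Majority class: use every index exactly once, in random order.
            parts.append(rng.permutation(idx))
        else:
            # Minority class: draw with replacement up to the majority size.
            parts.append(rng.choice(idx, size=target, replace=True))

    # Mix the classes so that batches are not grouped by class.
    return rng.permutation(np.concatenate(parts))


indices = oversampled_indices([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], seed=0)
```

The resulting index array could be returned from the `__iter__` of a custom sampler, or used to index the dataset directly.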

If you do not want data to repeat, you can use stratified sampling instead.
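
One possible reading of stratified sampling here is to draw from each class separately and without replacement, so that no index occurs twice. The sketch below assumes NumPy and a flat vector with one class label per sample; the helper name stratified_indices and the per_class parameter are hypothetical. It takes the same number of indices from every class, capped at the class size, which is only one of several ways to allocate the draws.

```python
import numpy as np


def stratified_indices(class_vector, per_class, seed=None):
    """Draw up to `per_class` distinct indices from each class
    (hypothetical helper, equal allocation per class)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(class_vector)

    picked = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # positions of this class
        take = min(per_class, len(idx))      # cannot take more than exist
        picked.append(rng.choice(idx, size=take, replace=False))

    # Shuffle so that the classes are interleaved.
    return rng.permutation(np.concatenate(picked))


subset = stratified_indices([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], per_class=3, seed=0)
```

With per_class set to the size of the rarest class, this amounts to undersampling the frequent class rather than repeating the rare one.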

helpful links:

https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py

http://pytorch.org/tutorials/beginner/data_loading_tutorial.html#
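
For the case in the question, where every sample carries a 19 * 19 * 5 grid of labels, the WeightedRandomSampler defined in the sampler.py linked above may be enough without writing a custom sampler. The following is a rough sketch under several assumptions: the data is a random toy stand-in, label B is encoded as 1 and label A as 0, and the weight of a sample is its label-B count plus one. That weighting is an illustrative choice, not something prescribed by PyTorch or by this thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy stand-in data (assumption): 100 samples, each with a 5 x 19 x 19 grid
# of binary labels, where 1 marks the rare label B and 0 marks label A.
features = torch.randn(100, 8)
labels = (torch.rand(100, 5, 19, 19) < 0.001).long()

# Number of label-B entries in each sample.
b_counts = labels.flatten(start_dim=1).sum(dim=1).float()

# Sampling weight grows with the label-B count; the +1 keeps samples
# without any label B drawable. This weighting is an illustrative choice.
weights = b_counts + 1.0

# Draw as many indices per epoch as there are samples, with replacement,
# so samples rich in label B are repeated more often.
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=16, sampler=sampler)

for batch_features, batch_labels in loader:
    pass  # the training step would go here
```

Because whole samples are drawn, the label A entries inside each repeated sample are repeated too, so the per-label ratio improves only as far as the label-B-rich samples allow.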