Hi,
I am trying to concatenate datasets in such a way that the result is also able to return the file path of each sample.
Hi,
I wrote a simple demo for you using plain tensor data; you can modify it to fit your needs.
import torch


class custom_dataset1(torch.utils.data.Dataset):
    def __init__(self):
        super(custom_dataset1, self).__init__()
        self.tensor_data = torch.tensor([1., 2., 3., 4., 5.])

    def __getitem__(self, index):
        return self.tensor_data[index], index

    def __len__(self):
        return len(self.tensor_data)


class custom_dataset2(torch.utils.data.Dataset):
    def __init__(self):
        super(custom_dataset2, self).__init__()
        self.tensor_data = torch.tensor([6., 7., 8., 9., 10.])

    def __getitem__(self, index):
        return self.tensor_data[index], index

    def __len__(self):
        return len(self.tensor_data)


dataset1 = custom_dataset1()
dataset2 = custom_dataset2()
concat_dataset = torch.utils.data.ConcatDataset([dataset1, dataset2])

value, index = next(iter(concat_dataset))
print(value, index)
You can change index into the path and then use the corresponding loss function.
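For example, a minimal sketch of a dataset that returns the file path alongside the sample; the folder layout, PIL-based loading, and transform handling here are my assumptions, not part of the original post:

import os
import torch
from PIL import Image

class PathDataset(torch.utils.data.Dataset):
    """Returns (image, path) pairs; `root` is a hypothetical image folder."""
    def __init__(self, root, transform=None):
        self.paths = sorted(os.path.join(root, f) for f in os.listdir(root))
        self.transform = transform

    def __getitem__(self, index):
        path = self.paths[index]
        image = Image.open(path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        return image, path

    def __len__(self):
        return len(self.paths)

Two such datasets can then be passed to torch.utils.data.ConcatDataset, and each sample still carries its path (paths are collated into lists of strings by the default collate function).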
If we want to combine two imbalanced datasets and draw balanced samples, I think we could use ConcatDataset and pass a WeightedRandomSampler to the DataLoader:
dataset1 = custom_dataset1()
dataset2 = custom_dataset2()
concat_dataset = torch.utils.data.ConcatDataset([dataset1, dataset2])
dataloader = torch.utils.data.DataLoader(concat_dataset, batch_size=bs, sampler=weighted_sampler)
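For completeness, a minimal sketch of how weighted_sampler could be built before creating the DataLoader above, so that both datasets are drawn from roughly equally; this particular weighting scheme is my assumption, not something from the original post:

# One weight per sample in the concatenated dataset; samples from the
# smaller dataset get a proportionally larger weight.
weights = torch.cat([
    torch.full((len(dataset1),), 1.0 / len(dataset1)),
    torch.full((len(dataset2),), 1.0 / len(dataset2)),
])
weighted_sampler = torch.utils.data.WeightedRandomSampler(
    weights, num_samples=len(concat_dataset), replacement=True)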
I am looking for an answer to this. Do you have any idea about it? Thank you for your help.
Thanks a lot. Really helped me with training my CycleGAN network.
Maybe we can solve this by:
class ConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i % len(d)] for d in self.datasets)

    def __len__(self):
        return max(len(d) for d in self.datasets)


train_loader = torch.utils.data.DataLoader(
    ConcatDataset(
        datasets.ImageFolder(traindir_A),
        datasets.ImageFolder(traindir_B)
    ),
    batch_size=args.batch_size, shuffle=True,
    num_workers=args.workers, pin_memory=True)

for i, (input, target) in enumerate(train_loader):
    ...
Question #1: When I try this, it loops through the shorter dataset in the group. So if dataset A has 100 images and dataset B has 1000 images, and I call ConcatDataset(dataset_A, dataset_B)[100], I'll get a tuple filled with (dataset_A[0], dataset_B[100]). Does this make sense when putting this into a loader for training? Won't I overfit on the smaller dataset?
Question #2: Now we don't just have (input, target), we have ((input_1, target_1), (input_2, target_2)).
How do I train when the loader gives me a list of lists like this? Do I select randomly from the first list for my input? Or is this where weighted sampling comes in?
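For reference, the kind of training loop I imagine for these nested batches, as a rough sketch only; model, criterion, and optimizer are placeholders:

for (input_1, target_1), (input_2, target_2) in train_loader:
    # one (input, target) pair per wrapped dataset
    loss = criterion(model(input_1), target_1) + criterion(model(input_2), target_2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()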
I also have the same question. Please let me know the best way to solve this problem. I don't think we can use weighted random sampling here; if we can, please let me know how to do it.
Hello, I'm facing a similar problem and none of the solutions above fit. I'm running semi-supervised experiments and I'd like each batch to contain, say, n observations from the labelled dataset and m observations from the unlabelled dataset. Each of these goes through a different objective function, but the losses are added together before making an optimization step. So I would really need a loader formatted to sample from two different datasets at a time. Does anyone know an elegant way to do this?
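A sketch of the kind of loop I mean: two loaders with different batch sizes, with the smaller labelled loader cycled so every step sees n labelled and m unlabelled observations. The datasets, model, optimizer, and the two loss functions below are placeholders, and I assume the labelled dataset returns (x, y) pairs while the unlabelled one returns x alone:

import torch

labelled_loader = torch.utils.data.DataLoader(labelled_dataset, batch_size=n, shuffle=True)
unlabelled_loader = torch.utils.data.DataLoader(unlabelled_dataset, batch_size=m, shuffle=True)

def cycle(loader):
    # re-iterate the (smaller) loader indefinitely; reshuffles every pass
    while True:
        for batch in loader:
            yield batch

for x_u, (x_l, y_l) in zip(unlabelled_loader, cycle(labelled_loader)):
    # one objective per data source, summed before the optimization step
    loss = supervised_loss(model(x_l), y_l) + unsupervised_loss(model(x_u))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()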
class BalancedConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets
        self.max_len = max(len(d) for d in self.datasets)
        self.min_len = min(len(d) for d in self.datasets)

    def __getitem__(self, i):
        return tuple(d[i % len(d)] for d in self.datasets)

    def masks_collate(self, batch):
        # Only image - mask
        images, masks = [], []
        for item in range(len(batch)):
            for c_dataset in range(len(batch[item])):
                images.append(batch[item][c_dataset][0])
                masks.append(batch[item][c_dataset][1])
        images = torch.stack(images)
        masks = torch.stack(masks)
        return images, masks

    def __len__(self):
        return self.max_len
It would be masks or labels
Hi @apaszke, when I use this function it turns my dataset, which is composed of tensors, into lists. Is there a solution for this?
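One workaround I can think of, as a sketch only: pass a custom collate_fn that re-stacks the tensors instead of leaving them in lists. This assumes each wrapped dataset returns a single tensor per index; the function name and loader settings are placeholders, and the inner indexing would need adjusting if the samples are (input, target) tuples:

import torch

def tuple_collate(batch):
    # batch is a list of tuples, one entry per wrapped dataset
    n_datasets = len(batch[0])
    return tuple(torch.stack([sample[d] for sample in batch]) for d in range(n_datasets))

loader = torch.utils.data.DataLoader(ConcatDataset(dataset_A, dataset_B),
                                     batch_size=4, collate_fn=tuple_collate)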
Any luck on a solution @MarkovChain? Currently I pass multiple datasets to CycleConcatDataset and then define a dataloader on it with a single batch size. This essentially batches all the datasets and cycles through the shorter ones until the longest dataset finishes.
In my use case (semi-supervised learning and domain adaptation) I would like to keep the parameter updates as balanced as possible. This cycling method is a bit unfair, as the shorter datasets update the parameters more often.
I think one way to help my particular use case is to somehow use different batch sizes for each dataset.
class CycleConcatDataset(data.Dataset):
    '''Dataset wrapping multiple train datasets

    Parameters
    ----------
    *datasets : sequence of torch.utils.data.Dataset
        Datasets to be concatenated and cycled
    '''
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        result = []
        for dataset in self.datasets:
            cycled_i = i % len(dataset)
            result.append(dataset[cycled_i])
        return tuple(result)

    def __len__(self):
        return max(len(d) for d in self.datasets)
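For reference, this is roughly how I wrap it in a single DataLoader; the dataset names and loader settings below are placeholders:

concat_dataset = CycleConcatDataset(dataset_a, dataset_b, dataset_c)
loader = torch.utils.data.DataLoader(concat_dataset, batch_size=args.batch_size,
                                     shuffle=True, num_workers=args.workers)

for batch_a, batch_b, batch_c in loader:
    # each batch_* is the collated batch from its corresponding dataset
    ...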
If you are looking to use multiple dataloaders at the same time, this should work:
class cat_dataloaders():
    """Class to concatenate multiple dataloaders"""

    def __init__(self, dataloaders):
        self.dataloaders = dataloaders

    def __iter__(self):
        self.loader_iter = []
        for data_loader in self.dataloaders:
            self.loader_iter.append(iter(data_loader))
        return self

    def __next__(self):
        out = []
        for data_iter in self.loader_iter:
            out.append(next(data_iter))  # may raise StopIteration
        return tuple(out)
Here is a quick example:

import torch
from torch.utils.data import Dataset, DataLoader

class DEBUG_dataset(Dataset):
    def __init__(self, alpha):
        self.d = (torch.arange(20) + 1) * alpha

    def __len__(self):
        return self.d.shape[0]

    def __getitem__(self, index):
        return self.d[index]


train_dl1 = DataLoader(DEBUG_dataset(10), batch_size=4, num_workers=0, shuffle=True)
train_dl2 = DataLoader(DEBUG_dataset(1), batch_size=4, num_workers=0, shuffle=True)
tmp = cat_dataloaders([train_dl1, train_dl2])
for x in tmp:
    print(x)
output is
(tensor([140, 160, 130, 90]), tensor([ 5, 10, 8, 9]))
(tensor([120, 30, 170, 70]), tensor([15, 17, 18, 7]))
(tensor([180, 50, 190, 80]), tensor([ 6, 14, 3, 2]))
(tensor([ 10, 40, 150, 100]), tensor([11, 13, 4, 1]))
(tensor([ 60, 200, 110, 20]), tensor([19, 12, 20, 16]))
Bro, thanks for saving my time lol.
import numpy as np


def cycle(iterable):
    while True:
        for x in iterable:
            yield x


class MultiTaskDataloader(object):
    def __init__(self, tau=1.0, **dataloaders):
        self.dataloaders = dataloaders
        Z = sum(pow(v, tau) for v in self.dataloader_sizes.values())
        self.tasknames, self.sampling_weights = zip(*((k, pow(v, tau) / Z) for k, v in self.dataloader_sizes.items()))
        self.dataiters = {k: cycle(v) for k, v in dataloaders.items()}

    @property
    def dataloader_sizes(self):
        if not hasattr(self, '_dataloader_sizes'):
            self._dataloader_sizes = {k: len(v) for k, v in self.dataloaders.items()}
        return self._dataloader_sizes

    def __len__(self):
        return sum(v for k, v in self.dataloader_sizes.items())

    def __iter__(self):
        for i in range(len(self)):
            taskname = np.random.choice(self.tasknames, p=self.sampling_weights)
            dataiter = self.dataiters[taskname]
            batch = next(dataiter)
            batch['task'] = taskname
            yield batch
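A small usage sketch of the class above; the DictDataset, task names, and sizes are made up, and it assumes each batch is a dict, since __iter__ adds a 'task' key to it:

import torch
from torch.utils.data import DataLoader, Dataset

class DictDataset(Dataset):
    """Toy dataset whose samples are dicts, so collated batches are dicts too."""
    def __init__(self, n):
        self.x = torch.randn(n, 3)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return {'x': self.x[i]}

loader_a = DataLoader(DictDataset(100), batch_size=8, shuffle=True)
loader_b = DataLoader(DictDataset(40), batch_size=8, shuffle=True)

multi_loader = MultiTaskDataloader(tau=0.5, task_a=loader_a, task_b=loader_b)
for batch in multi_loader:
    print(batch['task'], batch['x'].shape)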
Hi,
Could you show how one can define a distributed sampler for the MultiTaskDataloader that @AlongWY wrote? This is for training a model across multiple TPU cores, where the data needs to be distributed over the cores. Thanks a lot in advance.
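One possible approach, as a sketch only: give each underlying DataLoader its own DistributedSampler so that every replica (TPU core) sees its own shard, then wrap the sharded loaders in MultiTaskDataloader as before. dataset_a, dataset_b, world_size, and rank are placeholders here; with torch_xla they would typically come from the XLA runtime (e.g. xm.xrt_world_size() and xm.get_ordinal()):

import torch
from torch.utils.data.distributed import DistributedSampler

def make_sharded_loader(dataset, batch_size, world_size, rank):
    # each replica iterates over a disjoint shard of the dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=sampler)

loader_a = make_sharded_loader(dataset_a, 8, world_size, rank)
loader_b = make_sharded_loader(dataset_b, 8, world_size, rank)
multi_loader = MultiTaskDataloader(tau=1.0, task_a=loader_a, task_b=loader_b)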
Hi there,
Could you provide an example of how the sampling would be done if this were not an iterable-style dataset but a map-style one? Thanks.
Hi, I found a much easier solution and wanted to share it here:

dataset_3 = torch.utils.data.ConcatDataset((dataset_1, dataset_2))

Each of the datasets is of type torch.utils.data.dataset.Dataset. This command concatenates both datasets, and a data loader can later be prepared from the result.

len(dataset_1) = 200
len(dataset_2) = 300
len(dataset_3) = 500
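For example, a loader can then be built from it as usual (the batch size here is arbitrary):

loader_3 = torch.utils.data.DataLoader(dataset_3, batch_size=32, shuffle=True)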
Thank you, it really helps.
I'm getting:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
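A note on this error: it usually appears when a DataLoader with num_workers > 0 is created at module level on a platform that spawns worker processes (e.g. Windows). The idiom the message asks for is roughly the following; dataset_3 and the loader settings are just placeholders:

import torch

def main():
    loader = torch.utils.data.DataLoader(dataset_3, batch_size=32,
                                         shuffle=True, num_workers=4)
    for batch in loader:
        ...

if __name__ == '__main__':
    main()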