Can a PyTorch Dataset be used for generating instances?

I have a problem instance class which takes batch_size and data_property and randomly generates batch_size problem instances based on the given data_property, as shown below:

class Problem(object):
  def __init__(self, batch_size, data_property):
    super(Problem, self).__init__()
    self.batch_size = batch_size
    self.fields_A = self.generate_A(batch_size, data_property)
    self.fields_B = self.generate_B(batch_size, data_property)

How can I make this class / adjust this class so it can be a Pytorch Dataset?

import torch

class ProblemDataset(torch.utils.data.Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, data_properties):
        'Initialization'
        self.data_properties = data_properties

    def __len__(self):
        # What to do here?

    def __getitem__(self, index):
        # What to do here?

If I understand your use case correctly, your Problem object yields a batch of samples without using any index, which makes sense since you are randomly generating the data.
In that case you could check if an IterableDataset would make more sense, which would return an iterator in its __iter__ method.
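Something like this rough sketch might work, assuming your Problem class stays as posted (ProblemIterableDataset and num_batches are just placeholder names for the idea):

import torch

class ProblemIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, data_property, batch_size, num_batches):
        super().__init__()
        self.data_property = data_property
        self.batch_size = batch_size
        self.num_batches = num_batches  # how many batches one pass over the dataset should yield

    def __iter__(self):
        # create a fresh, randomly generated Problem for every batch
        for _ in range(self.num_batches):
            problem = Problem(self.batch_size, self.data_property)
            yield {"fields_A": problem.fields_A, "fields_B": problem.fields_B}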

Yes. And because we are producing one Problem object at a time with an IterableDataset, is there any benefit to making this object an IterableDataset? If I may add to the question, how will the object be scattered if we are using DistributedDataParallel?

Once you use the PyTorch data classes, you could wrap it into a DataLoader, which would give you multiple workers etc.
With DistributedDataParallel you would use a DistributedSampler (probably not possible or necessary in your case, as you are randomly generating the data). Besides that, each process will create its own Dataset or DataLoader. In the latter case each worker will create a copy of the passed Dataset.
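As a rough example (just a sketch; it assumes the ProblemIterableDataset above and a data_property you define):

from torch.utils.data import DataLoader

dataset = ProblemIterableDataset(data_property, batch_size=64, num_batches=100)
# batch_size=None disables automatic batching, since the dataset already yields whole batches;
# with num_workers > 0 each worker holds its own copy of the dataset and generates its own data
loader = DataLoader(dataset, batch_size=None, num_workers=2)

for batch in loader:
    fields_A = batch["fields_A"]
    fields_B = batch["fields_B"]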

I see, the latter case is using RPC so each process can solve a separate problem and synchronize after every epoch.
But what if I just generate 1 batch at a time and return the fields (which are torch.Tensors)?

import torch

class ProblemDataset(torch.utils.data.Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, data_properties, batch_size):
        'Initialization'
        self.data_properties = data_properties
        self.data_property_idx = 0
        self.batch_size = batch_size

    def next_data_property(self):
        # switch to the next data_property, e.g. after every epoch
        self.data_property_idx += 1

    def __len__(self):
        return self.batch_size

    def __getitem__(self, index):
        data_property = self.data_properties[self.data_property_idx]
        problem = Problem(batch_size=1, data_property=data_property)
        fields_A = problem.fields_A
        fields_B = problem.fields_B
        return {"fields_A": fields_A, "fields_B": fields_B}

Is this a proper Dataset? And can it be used with DistributedSampler and DistributedDataParallel? If I can use it this way, what should be returned by __len__?
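For context, this is roughly how I would consume it (just a sketch; data_properties, the sizes, and num_epochs are placeholders, and I rely on the default collate_fn to stack the returned tensors):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dataset = ProblemDataset(data_properties, batch_size=128)
sampler = DistributedSampler(dataset)  # requires torch.distributed to be initialized
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)
    for batch in loader:
        fields_A = batch["fields_A"]  # stacked by the default collate_fn
        fields_B = batch["fields_B"]
    dataset.next_data_property()  # switch to the next data_property after the epoch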