Feed custom labels which can be out of range

We are going to work with millions of labels and store them in Redis.
To reduce the amount of memory used by currently unused labels/classes, we fill the unused slots with None, but each None slot in the list still costs memory (a pointer per list entry).

I have tried to fool PyTorch by feeding only the current batch of labels instead of all millions of them, but it doesn't work: PyTorch complains about a length mismatch between classes and the provided class_to_idx, and one of the built-in CUDA assertions fails.

/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
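For reference, the same out-of-range target is easier to inspect on CPU, where PyTorch raises a regular Python exception instead of a device-side assert. A minimal sketch of triggering it (the tensor values are made up):

```python
import torch
import torch.nn.functional as F

# 4 output classes -> valid targets are 0..3
logits = torch.randn(2, 4)
# 7 is out of range, like a huge sparse Redis index would be
targets = torch.tensor([1, 7])

try:
    F.cross_entropy(logits, targets)
except IndexError as e:
    print("cross_entropy failed:", e)
```

This is the CPU counterpart of the `t >= 0 && t < n_classes` assert: every target must be a valid index into the logits' class dimension.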

How we are changing the dataset:

def change_dataset(dataset, redis_class_to_idx):
    # The class list must be long enough to cover the largest Redis index.
    num_classes = max(redis_class_to_idx.values()) + 1
    actual_classes = [None] * num_classes
    for i, (path, class_idx) in enumerate(dataset.samples):
        sample_class = dataset.classes[class_idx]
        if sample_class not in redis_class_to_idx:
            continue
        actual_class_idx = redis_class_to_idx[sample_class]
        actual_classes[actual_class_idx] = sample_class
        dataset.samples[i] = (path, actual_class_idx)
        dataset.targets[i] = actual_class_idx

    dataset.classes = actual_classes
    dataset.class_to_idx = redis_class_to_idx
    return dataset
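To make the function above runnable in isolation, here is a minimal harness; the SimpleNamespace stands in for torchvision's ImageFolder and mimics only the attributes the function touches (samples, targets, classes, class_to_idx):

```python
from types import SimpleNamespace

def change_dataset(dataset, redis_class_to_idx):
    # Copy of the remapping function above.
    num_classes = max(redis_class_to_idx.values()) + 1
    actual_classes = [None] * num_classes
    for i, (path, class_idx) in enumerate(dataset.samples):
        sample_class = dataset.classes[class_idx]
        if sample_class not in redis_class_to_idx:
            continue
        actual_class_idx = redis_class_to_idx[sample_class]
        actual_classes[actual_class_idx] = sample_class
        dataset.samples[i] = (path, actual_class_idx)
        dataset.targets[i] = actual_class_idx
    dataset.classes = actual_classes
    dataset.class_to_idx = redis_class_to_idx
    return dataset

# Stand-in for an ImageFolder with two local classes.
dataset = SimpleNamespace(
    classes=["a", "b"],
    class_to_idx={"a": 0, "b": 1},
    samples=[("a/1.jpg", 0), ("b/2.jpg", 1)],
    targets=[0, 1],
)
# Sparse "global" indices coming from Redis.
redis_class_to_idx = {"a": 5, "b": 7}

change_dataset(dataset, redis_class_to_idx)
print(dataset.targets)       # [5, 7]
print(len(dataset.classes))  # 8 entries, mostly None
```

This makes the memory problem visible: the classes list grows to max-index + 1 even though only two entries are real.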

Is there a way to make it work?
For example, we are feeding batch of data with labels/classes

[None, None, None, None, None, None, None, None, None, 'b980bc1d-e7ef-11ec-97ee-b025aa41e7a9' , 'bbc22780-e8b2-11ec-8fea-0242ac120002', 'bbc22ad2-e8b2-11ec-8fea-0242ac120002']

so that their actual class_to_idx will be for example:

{
   'b980bc1d-e7ef-11ec-97ee-b025aa41e7a9': 112378568,
   'bbc22780-e8b2-11ec-8fea-0242ac120002': 112378569,
   'bbc22ad2-e8b2-11ec-8fea-0242ac120002': 112378570
}

and to feed it properly, we need a list of length 112378571 (the largest index plus one) filled with None, with only the last three entries holding real labels/classes.

Basically, we keep the whole uuid_to_index hash in Redis, but to avoid holding it all in the app's runtime memory and feeding it in full on every training cycle, we modify the dataset as described above.
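One way to avoid the giant None-filled list entirely is to keep only the dict and remap each batch's sparse Redis indices to a dense 0..k-1 range before computing the loss. This is a sketch of the idea, not the poster's actual fix, and the helper name remap_batch is made up; note it is only valid if the model's output layer is likewise restricted to the batch's classes (as in sampled-softmax-style training):

```python
def remap_batch(batch_labels, redis_class_to_idx):
    """Map each label's sparse global index to a dense per-batch index.

    batch_labels: class names (uuids) for the current batch.
    Returns dense targets (0..k-1) and the dense->global lookup list,
    so predictions can be translated back after the forward pass.
    """
    global_indices = sorted({redis_class_to_idx[c] for c in batch_labels})
    global_to_dense = {g: d for d, g in enumerate(global_indices)}
    targets = [global_to_dense[redis_class_to_idx[c]] for c in batch_labels]
    return targets, global_indices

redis_class_to_idx = {
    "b980bc1d-e7ef-11ec-97ee-b025aa41e7a9": 112378568,
    "bbc22780-e8b2-11ec-8fea-0242ac120002": 112378569,
}
batch = [
    "b980bc1d-e7ef-11ec-97ee-b025aa41e7a9",
    "bbc22780-e8b2-11ec-8fea-0242ac120002",
    "b980bc1d-e7ef-11ec-97ee-b025aa41e7a9",
]

targets, dense_to_global = remap_batch(batch, redis_class_to_idx)
print(targets)          # [0, 1, 0]
print(dense_to_global)  # [112378568, 112378569]
```

With dense targets, every value is within range of a small logits tensor, so the `t >= 0 && t < n_classes` assertion can no longer fire.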

I don’t really follow this part. Can you elaborate on it and provide a minimal runnable code example?

Aside from that, is it possible for you to read the data iterable-style? It seems like it should be possible, but I can’t tell without knowing how your Dataset is defined.

If so, you can check out IterDataPipe from our new library, torchdata.

You can also use IterableDataset within torch.
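For example, a minimal IterableDataset that streams (sample, target) pairs lazily instead of materializing a full list up front (the record contents here are made up; in practice the iterator could page results out of Redis):

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingLabels(IterableDataset):
    """Yields (features, target) pairs one at a time, so nothing the
    size of the full class list ever lives in memory at once."""

    def __init__(self, records):
        # records: any iterable of (feature_tensor, target_int) pairs.
        self.records = records

    def __iter__(self):
        for features, target in self.records:
            yield features, target

records = [(torch.ones(3) * i, i) for i in range(4)]
loader = DataLoader(StreamingLabels(records), batch_size=2)
for features, targets in loader:
    print(targets)
```

The DataLoader batches the stream as it goes; combined with per-batch target remapping, this avoids ever building the millions-long list.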


Thank you. I’m pretty new to Python, so the issue was pretty silly and was easily solved by replacing the list with a dict…