We are working with millions of labels and storing them in Redis.
To reduce the memory used by currently unused labels/classes, we fill the gaps with None, but even a None entry still costs a full list slot, i.e. one pointer (8 bytes on a 64-bit CPython build).
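For scale, the padding cost can be measured directly (a quick sketch; the ~8 bytes per slot figure assumes a 64-bit CPython build):

```python
import sys

# Each slot of a Python list holds a pointer to an object, so even slots
# that all point at the shared None singleton cost 8 bytes apiece on
# 64-bit builds.
padded = [None] * 1_000_000
print(sys.getsizeof(padded))  # ~8 MB for the pointer array alone
```

With ~112 million slots, as in the example below, that is close to a gigabyte spent mostly on None.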
I have tried to fool PyTorch by feeding only the current batch of labels instead of all the millions, but it doesn't work: PyTorch complains about a length mismatch between the classes and the provided class_to_idx, and one of its built-in assertions fails:
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
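The assertion `t >= 0 && t < n_classes` means every target must index into the logits' class dimension; n_classes is taken from the model output shape, not from class_to_idx. A minimal repro of the same check (assuming a recent PyTorch; on CPU it surfaces as a Python exception rather than a device-side assert):

```python
import torch
import torch.nn.functional as F

# n_classes comes from the logits' second dimension.
logits = torch.randn(2, 3)       # batch of 2 samples, 3 classes
targets = torch.tensor([1, 5])   # 5 >= 3, so the t < n_classes check fails

try:
    F.cross_entropy(logits, targets)
except Exception as exc:         # on CPU this is an IndexError, not a CUDA assert
    print(type(exc).__name__)    # -> IndexError
```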
How we change the dataset:
    def change_dataset(dataset, redis_class_to_idx):
        # The dense class list must be long enough to hold the largest Redis index.
        num_classes = max(redis_class_to_idx.values()) + 1
        actual_classes = [None] * num_classes
        for i, (path, class_idx) in enumerate(dataset.samples):
            sample_class = dataset.classes[class_idx]
            # Skip samples whose class is not registered in Redis.
            if sample_class not in redis_class_to_idx:
                continue
            actual_class_idx = redis_class_to_idx[sample_class]
            actual_classes[actual_class_idx] = sample_class
            dataset.samples[i] = (path, actual_class_idx)
            dataset.targets[i] = actual_class_idx
        dataset.classes = actual_classes
        dataset.class_to_idx = redis_class_to_idx
        return dataset
Is there a way to make it work?
For example, we feed a batch of data with labels/classes
[None, None, None, None, None, None, None, None, None, 'b980bc1d-e7ef-11ec-97ee-b025aa41e7a9' , 'bbc22780-e8b2-11ec-8fea-0242ac120002', 'bbc22ad2-e8b2-11ec-8fea-0242ac120002']
so that their actual class_to_idx will be, for example:
{
    'b980bc1d-e7ef-11ec-97ee-b025aa41e7a9': 112378568,
    'bbc22780-e8b2-11ec-8fea-0242ac120002': 112378569,
    'bbc22ad2-e8b2-11ec-8fea-0242ac120002': 112378570
}
and to feed it properly, we need a list of length 112378571 filled with None, with only the last 3 entries holding real labels/classes.
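To make "feed only the current batch of labels" concrete, this is roughly the remapping we attempted, sketched in pure Python (remap_batch is a hypothetical helper, not code from our project):

```python
# Hypothetical per-batch remapping: give dense 0..k-1 indices only to the
# labels that actually appear in the batch.
def remap_batch(batch_labels, redis_class_to_idx):
    present = sorted({redis_class_to_idx[l] for l in batch_labels if l is not None})
    dense = {global_idx: i for i, global_idx in enumerate(present)}
    return [dense[redis_class_to_idx[l]] for l in batch_labels if l is not None]

labels = ['b980bc1d-e7ef-11ec-97ee-b025aa41e7a9',
          'bbc22780-e8b2-11ec-8fea-0242ac120002',
          'bbc22ad2-e8b2-11ec-8fea-0242ac120002']
mapping = {labels[0]: 112378568, labels[1]: 112378569, labels[2]: 112378570}
print(remap_batch(labels, mapping))  # -> [0, 1, 2]
```

The dense indices 0..2 are what the loss would accept, but they no longer agree with the Redis class_to_idx, which is exactly the mismatch PyTorch complains about.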