Bug when "numpy.str_" occurs in DataLoader?

If, for some reason, my sample contains data of type numpy.str_ instead of str, e.g., type(sample['filepath']) == numpy.str_, then a KeyError occurs when it is passed to DataLoader. I looked into it and think I may have found a bug. Consider the function default_collate below:

def default_collate(batch):
    "Puts each data field into a tensor with outer dimension batch size"
    if torch.is_tensor(batch[0]):
        out = None
        if _use_shared_memory:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum([x.numel() for x in batch])
            storage = batch[0].storage()._new_shared(numel)
            out = batch[0].new(storage)
        return torch.stack(batch, 0, out=out)
    elif type(batch[0]).__module__ == 'numpy':
        elem = batch[0]
        if type(elem).__name__ == 'ndarray':
            return torch.stack([torch.from_numpy(b) for b in batch], 0)
        if elem.shape == ():  # scalars
            py_type = float if elem.dtype.name.startswith('float') else int
            return numpy_type_map[elem.dtype.name](list(map(py_type, batch)))
    elif isinstance(batch[0], int):
        return torch.LongTensor(batch)
    elif isinstance(batch[0], float):
        return torch.DoubleTensor(batch)
    elif isinstance(batch[0], string_classes):
        return batch
    elif isinstance(batch[0], collections.Mapping):
        return {key: default_collate([d[key] for d in batch]) for key in batch[0]}
    elif isinstance(batch[0], collections.Sequence):
        transposed = zip(*batch)
        return [default_collate(samples) for samples in transposed]

    raise TypeError(("batch must contain tensors, numbers, dicts or lists; found {}"
                     .format(type(batch[0]))))

When the data is a numpy.str_, it passes the check elif type(batch[0]).__module__ == 'numpy': and is then treated as either an 'ndarray' or a numeric scalar. The workaround on my side is to set sample['filepath'] = str(sample['filepath']), but it would be nice if the function could at least tell people about it.
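To make the failure path concrete, here is a minimal sketch that inspects the same attributes the numpy branch above checks. It needs only numpy. The file path and the numeric_dtype_names set are made up for illustration; the set is an assumed stand-in for the keys of numpy_type_map, whose exact contents are not shown in this post.

```python
import numpy as np

value = np.str_("images/0001.png")  # hypothetical file path

# Checks mirrored from the numpy branch of default_collate.
is_numpy_module = type(value).__module__ == "numpy"
is_ndarray = type(value).__name__ == "ndarray"
is_scalar_shape = value.shape == ()
dtype_name = value.dtype.name

# Assumed stand-in for the keys of numpy_type_map (numeric dtypes only).
numeric_dtype_names = {"float64", "float32", "int64", "int32"}
would_raise_key_error = dtype_name not in numeric_dtype_names

print(is_numpy_module, is_ndarray, is_scalar_shape, would_raise_key_error)
```

If the value enters the numpy branch, is not an ndarray, and has a scalar shape, then the lookup numpy_type_map[elem.dtype.name] is the step that fails, because the dtype name of a string scalar is not a numeric one. numpy.str_ is also a subclass of str, so the string branch would presumably accept it if that check were reached first.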

Is this really a bug? Also, the next time I find something like this, should I open an issue on GitHub or come here first? Thanks so much.

This does seem like a bug; at the very least, it should give a better error message. Please do open an issue, thank you.

That said, DataLoader accepts a custom collate_fn, which lets you collate batches however you prefer instead of using default_collate.
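As a rough sketch of that idea, the wrapper below converts any str subclass (numpy.str_ is one) to a built-in str before handing the batch to another collate function. The names to_plain_str and make_collate are hypothetical helpers, not part of PyTorch, and the code is plain Python so it can be tried without torch installed.

```python
from collections.abc import Mapping


def to_plain_str(obj):
    """Recursively replace str subclasses (e.g. numpy.str_) with built-in str."""
    if isinstance(obj, str):
        return str(obj)
    if isinstance(obj, Mapping):
        return {key: to_plain_str(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [to_plain_str(item) for item in obj]
    if isinstance(obj, tuple):
        # Note: namedtuples come back as plain tuples in this sketch.
        return tuple(to_plain_str(item) for item in obj)
    return obj


def make_collate(base_collate):
    """Wrap a collate function so every sample is sanitized first."""
    def collate(batch):
        return base_collate([to_plain_str(sample) for sample in batch])
    return collate
```

Assuming default_collate is importable in your PyTorch version, this could be used as DataLoader(dataset, collate_fn=make_collate(default_collate)), where dataset is your own Dataset. Tensors, numbers, and other values pass through untouched, so only the string fields change.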

Thank you. I will consider using collate_fn in the future.