Unexplained PyTorch datatype

joepareti54 · May 12, 2022, 5:02pm

In this program, I start with a NumPy array created with the default dtype. The array is wrapped in a PyTorch Dataset class, and the dataset is then passed to a DataLoader. Without my specifying any dtype, the batches come out as tensors with dtype=torch.float64. Why?

import numpy as np
import torch
import torch.nn.functional as F
from torch import nn, optim
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchvision import transforms
np.random.seed(seed=115)
X = np.random.rand(10,3)
class Wine(Dataset):
    def __init__(self,X):
        self.x = X[:, 1:]
        self.y = X[:, 0]
        self.num_samples = X.shape[0]

    def __len__(self):
        return(self.num_samples)

    def __getitem__(self, index):
        return (self.x[index] , self.y[index] )
#        
WineDS = Wine(X)
#
print('size of train ds ',len(WineDS))
train_dataloader = DataLoader(WineDS, batch_size=2, shuffle=False)
#
train_features , train_labels = next(iter(train_dataloader))
print(train_features)
print('----')
print(train_labels)

InnovArul · May 12, 2022, 5:22pm

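This happens because DataLoader uses the default collation function when you do not provide one through its collate_fn option. The default collation converts NumPy arrays to tensors and keeps their dtype, and np.random.rand returns float64 arrays, so the batches come out as torch.float64. You can pass your own function as collate_fn if you want to handle the batch of data in some other way.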

github.com

pytorch/pytorch/blob/e36a8c1f137af6de7fc5f8d2160f0bb355738824/torch/utils/data/_utils/collate.py#L84-L130

      
        
def default_collate(batch):
    r"""
        Function that takes in a batch of data and puts the elements within the batch
        into a tensor with an additional outer dimension - batch size. The exact output type can be
        a :class:`torch.Tensor`, a `Sequence` of :class:`torch.Tensor`, a
        Collection of :class:`torch.Tensor`, or left unchanged, depending on the input type.
        This is used as the default function for collation when
        `batch_size` or `batch_sampler` is defined in :class:`~torch.utils.data.DataLoader`.

        Here is the general input type (based on the type of the element within the batch) to output type mapping:

            * :class:`torch.Tensor` -> :class:`torch.Tensor` (with an added outer dimension batch size)
            * NumPy Arrays -> :class:`torch.Tensor`
            * `float` -> :class:`torch.Tensor`
            * `int` -> :class:`torch.Tensor`
            * `str` -> `str` (unchanged)
            * `bytes` -> `bytes` (unchanged)
            * `Mapping[K, V_i]` -> `Mapping[K, default_collate([V_1, V_2, ...])]`
            * `NamedTuple[V1_i, V2_i, ...]` -> `NamedTuple[default_collate([V1_1, V1_2, ...]),
              default_collate([V2_1, V2_2, ...]), ...]`

This file has been truncated.
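The dtype of the batch follows the dtype of the input array, which can be checked by calling the collation function directly. The snippet below is a minimal sketch, not part of the original post: it imports default_collate from the module path linked above, and the sample arrays are made up for illustration.

```python
import numpy as np
import torch
from torch.utils.data._utils.collate import default_collate

# Hypothetical samples: two feature vectors, as a Dataset might return them.
# np.random.rand produces float64 arrays by default.
batch64 = [np.random.rand(2) for _ in range(2)]
collated64 = default_collate(batch64)
print(collated64.dtype)  # torch.float64

# The same samples cast to float32 before collation stay float32.
batch32 = [sample.astype(np.float32) for sample in batch64]
collated32 = default_collate(batch32)
print(collated32.dtype)  # torch.float32
```

In both cases the result has shape (2, 2): the default collation only adds the batch dimension and leaves the dtype alone.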

joepareti54 · May 12, 2022, 6:53pm

Thank you. How would you change the default so that the DataLoader returns 32-bit data instead of 64-bit?

InnovArul · May 12, 2022, 10:25pm

A straightforward way is to call .float() on train_features and train_labels, which casts them to torch.float32.

train_features = train_features.float()
train_labels = train_labels.float()

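If you want to do this at the collation level, you can write a custom collate function and pass it to DataLoader as shown here.

def collate_fn(batch):
    # batch is a list of (data, label) pairs returned by Dataset.__getitem__
    X, y = [], []
    for data, label in batch:
        X.append(torch.tensor(data))
        y.append(torch.tensor(label))

    # stack the samples along a new batch dimension, then cast to float32
    return torch.stack(X).float(), torch.stack(y).float()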

...
train_dataloader = DataLoader(WineDS, batch_size=2, shuffle=False, collate_fn=collate_fn)
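A possible variant, sketched here under the assumption that every sample is a (features, label) pair of NumPy values, is to reuse the default collation and only cast its result. The names float32_collate and ToyPairs are hypothetical and exist only for this illustration; ToyPairs stands in for the Wine dataset from the question.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data._utils.collate import default_collate


def float32_collate(batch):
    """Collate with the default rules, then cast both outputs to float32."""
    features, labels = default_collate(batch)
    return features.float(), labels.float()


class ToyPairs(Dataset):
    """Hypothetical stand-in for the Wine dataset, holding float64 NumPy data."""

    def __init__(self, data):
        self.x = data[:, 1:]
        self.y = data[:, 0]

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        return self.x[index], self.y[index]


loader = DataLoader(ToyPairs(np.random.rand(10, 3)), batch_size=2,
                    shuffle=False, collate_fn=float32_collate)
features, labels = next(iter(loader))
print(features.dtype, labels.dtype)  # torch.float32 torch.float32
```

This keeps the stacking logic in one place, so only the cast has to be maintained by hand.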

diegoaichele · May 13, 2022, 4:24am

Something like this?

import numpy as np
import torch
import torch.nn.functional as F
from torch import nn, optim
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchvision import transforms
np.random.seed(seed=115)
X = np.random.rand(10,3)
class Wine(Dataset):
    def __init__(self,X):
        self.x = X[:, 1:]
        self.y = X[:, 0]
        self.num_samples = X.shape[0]

    def __len__(self):
        return(self.num_samples)

    def __getitem__(self, index):
        x = torch.tensor(self.x[index] , dtype= torch.float32)
        y = torch.tensor(self.y[index] , dtype= torch.float32) 
        return x, y
#        
WineDS = Wine(X)
#
print('size of train ds ',len(WineDS))
train_dataloader = DataLoader(WineDS, batch_size=2, shuffle=False)
#
train_features , train_labels = next(iter(train_dataloader))
print(train_features)
print(train_features.dtype)
print('----')
print(train_labels)
print(train_labels.dtype)

// size of train ds 10
// tensor([[0.7028, 0.4137],
//         [0.7279, 0.1905]])
// torch.float32
// ----
// tensor([0.1961, 0.5768])
// torch.float32
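Another option, which is a sketch and not something proposed in the posts above, is to convert the whole array to a float32 tensor once in __init__ instead of building a new tensor on every __getitem__ call. The class name WineFloat32 is hypothetical; the rest mirrors the code from the question.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class WineFloat32(Dataset):
    """Hypothetical variant of Wine that stores float32 tensors up front."""

    def __init__(self, X):
        # One-time conversion: float64 NumPy array -> float32 tensor.
        data = torch.as_tensor(X, dtype=torch.float32)
        self.x = data[:, 1:]
        self.y = data[:, 0]
        self.num_samples = data.shape[0]

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        # Indexing a tensor returns a tensor, so no per-item conversion is needed.
        return self.x[index], self.y[index]


np.random.seed(seed=115)
dataset = WineFloat32(np.random.rand(10, 3))
loader = DataLoader(dataset, batch_size=2, shuffle=False)
features, labels = next(iter(loader))
print(features.dtype, labels.dtype)  # torch.float32 torch.float32
```

Since the items are already tensors, the default collation just stacks them and the float32 dtype carries through to the batch.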