Session crashes when making a DataLoader from a numpy array

This is the shape of my numpy array: (35628, 1, 16000)

  1. I train/test split it:
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(datan, label, test_size=0.2, shuffle=True, random_state=40)
    (this works fine)

  2. Make datasets from the split arrays:
    import torch
    import torch.utils.data as utils

    # stack the training samples into one (N, 1, 16000) tensor and wrap it in a TensorDataset
    tensor_x = torch.stack([torch.Tensor(i) for i in list(x_train)])
    tensor_y = torch.Tensor(y_train)
    my_dataset = utils.TensorDataset(tensor_x, tensor_y)
    trainloader = utils.DataLoader(my_dataset, batch_size=1)

    # same for the test split
    tensor_xte = torch.stack([torch.Tensor(i) for i in list(x_test)])
    tensor_yte = torch.Tensor(y_test)
    my_datasette = utils.TensorDataset(tensor_xte, tensor_yte)
    testloader = utils.DataLoader(my_datasette, batch_size=1)

ERROR HERE!!! (the session crashes in Google Colab)

This works well for all my other datasets, but it is not working for this one.

Is it because the data is large? If yes, how do I handle it?

Try to lower the size of your Dataset and run the code again.
I haven’t used Colab that often, but do you get any error message?

To lower the size, try to slice both numpy arrays:

datan = datan[:100]
label = label[:100]

Hi @ptrblck, yes, it works normally with a numpy array of shape (100, 1, 16000). How do I load a large numpy array into a DataLoader?

It should work with any size as long as your system can handle it properly.
If you are using np.float32 values, the data should take approx. 2GB of RAM (35628 * 1 * 16000 elements * 4 bytes ≈ 2.3 GB); the default np.float64 would need twice that.
Are you using multiple workers in your DataLoaders?
I'm not sure what limitations Colab has on the RAM.

Also, could you try to use torch.from_numpy instead of the list comprehension with torch.stack?
The former approach would avoid a copy of the data.
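
A minimal sketch of that suggestion, reusing the variable names from the post above (x_train, y_train from the train_test_split call); note that from_numpy keeps the array's dtype:

import torch
import torch.utils.data as utils

# torch.from_numpy shares memory with the numpy array instead of copying it,
# while torch.stack([torch.Tensor(i) for i in ...]) builds a second in-memory copy.
tensor_x = torch.from_numpy(x_train)   # keeps the array's dtype (float64 unless converted)
tensor_y = torch.from_numpy(y_train)

my_dataset = utils.TensorDataset(tensor_x, tensor_y)
trainloader = utils.DataLoader(my_dataset, batch_size=1)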

I converted the numpy array to float32 and used torch.from_numpy, and it worked.
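
Roughly, the change was along these lines (a simplified sketch with the variable names from the earlier posts):

import numpy as np
import torch
import torch.utils.data as utils

# converting float64 -> float32 halves the memory footprint (~2.3 GB instead of ~4.6 GB);
# astype itself makes one copy at the smaller dtype
x_train = x_train.astype(np.float32)
y_train = y_train.astype(np.float32)

# from_numpy then wraps the converted arrays without another copy
my_dataset = utils.TensorDataset(torch.from_numpy(x_train), torch.from_numpy(y_train))
trainloader = utils.DataLoader(my_dataset, batch_size=1)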

Thank you @ptrblck
