Is there a PyTorch method for parsing data? If not, how do we "shape" or "format" data correctly?

Harsh_G · May 20, 2018, 11:25pm

I don’t understand how exactly PyTorch feeds a set of data points to a given model.
Suppose I have a CSV file, ‘train.csv’, with two attributes, ‘x’ and ‘y’ (where we suppose ‘x’ is independent and ‘y’ is dependent), and there are 700 samples.

train = pd.read_csv('C:\\Users\\hgstr\\Jupyter_Files\\Data_Sets\\linear_regression\\train.csv')

Then using pandas I extract the data into a dataframe. Suppose afterwards I put both attributes into their own separate Tensors.

x = torch.Tensor(train['x'])
y = torch.Tensor(train['y'])

How do I ‘reshape’ or ‘view’ x and y so that I don’t get a size mismatch or an unexpected data type?

ptrblck · May 21, 2018, 12:22am

The Tensor shape depends on the operation you would like to perform.
In a simple case we would like to create a linear model. The nn.Linear layer takes in_features as an input and outputs out_features.
Your tensor should have the dimensions [batch_size, in_features] to be fed to the model.
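
As a minimal sketch of that shape requirement: the random tensor below is only a stand-in for a CSV column of 700 scalar values, and a bare nn.Linear layer is used in place of a full model.

```python
import torch
import torch.nn as nn

# Stand-in for one CSV column: 700 scalar samples in a 1-D tensor.
x = torch.randn(700)
print(x.shape)  # torch.Size([700])

# nn.Linear expects [batch_size, in_features]; here in_features = 1.
x = x.view(-1, 1)  # x.unsqueeze(1) would give the same shape
print(x.shape)  # torch.Size([700, 1])

model = nn.Linear(in_features=1, out_features=1)
out = model(x)
print(out.shape)  # torch.Size([700, 1])
```

Using -1 in view lets PyTorch infer the batch dimension, so the same line works if the number of samples changes.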

If you would like to use images for a CNN, your input tensors should be of dimension [batch_size, channel, height, width].
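
For the image case, a small sketch could look like this. The batch size, image size, and layer settings are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

# A batch of 8 RGB images of 32x32 pixels: [batch_size, channel, height, width].
images = torch.randn(8, 3, 32, 32)

# in_channels must match the channel dimension of the input.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
features = conv(images)
print(features.shape)  # torch.Size([8, 16, 32, 32])
```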

Do you have a specific issue or size mismatch problem, or is it a general question?

Harsh_G · May 21, 2018, 1:36am

To be specific, I have the following code:

import torch
import torch.nn as nn
from torch.autograd import Variable
import pandas as pd

class Linear_Reg(nn.Module):
    def __init__(self, inp_sz, out_sz):    
        super(Linear_Reg, self).__init__()
        self.linear = nn.Linear(inp_sz, out_sz)
        
    def forward(self, x):
        out = self.linear(x)
        return out
    
train = pd.read_csv('C:\\Users\\hgstr\\Jupyter_Files\\Data_Sets\\linear_regression\\train.csv')
test = pd.read_csv('C:\\Users\\hgstr\\Jupyter_Files\\Data_Sets\\linear_regression\\test.csv')

x_train = torch.Tensor(train['x'])
y_train = torch.Tensor(train['y'])

x_test = torch.Tensor(test['x'])
y_test = torch.Tensor(test['y'])

x_train = torch.Tensor(x_train)
y_train = torch.Tensor(y_train)
x_train = x_train.view(700,1)
y_train = y_train.view(700,1)
print(x_train.shape)
print(y_train.shape)

#================================
input_sz = 1
output_sz = 1
epochs = 60
learning_rate = 0.001
#================================

model = Linear_Reg(input_sz, output_sz)
crit = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), learning_rate)

for e in range(epochs):
    
    opt.zero_grad()
    out = model(x_train)
    
    loss = crit(out, y_train)
    loss.backward()
    opt.step()
    
    print('epoch {}, loss {}'.format(e, loss.item()))

This is a simple linear regression program.
You will notice that I reshape the x_train and y_train tensors with the view function because I was getting a size mismatch problem before, but now I am getting a NaN loss.

The attributes of train.csv are just x and y, with 700 samples.

Also, could you explain how to calculate batch size?

ptrblck · May 21, 2018, 9:07am

You choose the batch size based on your data, model, and hardware.
If you are using BatchNorm layers in your model, your batch size should not be smaller than ~64; with smaller batches the running stats could be noisy and you could observe degraded performance.
This of course also depends on the data you are using.

Another limitation could be the hardware. If you have a large model and would like to train it on the GPU, you could run into out of memory errors with a large batch size.

Usually you would use mini-batches with a batch size greater than 1 and smaller than the size of your dataset.
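
A rough sketch of mini-batch training with DataLoader follows. The synthetic data, batch_size=64, and the learning rate are all assumptions for illustration, not values taken from your dataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for a 700-sample dataset: y = 2x + 1 plus a little noise.
x = torch.rand(700, 1)
y = 2 * x + 1 + 0.01 * torch.randn(700, 1)

# batch_size=64 is an arbitrary choice; the last batch holds the remainder.
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(5):
    for x_batch, y_batch in loader:
        # Each batch already has the shape [batch_size, in_features].
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
    print('epoch {}, last batch loss {:.4f}'.format(epoch, loss.item()))
```

With 700 samples and a batch size of 64, the loader yields 11 batches per epoch, the last one containing 60 samples.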

The NaN in your loss might come from another issue in your code. Could you check that there are no NaNs in your data?
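
One way to run that check, sketched on a small made-up DataFrame standing in for train.csv. Dropping the incomplete rows is just one possible way to handle them and may not suit every dataset.

```python
import pandas as pd
import torch

# Hypothetical stand-in for train.csv, with one missing y value.
train = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                      'y': [1.1, float('nan'), 2.9, 4.2]})

# Count missing values per column on the pandas side.
print(train.isna().sum())

# Drop incomplete rows before building the tensors.
clean = train.dropna()
x = torch.tensor(clean['x'].values, dtype=torch.float32).view(-1, 1)
y = torch.tensor(clean['y'].values, dtype=torch.float32).view(-1, 1)

# Confirm on the tensor side that no NaNs remain.
print(torch.isnan(x).any().item(), torch.isnan(y).any().item())
```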