Is there a PyTorch method for parsing data? If not, how do we "shape" or "format" data correctly?

I don't understand how exactly PyTorch feeds a set of data points to a given model.
Suppose I have a CSV file, 'train.csv', with two attributes, 'x' and 'y' (where we suppose 'x' is independent and 'y' is dependent), and there are 700 samples.

train = pd.read_csv('C:\\Users\\hgstr\\Jupyter_Files\\Data_Sets\\linear_regression\\train.csv')

This reads the data into a pandas DataFrame. Suppose I then put each attribute into its own tensor.

x = torch.tensor(train['x'].values, dtype=torch.float32)
y = torch.tensor(train['y'].values, dtype=torch.float32)

How do I 'reshape' or 'view' x and y so that I don't get a size mismatch or an unexpected data type?

The Tensor shape depends on the operation you would like to perform.
In a simple case we would like to create a linear model. The nn.Linear layer takes in_features input values per sample and produces out_features output values per sample.
Your tensor should have the dimensions [batch_size, in_features] to be fed to the model.
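
As a minimal sketch of that shape requirement, assuming a single feature and 700 samples as in the question (the random tensor is only a stand-in for a CSV column):

```python
import torch
import torch.nn as nn

# Stand-in for one column with 700 samples; real data would come from the CSV.
x = torch.randn(700)

# nn.Linear expects [batch_size, in_features], so add a feature dimension.
x = x.view(-1, 1)  # shape: [700, 1]

layer = nn.Linear(in_features=1, out_features=1)
out = layer(x)     # shape: [700, 1]
print(x.shape, out.shape)
```

Using -1 in view lets PyTorch infer the number of samples, so the same line works for datasets of any length.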

If you would like to use images for a CNN, your input tensors should be of dimension [batch_size, channel, height, width].
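
For the image case, a small sketch might look like this. The batch of 8 RGB images of size 32x32 and the layer settings are made-up values for illustration only:

```python
import torch
import torch.nn as nn

# Hypothetical batch: 8 RGB images of 32x32 pixels,
# laid out as [batch_size, channel, height, width].
images = torch.randn(8, 3, 32, 32)

# in_channels must match the channel dimension of the input.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
features = conv(images)
print(features.shape)
```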

Do you have a specific issue or size mismatch problem, or is it a general question?

To be specific, I have the following code:

import torch
import torch.nn as nn
import pandas as pd

class Linear_Reg(nn.Module):
    def __init__(self, inp_sz, out_sz):    
        super(Linear_Reg, self).__init__()
        self.linear = nn.Linear(inp_sz, out_sz)
        
    def forward(self, x):
        out = self.linear(x)
        return out
    
train = pd.read_csv('C:\\Users\\hgstr\\Jupyter_Files\\Data_Sets\\linear_regression\\train.csv')
test = pd.read_csv('C:\\Users\\hgstr\\Jupyter_Files\\Data_Sets\\linear_regression\\test.csv')

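# Convert each column to a float32 tensor
x_train = torch.tensor(train['x'].values, dtype=torch.float32)
y_train = torch.tensor(train['y'].values, dtype=torch.float32)

x_test = torch.tensor(test['x'].values, dtype=torch.float32)
y_test = torch.tensor(test['y'].values, dtype=torch.float32)

# Reshape to [num_samples, 1]; -1 infers the number of samples (700 here)
x_train = x_train.view(-1, 1)
y_train = y_train.view(-1, 1)
print(x_train.shape)
print(y_train.shape)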

#================================
input_sz = 1
output_sz = 1
epochs = 60
learning_rate = 0.001
#================================

model = Linear_Reg(input_sz, output_sz)
crit = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), learning_rate)

for e in range(epochs):
    
    opt.zero_grad()
    out = model(x_train)
    
    loss = crit(out, y_train)
    loss.backward()
    opt.step()
    
    print('epoch {}, loss {}'.format(e, loss.item()))

This is a simple linear regression program.
You will notice that I reshape the x_train and y_train tensors with the view function because I was getting a size mismatch before, but now I am getting a NaN loss.

The only attributes in train.csv are x and y, with 700 samples.

Also, could you explain how to calculate batch size?

You choose the batch size based on your data, model, and hardware.
If you are using BatchNorm layers in your model, your batch size should not be smaller than ~64, since smaller batches can make the running stats noisy, which can degrade performance.
This of course also depends on the data you are using.

Another limitation could be the hardware. If you have a large model and would like to train it on the GPU, you could run into out-of-memory errors with a large batch size.

Usually you would use mini-batches with a batch size greater than 1 and smaller than the size of your dataset.
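
One common way to get mini-batches is a DataLoader. This is only a sketch: the random tensors stand in for the real data, and batch_size=64 is an arbitrary choice, not a recommendation for this dataset.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in tensors with the same shapes as in the question: [700, 1]
x_train = torch.randn(700, 1)
y_train = torch.randn(700, 1)

dataset = TensorDataset(x_train, y_train)
# batch_size=64 is an arbitrary example value
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Collect the shape of every input batch to see how the data is split.
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(len(batch_shapes), batch_shapes[0], batch_shapes[-1])
```

With 700 samples and a batch size of 64, the last batch is smaller than the others unless drop_last=True is passed to the DataLoader.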

The NaN in your loss might come from another issue in your code. Could you check that there are no NaNs in your data?
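
A quick way to check could look like the sketch below. The small DataFrame is a made-up stand-in for train.csv with one missing value added on purpose, and dropping incomplete rows is just one possible way to handle them.

```python
import pandas as pd
import torch

# Hypothetical frame standing in for train.csv; one 'y' value is missing on purpose.
train = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [2.0, float('nan'), 6.0]})

# Count missing values per column before building tensors.
nan_counts = train.isna().sum()
print(nan_counts)

# Drop incomplete rows (one possible choice), then convert to tensors.
clean = train.dropna()
x = torch.tensor(clean['x'].values, dtype=torch.float32).view(-1, 1)
y = torch.tensor(clean['y'].values, dtype=torch.float32).view(-1, 1)

# Verify that the resulting tensors contain no NaNs.
has_nan = bool(torch.isnan(x).any() or torch.isnan(y).any())
print(has_nan)
```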
