Preprocessing float data CSVs for evaluation using Resnet

GreatScherzo · May 27, 2021, 2:49am

Hi! I’m currently working on evaluating float32 data in CSVs using Wide-Resnet.

Data Explanation

The CSV data also contain NaN values, which are represented with “-”. Unlike byte images, where values lie in 0-255, the data have no fixed min or max range.
Here are two snippets of part of the data.

Preprocessing Explanation

I read each CSV into a dataframe with the pandas library, converting “-” to NaN. I then converted the dataframe to a PIL float image and preprocessed it using torchvision.
The code below shows the preprocessing used.

Preprocessing Code

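import warnings

import numpy as np
import pandas as pd
import torch
import torchvision.transforms as T
from PIL import Image
from sklearn.preprocessing import MinMaxScaler

def __getitem__(self, idx):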
    # get path of csv file
    x = self.x[idx]

    # convert csv to dataframe
    x = pd.read_csv(x, delimiter=',', skiprows=5, na_values="-", dtype=np.float32)

    warnings.filterwarnings("ignore")
    x1 = x.to_numpy()

    # convert to image and apply resize and other preprocess
    x2 = self.transform_x(x1)

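    # pre-normalization: MinMaxScaler scales each column of this one
    # sample to [0, 1] independently; NaNs are ignored when fitting
    # and stay NaN in the output
    scaler = MinMaxScaler()
    x3 = scaler.fit_transform(x2)

    # replace NaN with 0, the lowest value after min-max scaling
    x3[np.isnan(x3)] = 0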
    warnings.filterwarnings("default")

    # Create a fake 3 channel tensor by copying and stacking the data
    x4 = T.ToTensor()(np.array(x3))
    x5 = torch.cat([x4, x4, x4], dim=0)

    x6 = T.Normalize(self.ImageNetNormParam["mean"],
                     self.ImageNetNormParam["std"],)(x5)

    # Tensor type seems to change to double along the way, so it
    # is once again reconverted to float
    x7 = x6.to(dtype=torch.float)

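    # y was never assigned in the posted snippet; this assumes the
    # labels live in self.y, parallel to self.x
    y = self.y[idx]

    # return the preprocessed tensor and its label
    return x7, y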

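    # The attributes below are set in __init__
    self.ImageNetNormParam = {"mean": [0.485, 0.456, 0.406],
                              "std": [0.229, 0.224, 0.225]
                              }
    self.resize = (416, 288)

    # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
    self.transform_x = T.Compose([T.ToPILImage(),
                                  T.Resize(self.resize, T.InterpolationMode.LANCZOS),
                                  ])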

Problem Statement

My questions are as below:

1

In my preprocessing, I copied and stacked my 1-channel tensor to create a fake 3-channel tensor so that Resnet could read it.
Is it OK to do this? Is there a more correct way to evaluate 1-channel data using Resnet? Are there pretrained architectures that accept 1-channel data?
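
To make the comparison concrete, here is a minimal sketch of the two routes. It does not load a real Resnet: `conv_rgb` is only a randomly initialised stand-in with the same shape as a Resnet first layer, and `conv_gray` is a hypothetical 1-channel replacement whose kernel is the sum of the three input-channel kernels. For a convolution without per-channel preprocessing, feeding three identical copies gives the same result as feeding one channel to the summed kernel.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pretrained first layer (3 input channels, 64 filters,
# 7x7 kernel, stride 2, no bias). Not loaded from any real checkpoint.
conv_rgb = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Hypothetical 1-channel layer: its kernel is the sum of the three
# input-channel kernels of conv_rgb.
conv_gray = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    conv_gray.weight.copy_(conv_rgb.weight.sum(dim=1, keepdim=True))

x = torch.randn(2, 1, 288, 416)                # batch of 1-channel inputs

with torch.no_grad():
    out_stacked = conv_rgb(x.repeat(1, 3, 1, 1))   # "fake 3 channel" route
    out_gray = conv_gray(x)                        # adapted 1-channel route

# Largest element-wise difference; only floating-point rounding remains.
max_diff = (out_stacked - out_gray).abs().max().item()
print(out_gray.shape, max_diff)
```

Note that the equivalence holds only when the three copies are identical at the point they enter the layer. In the preprocessing above, the ImageNet normalization is applied after stacking with a different mean and std per channel, so the three channels are no longer exact copies.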

2

In my preprocessing, I replaced the NaN values in the data with 0. However, this overlaps with actual 0 values in my data.
Is there a way to mask NaN data so that Resnet treats those values as invalid, or are there other ways to approach this?
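
One possible way to keep NaN positions distinguishable from real zeros is to carry a validity mask alongside the filled data. The sketch below is only an illustration of that idea, not a confirmed recipe for Resnet; `fill_and_mask` and `fill_value` are hypothetical names introduced here.

```python
import numpy as np

def fill_and_mask(arr, fill_value=0.0):
    """Return a (2, H, W) float32 array: filled data plus a validity mask.

    Channel 0 holds the data with NaN replaced by `fill_value`.
    Channel 1 is 1.0 where the original value was valid and 0.0 where it was NaN.
    """
    valid = ~np.isnan(arr)
    filled = np.where(valid, arr, fill_value).astype(np.float32)
    return np.stack([filled, valid.astype(np.float32)], axis=0)

# Tiny sample with one real zero and one NaN.
sample = np.array([[0.0, np.nan],
                   [1.5, -2.0]], dtype=np.float32)
out = fill_and_mask(sample)
print(out[0])   # filled data
print(out[1])   # validity mask
```

A network whose first layer expects 3 channels would need that layer adapted, or the mask could take the place of one of the duplicated channels. Either way, pretrained weights were not trained with such a mask, so this presumably requires fine-tuning to be useful.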

3

My normalization used ImageNet’s preset stdev and mean. However, my data isn’t the same kind of data as ImageNet’s. Is that OK, or should I calculate my own stdev and mean from my data?
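
For reference, computing a dataset-wide mean and stdev that ignores NaN could look like the sketch below. `dataset_mean_std` is a hypothetical helper, and the small arrays stand in for the CSV contents. It accumulates sums in one pass, so the files never need to be in memory together.

```python
import numpy as np

def dataset_mean_std(arrays):
    """Mean and (population) std over all non-NaN values in an iterable of arrays."""
    count, total, total_sq = 0, 0.0, 0.0
    for arr in arrays:
        vals = np.asarray(arr, dtype=np.float64)
        vals = vals[~np.isnan(vals)]          # drop missing values
        count += vals.size
        total += vals.sum()
        total_sq += np.square(vals).sum()
    mean = total / count
    # E[x^2] - E[x]^2, clamped at 0 to guard against rounding
    std = np.sqrt(max(total_sq / count - mean ** 2, 0.0))
    return mean, std

# Stand-ins for two CSV files; real data would come from pd.read_csv.
files = [np.array([[1.0, np.nan], [3.0, 5.0]]),
         np.array([[7.0, 9.0]])]
mean, std = dataset_mean_std(files)
print(mean, std)
```

The resulting values could then be passed to `T.Normalize([mean] * 3, [std] * 3)` for the stacked 3-channel tensor, assuming the statistics were computed on data that went through the same scaling as at load time.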

4

I used scikit-learn’s MinMaxScaler to scale my data to 0-1. However, I would like my data to be scaled not by each individual sample’s min and max, but by a min and max based on all the data. This is the same as scaling byte images, where every image has the same fixed range of 0-255.

However, a problem arises when I evaluate multiple batches of these data. For example, suppose that for the 1st batch of float CSV data I get a min and max of -5 and 3. I then use this as the standard and scale the 1st and 2nd batches with it. But for the 3rd batch, the min is -50, which is below the standard min. This would lead to improper scaling of the 3rd batch.

Can Resnet still read the improperly scaled batch? Or is there a way to properly scale float CSVs, which don’t have any fixed min-max range?
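
To illustrate the numbers in the example above, here is a small sketch of scaling with a range taken from the first batch versus a range collected over all batches in a first pass. `global_min_max` and `scale` are hypothetical helpers, the arrays are made-up stand-ins for the CSV data, and every array is assumed to hold at least one non-NaN value.

```python
import numpy as np

def global_min_max(arrays):
    """First pass: min and max over all non-NaN values of all arrays."""
    lo, hi = np.inf, -np.inf
    for arr in arrays:
        lo = min(lo, float(np.nanmin(arr)))
        hi = max(hi, float(np.nanmax(arr)))
    return lo, hi

def scale(arr, lo, hi, clip=True):
    """Second pass: map [lo, hi] to [0, 1]; NaN stays NaN."""
    scaled = (np.asarray(arr, dtype=np.float32) - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0) if clip else scaled

batches = [np.array([-5.0, 3.0]),
           np.array([0.0, 1.0, np.nan]),
           np.array([-50.0, 2.0])]

# Range taken from the 1st batch only: the 3rd batch lands far below 0.
first_only = scale(batches[2], -5.0, 3.0, clip=False)

# Range taken over all batches: every value lands in [0, 1].
lo, hi = global_min_max(batches)
with_global = scale(batches[2], lo, hi)
print(first_only, with_global)
```

With the first-batch range, -50 maps to (-50 + 5) / 8 = -5.625, well outside 0-1. If a full first pass over the data is not possible, clipping to a chosen range (the `clip=True` path) is one alternative, at the cost of flattening the out-of-range values.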

Thank you very much for taking your time in answering my questions.