Testing stopping at same row

Hello,

General question:

What should I look for if a model predicting on a test set stops at the same row every time?

Nothing looks wrong to me. No data is missing. The image it’s supposed to be pulling is there. I tried removing that row, but the row that took over the same index (a new observation, since the old one was removed) still gets held up.

I’m not sure how to troubleshoot this.

I would start by disabling all multi-processing and multi-threading, if they’re used in your current script.
If that doesn’t help, check your memory usage and make sure you are not close to running out of memory.
E.g. if you are using the CPU and have a memory leak (e.g. by storing the computation graph), your swap might be used, which can look like a hang but is in fact just very slow execution.
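For reference, a common source of such a leak is accumulating the loss tensor itself instead of its value, which keeps each iteration's computation graph alive. A minimal sketch (toy model and loop, not the original script):

```python
import torch

# Toy model and optimizer to illustrate the pattern
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(3):
    out = model(torch.randn(8, 4))
    loss = (out ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # store the Python float, not the tensor: appending `loss` itself
    # would keep the whole computation graph alive and grow memory
    losses.append(loss.item())
```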

Recently we had a potential hang in this forum, where the validation loop was executed after a certain number of training iterations, and was unfortunately slow and didn’t print out anything, so that it again looked like a hang.
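For what it's worth, disabling DataLoader multi-processing is a one-argument change; a sketch with a toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(12).float())
# num_workers=0 runs loading in the main process, so a hang or exception
# surfaces directly instead of being hidden inside a worker process
loader = DataLoader(ds, batch_size=4, num_workers=0)
batches = [batch for (batch,) in loader]
```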

Thanks. I’m not using multi-threading and I’m executing on GPU. I do have it print out which row it’s on. Is there anything else it can print so I can see it doing something? I let the test run for 4 days at one point on the same row.

Could you try to execute this particular line in isolation and check if it runs?
What kind of data are you dealing with and what do the rows represent?

I ran 15 rows with the delinquent row in the middle. It stopped again. I ran just the delinquent row by itself and it hangs. I ran my validation set through and it consistently hangs at row 68,040.

I exported the delinquent row to a .csv. The data is 6 numerical columns, all centered and scaled, and 8 categorical columns that are fed into a categorical embedding layer. Finally, there is one image per row, denoted in the data frame by the path to that image, and I verified the images exist.

Here is the custom dataset (the training ran just fine):

class image_Dataset(Dataset):
    '''
    image class data set   
    
    '''
    def __init__(self, data, transform = None):
        '''
        Args:
        ------------------------------------------------------------
            data = dataframe
            image = column in dataframe with absolute path to the image
            label = column in dataframe that is the target classification variable
            numerical_columns =  numerical columns from data
            categorical_columns = categorical columns from data
            policy = ID variable
            
        '''
        self.image_frame = data
        self.transform = transform
        
    def __len__(self):
        return len(self.image_frame)
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
         
        label = self.image_frame.loc[idx, 'target']
        
#         if self.image_frame.loc[idx, 'Roof'].isna() == True:
#             pic = np.ones(3, 224,224)
#             img = pic
#             image = torch.tensor(img)
#         else:
        pic = Path(self.image_frame.loc[idx, 'location'])
        img = Image.open(pic)
        image = self.transform(img)
        
        policy = self.image_frame.loc[idx, 'policy']
        
        numerical_data = self.image_frame.loc[idx, numerical_columns]
        
        numerical_data = torch.tensor(numerical_data, dtype = torch.float)
        
        for category in non_loca_cat_columns:
            self.image_frame[category] = self.image_frame[category].astype('category')
            
            self.image_frame[category] = self.image_frame[category].astype('category').cat.codes.values
        
            
        categorical_data = self.image_frame.loc[idx, non_loca_cat_columns]
        categorical_data = torch.tensor(categorical_data, dtype = torch.int64)
            
        return image, label, policy, categorical_data , numerical_data

So the script hangs just on this particular row.
Could you rip the loading and processing code out into a standalone script and see if you are able to load and process this particular sample?
I would recommend using an IDE and stepping through line by line to isolate the hanging part (or adding print statements to the script).
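One detail worth knowing here: `Image.open` is lazy, so a broken or truncated file often only fails when the pixels are actually decoded (e.g. inside the transform). Calling `img.load()` in a standalone script forces the read, which pins down that step. A hypothetical sketch (the one-row frame, path, and columns are placeholders, with a print before each step so the hanging step is visible):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from PIL import Image

# Hypothetical repro: create a tiny image on disk and a one-row frame
# mirroring the dataset's columns
tmp_dir = Path(tempfile.mkdtemp())
img_path = tmp_dir / "sample.png"
Image.fromarray(np.zeros((8, 8, 3), dtype=np.uint8)).save(img_path)

df = pd.DataFrame({"target": [1], "location": [str(img_path)], "policy": [42]})

idx = 0
print("opening image ...")
img = Image.open(df.loc[idx, "location"])
img.load()  # Image.open is lazy; force the actual file read here
print("image loaded:", img.size)
```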

Sorry for the late reply, I took my kids camping. I ran the functions outside of the data loader and they still hung at index 879. I don’t know why. I can see the values, I can see the picture. So I removed that record using test_hang = test.drop(test.index[-879]).

Now I get the following:

Traceback (most recent call last):

  File "C:\Users\JORDAN.HOWELL.GITDIR\Documents\GitHub\Inspection_Photo_Pytorch_Model\untitled0.py", line 555, in <module>
    for image, label, policy, categorical_data, numerical_data in test_loader_roof:

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
    data = self._next_data()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\_utils\fetch.py", line 47, in fetch
    return self.collate_fn(data)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\_utils\collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\_utils\collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\_utils\collate.py", line 81, in default_collate
    raise TypeError(default_collate_err_msg_format.format(elem_type))

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pandas.core.series.Series'>

pandas.DataFrame.drop should return a DataFrame (not sure why the error mentions a Series object), so you might need to convert it to an array (via .values) or tensor.
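One thing to note about the drop itself: `DataFrame.drop` keeps the original index labels, so after removing a row the frame has a gap, and `.loc[idx]` with the positional indices the sampler hands out no longer lines up. Resetting the index after the drop avoids that; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({"location": ["a.png", "b.png", "c.png", "d.png"]})
dropped = df.drop(df.index[2])            # index labels are now 0, 1, 3
# .loc[2] would now fail, while the DataLoader still produces 0..len-1,
# so reset the labels back to a contiguous 0, 1, 2
dropped = dropped.reset_index(drop=True)
```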

So when I type:

pic = Path(test_hang.loc[879:880, 'location'].values[0])
image = self.train_transform(Image.open(pic))

in the console, I get a transformed tensor back.

When I run what is in my custom data set, I get the following error:

File "<ipython-input-23-9f4883f14d32>", line 175, in <module>
    for image, label, policy, categorical_data, numerical_data in test_loader_roof:

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
    data = self._next_data()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\torch_env\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]

  File "<ipython-input-23-9f4883f14d32>", line 32, in __getitem__
    pic = Path(self.image_frame.loc[idx, 'location'].values[0])

AttributeError: 'str' object has no attribute 'values'

Here is my getitem once more:

class image_Dataset(Dataset):
    '''
    image class data set   
    
    '''
    def __init__(self, data, transform = None):
        '''
        Args:
        ------------------------------------------------------------
            data = dataframe
            image = column in dataframe with absolute path to the image
            label = column in dataframe that is the target classification variable
            numerical_columns =  numerical columns from data
            categorical_columns = categorical columns from data
            policy = ID variable
            
        '''
        self.image_frame = data
        self.transform = transform
        
    def __len__(self):
        return len(self.image_frame)
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

         
        label = self.image_frame.loc[idx, 'target']
        label = np.asarray(label)

        pic = Path(self.image_frame.loc[idx, 'location'].values[0])
        image = self.transform(Image.open(pic))
        
        policy = self.image_frame.loc[idx, 'policy']
        policy = np.asarray(policy)

        
        numerical_data = self.image_frame.loc[idx, numerical_columns]

        
        numerical_data = torch.tensor(numerical_data, dtype = torch.float)

        
        for category in non_loca_cat_columns:
            self.image_frame[category] = self.image_frame[category].astype('category')
            
            self.image_frame[category] = self.image_frame[category].astype('category').cat.codes.values
        
            
        categorical_data = self.image_frame.loc[idx, non_loca_cat_columns]
        categorical_data = np.asarray(categorical_data)

        return image, label, policy, categorical_data , numerical_data

The image seems to be working in the console but not in the custom dataset. Is the indexing right under __getitem__ coded correctly?

What is the difference between the run in the terminal and your custom dataset?
Apparently self.image_frame.loc[idx, 'location'] returns a str for the current idx.
I would recommend to check the image_frame in an IDE and make sure the location column contains the expected values, which can be properly indexed.
The previous error pointed to a pd.Series object, while now a str is returned, so maybe it would be easier to create a plain Python list containing all image paths?
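A sketch of that suggestion, with a hypothetical frame; a list is always indexed positionally, so there is no `.loc` label ambiguity:

```python
import pandas as pd

# hypothetical frame holding the image paths
df = pd.DataFrame({"location": ["a.png", "b.png", "c.png"]})
image_paths = df["location"].tolist()  # plain Python list of str

# __getitem__ can then index it positionally, always getting a single str
path = image_paths[1]
```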

Doesn’t the getitem have to have the

if torch.is_tensor(idx):
            idx = idx.tolist()

lines in it?

Disregard. I think I got it to work:

class image_Dataset(Dataset):
    '''
    image class data set   
    
    '''
    def __init__(self, data, transform = None):
        '''
        Args:
        ------------------------------------------------------------
            data = dataframe
            image = column in dataframe with absolute path to the image
            label = column in dataframe that is the target classification variable
            numerical_columns =  numerical columns from data
            categorical_columns = categorical columns from data
            policy = ID variable
            
        '''
        self.image_frame = data
        self.transform = transform
        
    def __len__(self):
        return len(self.image_frame)
    
    def __getitem__(self, idx):
    
        label = list(self.image_frame.loc[:, 'target'])

        #pic = Path(self.image_frame.loc[idx, 'location'])
        image = self.transform(Image.open(pathlib.Path(self.image_frame.loc[:, 'location'].values[0])))
        
        policy = list(self.image_frame.loc[:, 'policy'])
        
        numerical_data = np.asarray((self.image_frame.loc[:, numerical_columns]))

        
        #numerical_data = torch.tensor(numerical_data, dtype = torch.float)

        
        for category in non_loca_cat_columns:
            self.image_frame[category] = self.image_frame[category].astype('category')
            
            self.image_frame[category] = self.image_frame[category].astype('category').cat.codes.values
        
            
        categorical_data = np.asarray(self.image_frame.loc[:, non_loca_cat_columns])
        
        label = torch.FloatTensor(label)
        image = torch.FloatTensor(image)
        policy = torch.FloatTensor(policy)
        numerical_data = torch.FloatTensor(numerical_data)


        return image, label, policy, categorical_data , numerical_data
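For reference, note that the version above never uses `idx`: every call returns the whole `target`, `policy`, and numerical/categorical columns, so each batch element would contain the full dataset. A per-sample sketch that keeps `idx` is below. This is only a sketch, not the original script: the image-loading step is omitted so it runs without files, and the categorical codes are computed once in `__init__` instead of on every item:

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

class ImageRowDataset(Dataset):
    def __init__(self, data, numerical_columns, categorical_columns):
        # reset_index so the sampler's positional idx matches .iloc
        self.frame = data.reset_index(drop=True)
        self.numerical_columns = numerical_columns
        self.categorical_columns = categorical_columns
        # encode categoricals once, not on every __getitem__ call
        for col in categorical_columns:
            self.frame[col] = self.frame[col].astype("category").cat.codes

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]  # one row, selected by idx
        label = torch.tensor(row["target"], dtype=torch.float)
        numerical = torch.tensor(
            row[self.numerical_columns].to_numpy(dtype=np.float32))
        categorical = torch.tensor(
            row[self.categorical_columns].to_numpy(dtype=np.int64))
        return label, categorical, numerical
```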