Custom Dataset trying to read invalid idx

I currently have the following Dataset class:

import numpy as np
import torch

class CVRPGraphDataset(torch.utils.data.Dataset):
    def __init__(self, data_file, sparse_factor=-1):
        self.data_file = data_file
        self.sparse_factor = sparse_factor
        with open(data_file) as f:
            self.file_lines = f.read().splitlines()
        print(f'Loaded "{data_file}" with {len(self.file_lines)} lines')

    def __len__(self):
        return len(self.file_lines)

    def get_example(self, idx):
        line = self.file_lines[idx]
        line = line.strip()
        capacity = float(line.split()[0])
        # Extract points
        points = line.split(" points ")[1].split(" demands ")[0]
        points = points.split()

        # Extract demands
        demands = line.split(" demands ")[1].split(" output ")[0]
        demands = demands.split()
        
        if len(demands) != len(points) / 2:
            raise ValueError(f"Number of demands ({len(demands)}) differs from the number of points ({len(points) // 2})")
        
        points = np.array(
            [[float(points[i]), float(points[i + 1]), float(demands[i // 2])] for i in range(0, len(points), 2)]
        )
        
        # Extract route
        full_route = line.split(" output ")[1]
        full_route = full_route.split()
        # Convert the route tokens to ints (the trailing token is dropped)
        full_route = np.array([int(t) for t in full_route[:-1]])
        # Shift 1-indexed routes to 0-indexed
        if min(full_route) == 1:
            full_route -= 1
        
        return points, full_route

    def __getitem__(self, idx):
        points, route = self.get_example(idx)
        # Build a dense (N x N) adjacency matrix with edges between consecutive route nodes
        adj_matrix = np.zeros((points.shape[0], points.shape[0]))
        for i in range(route.shape[0] - 1):
            adj_matrix[route[i], route[i + 1]] = 1
        # return points, adj_matrix, route
        max_route_size = 2*points.shape[0] + 1
        pad_size = max_route_size - len(route)
        route = np.pad(route, pad_width=(0,pad_size), mode="constant", constant_values=-1)
        route_tensor = torch.from_numpy(route).long()
        result = (
            torch.LongTensor(np.array([idx], dtype=np.int64)),
            torch.from_numpy(points).float(),
            torch.from_numpy(adj_matrix).float(),
            route_tensor,
        )
        return result
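
For reference, here is a synthetic line consistent with the parsing in get_example above (the values are made up, not taken from my actual files) and how it goes through the dataset:

import tempfile

# Format inferred from get_example:
# "<capacity> points x1 y1 x2 y2 ... demands d1 d2 ... output r1 r2 ... <last token dropped>"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("30.0 points 0.5 0.5 0.1 0.2 0.9 0.3 demands 0 4 6 output 1 2 3 1 0\n")
    tmp_path = f.name

dataset = CVRPGraphDataset(tmp_path)
idx, points, adj_matrix, route = dataset[0]
print(points.shape)  # torch.Size([3, 3]) -> x, y, demand per node
print(route)         # tensor([ 0,  1,  2,  0, -1, -1, -1]) after the 1-index shift and padding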

After an update in a seemingly unrelated place (the test_step of my LightningModule), I got an IndexError at this line:

def get_example(self, idx):
        line = self.file_lines[idx]

because idx is greater than the size of my Dataset. My question is: How can I debug this?

The print in __init__ shows that the right number of samples was loaded from the files:

print(f'Loaded "{data_file}" with {len(self.file_lines)} lines')

I also printed the contents of my files and they are OK. My DataLoaders use the default sampler with 12 workers and batch_size=10 (I load 60/20/20 train/test/val samples).
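
The only other check I could think of is something like this sketch (the variable names are placeholders, and it assumes the loader exposes the standard torch.utils.data.DataLoader attributes); running with num_workers=0 also makes the traceback point at the failing call instead of at a worker process:

def debug_loader_indices(dataloader):
    ds = dataloader.dataset
    # 1) Read every valid index directly, outside the workers, so a bad index
    #    raises a clear traceback immediately.
    for i in range(len(ds)):
        ds[i]
    # 2) Confirm the sampler only yields indices inside range(len(ds)).
    for batch_indices in dataloader.batch_sampler:
        assert all(0 <= i < len(ds) for i in batch_indices), batch_indices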

What does len(dataset) return? Its __len__ will define the max. index used by the sampler.


I printed the len() in the __init__ of my model, and it seems correct: 60, 20, and 20.

    def __init__(self, param_args=None):
        super(CVRPModel, self).__init__(param_args=param_args, node_feature_only=False)

        self.train_dataset = CVRPGraphDataset(
            data_file=os.path.join(self.args.storage_path, self.args.training_split),
            sparse_factor=self.args.sparse_factor,
        )
        print(f"len train dataset = {len(self.train_dataset)}")

        self.test_dataset = CVRPGraphDataset(
            data_file=os.path.join(self.args.storage_path, self.args.test_split),
            sparse_factor=self.args.sparse_factor,
        )
        print(f"len test dataset = {len(self.test_dataset)}")

        self.validation_dataset = CVRPGraphDataset(
            data_file=os.path.join(self.args.storage_path, self.args.validation_split),
            sparse_factor=self.args.sparse_factor,
        )
        print(f"len val dataset = {len(self.validation_dataset)}")

Thanks for answering

In this case I wouldn't know what might be causing the issue, since the len() of the dataset will be used in the sampler (as seen here), unless you specify a custom num_samples when creating the sampler explicitly.
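
For reference, a minimal sketch of that behaviour (illustration only, not code from this thread): the default samplers only ever yield indices inside range(len(dataset)), and num_samples changes how many indices are drawn, not their range.

import torch
from torch.utils.data import TensorDataset, SequentialSampler, RandomSampler

dataset = TensorDataset(torch.arange(20))  # len(dataset) == 20

# Both default samplers derive their index range from len(dataset).
assert all(0 <= i < len(dataset) for i in SequentialSampler(dataset))
assert all(0 <= i < len(dataset) for i in RandomSampler(dataset))

# A custom num_samples changes how many indices are drawn, not their range.
sampler = RandomSampler(dataset, replacement=True, num_samples=50)
assert all(0 <= i < len(dataset) for i in sampler)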

Thank you.
I guess I found the issue. The validation DataLoader of my model was actually creating a subset of the validation dataset with a size parameter (self.args.validation_examples) that was greater than the validation dataset size.

def val_dataloader(self):
        batch_size = 1
        val_dataset = torch.utils.data.Subset(
            self.validation_dataset, range(self.args.validation_examples)
        )
        print("Validation dataset size:", len(val_dataset))
        val_dataloader = GraphDataLoader(
            val_dataset, batch_size=batch_size, shuffle=False,
            num_workers=self.args.num_workers
        )
        return val_dataloader
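
One way to avoid it (a sketch using the same attribute names as above) is to clamp the requested number of validation examples to the dataset size before building the Subset:

num_examples = min(self.args.validation_examples, len(self.validation_dataset))
val_dataset = torch.utils.data.Subset(
    self.validation_dataset, range(num_examples)
)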

The len() of the Subset was taken from the larger list of indices. Perhaps an error could be raised in the Subset class if the user passes an indices list that exceeds the dataset size. That would prevent errors like this one.

class Subset(Dataset[T_co]):
    r"""
    Subset of a dataset at specified indices.

    Args:
        dataset (Dataset): The whole Dataset
        indices (sequence): Indices in the whole set selected for subset
    """

    dataset: Dataset[T_co]
    indices: Sequence[int]

    def __init__(self, dataset: Dataset[T_co], indices: Sequence[int]) -> None:
        self.dataset = dataset
        self.indices = indices

Not necessarily, as the passed indices can contain duplicates (e.g. if you want to oversample a minority class), so raising an error based on the length alone would be too strict in these cases.
A check for valid indices, however, might be a good idea in case you want to contribute this.
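
Something like this minimal sketch (my own illustration, not the actual torch.utils.data.Subset code) would keep duplicates allowed while rejecting out-of-range indices:

import torch

class CheckedSubset(torch.utils.data.Subset):
    def __init__(self, dataset, indices):
        size = len(dataset)
        # Duplicates are fine (e.g. for oversampling); only the range is checked.
        invalid = [i for i in indices if not 0 <= i < size]
        if invalid:
            raise IndexError(
                f"indices {invalid[:10]} are out of range for a dataset of size {size}"
            )
        super().__init__(dataset, indices)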