Exploding memory in torch.utils.data.DataLoader.__getitem__ when using polars dataframes

Hi,

I've been having a recurring problem with the __getitem__ method of the Dataset I feed to my DataLoader, for a specific use case. I'm doing image segmentation on greyscale images: the images are stored as .parquet files and the bounding boxes as .csv files, and I use the polars library to open these files before handing them to torch as tensors.

My problem is that no matter what I do, memory usage explodes as a result of the polars.DataFrame.to_torch function. I have seen this issue: DataLoader num_workers > 0 causes CPU memory from parent process to be replicated in all worker processes · Issue #13246 · pytorch/pytorch · GitHub, but the behavior persists even with num_workers=0. Below is a snippet of my __getitem__ function, along with some relevant utility functions.

    def __getitem__(self, index: int) -> tuple[torch.Tensor, torch.Tensor]:
        # Cumulative sample counts per day, used to map the flat dataset
        # index onto a (day, within-day index) pair.
        expr = pl.cum_sum_horizontal(pl.all().exclude("sensor").sum())
        cum_sum = self.lens.select(expr).unnest("cum_sum")
        cum_sum = cum_sum.transpose(
            include_header=True, header_name="days", column_names=["cum_sum"]
        )
        # First day whose cumulative count exceeds the requested index.
        day_index = cum_sum.select((pl.col("cum_sum") > index).arg_max()).item()
        day = cum_sum.select(pl.col("days"))[day_index].item()
        # Offset of the sample within that day.
        adj_index = (
            index - cum_sum.select(pl.col("cum_sum"))[day_index - 1].item()
            if day_index != 0
            else index
        )
        path = self.paths.select(pl.col(day))[adj_index].item()
        next_path = self.paths.select(pl.col(day))[adj_index + 1].item()
        day_and_path = day + "/" + path
        day_and_next_path = day + "/" + next_path
        # Load two consecutive samples and their matched bounding boxes.
        samp, next_samp = (
            create_data_from_df(pl.read_parquet(self.data_path + day_and_path + ".parquet")),
            create_data_from_df(pl.read_parquet(self.data_path + day_and_next_path + ".parquet")),
        )
        label = create_diff_bboxes_from_df(
            pl.read_csv(self.label_path + day_and_path + "_match.csv"),
            pl.read_csv(self.label_path + day_and_next_path + "_match.csv"),
        )
        # Sigma-clipped difference image between the two consecutive samples.
        diff_samp = sigma_clip(next_samp - samp)
        return diff_samp, label

In this example, self.lens contains the number of samples for each sensor, organized by day, and self.paths contains all of the paths to the files (as strings) within the self.data_path and self.label_path parent directories. These all used to be pathlib.Path objects, but I replaced them with strings in case they were the cause, with no luck. Additionally, different sensors can have different numbers of samples, even between days, which is why the index needs to be adjusted.
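To make the index mapping concrete, here is a plain-Python sketch of the same day/offset lookup, with made-up per-day sample counts standing in for self.lens (no polars involved):

```python
# Sketch of the flat-index -> (day, within-day index) mapping described
# above; the day names and counts here are made up for illustration.
samples_per_day = {"day1": 3, "day2": 5, "day3": 2}

def locate(index: int) -> tuple[str, int]:
    running = 0
    for day, n in samples_per_day.items():
        if index < running + n:
            # The adjusted index is the offset within this day's samples.
            return day, index - running
        running += n
    raise IndexError(index)
```

So with these counts, flat indices 0-2 land in day1, 3-7 in day2, and 8-9 in day3.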

Relevant utility functions are below:

def _select_and_sort_coordinates(df: pl.DataFrame) -> pl.DataFrame:
    # Sort each coordinate pair so (x1, y1) is the min corner and
    # (x2, y2) is the max corner. (Note: the aliases were previously
    # swapped, naming the minimum "x2" and the maximum "x1".)
    return df.select(
        pl.min_horizontal(["x1", "x2"]).alias("x1"),
        pl.min_horizontal(["y1", "y2"]).alias("y1"),
        pl.max_horizontal(["x1", "x2"]).alias("x2"),
        pl.max_horizontal(["y1", "y2"]).alias("y2"),
    ).cast(pl.Float32)

def create_diff_bboxes_from_df(df1: pl.DataFrame, df2: pl.DataFrame) -> torch.Tensor:
    # The columns might not be sorted so we have to do that ourselves
    df1, df2 = _select_and_sort_coordinates(df1), _select_and_sort_coordinates(df2)
    overall = pl.concat([df1, df2], how="vertical")
    return overall.to_torch(dtype=pl.Float32)

def create_data_from_df(df: pl.DataFrame) -> torch.Tensor:
    return df.to_torch(dtype=pl.Float32).unsqueeze(dim=0)
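For reference, the transformation create_diff_bboxes_from_df is meant to perform is just a per-row min/max sort of each coordinate pair followed by stacking the two frames vertically; a dependency-free sketch with toy box values (rows as [x1, y1, x2, y2] lists, numbers made up):

```python
# Dependency-free sketch of the bbox sorting + vertical concat done by the
# polars helpers above; rows are [x1, y1, x2, y2] with toy values.
def sort_box(row):
    x1, y1, x2, y2 = row
    # Min corner first, max corner second.
    return [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)]

def diff_bboxes(rows1, rows2):
    # Equivalent of sorting both frames and concatenating vertically.
    return [sort_box(r) for r in rows1] + [sort_box(r) for r in rows2]
```

In the real code this list-of-lists would instead be the float32 tensor returned by overall.to_torch.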

Any help would be appreciated. Thank you!