hi,
I have been having a persistent problem with the __getitem__ method of a custom Dataset used with a DataLoader, for a specific use case. I am trying to do image segmentation on greyscale images. My images are stored as .parquet files and my bounding boxes as .csv files. I use the polars library to open these files before passing them off to torch as tensors.
my problem is that no matter what I do, memory usage explodes as a result of the polars.DataFrame.to_torch function. I have seen this issue: DataLoader num_workers > 0 causes CPU memory from parent process to be replicated in all worker processes · Issue #13246 · pytorch/pytorch · GitHub, but even with num_workers=0 the behavior persists. Below is my __getitem__ method, along with the relevant utility functions.
def __getitem__(self, index: int) -> tuple[torch.Tensor, torch.Tensor]:
    expr = pl.cum_sum_horizontal(pl.all().exclude("sensor").sum())
    cum_sum = self.lens.select(expr).unnest("cum_sum")
    cum_sum = cum_sum.transpose(
        include_header=True, header_name="days", column_names=["cum_sum"]
    )
    day_index = cum_sum.select((pl.col("cum_sum") > index).arg_max()).item()
    day = cum_sum.select(pl.col("days"))[day_index].item()
    adj_index = (
        index - cum_sum.select(pl.col("cum_sum"))[day_index - 1].item()
        if day_index != 0
        else index
    )
    path = self.paths.select(pl.col(day))[adj_index].item()
    next_path = self.paths.select(pl.col(day))[adj_index + 1].item()
    day_and_path = day + "/" + path
    day_and_next_path = day + "/" + next_path
    samp, next_samp = (
        create_data_from_df(pl.read_parquet(self.data_path + day_and_path + ".parquet")),
        create_data_from_df(pl.read_parquet(self.data_path + day_and_next_path + ".parquet")),
    )
    label = create_diff_bboxes_from_df(
        pl.read_csv(self.label_path + day_and_path + "_match.csv"),
        pl.read_csv(self.label_path + day_and_next_path + "_match.csv"),
    )
    diff_samp = sigma_clip(next_samp - samp)
    return diff_samp, label
in this example, self.lens contains the number of samples for each sensor, sorted by day. self.paths contains all of the file paths as strings, relative to the self.data_path and self.label_path parent directories. These all used to be pathlib.Path objects, but I eliminated those in case they were the cause, with no luck. Additionally, different sensors can have different numbers of samples, even between days, which is why the index needs to be adjusted.
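To make the index arithmetic easier to follow, here is a standalone sketch of the same global index → (day, adjusted index) lookup in plain Python. The day names and per-day counts are made-up placeholders, not my real sensor data:

```python
from itertools import accumulate

# Hypothetical per-day sample counts for one sensor (made-up numbers,
# standing in for what self.lens holds in my real code).
day_names = ["day_001", "day_002", "day_003"]
day_counts = [5, 3, 7]

# Running totals over days: a global index belongs to the first day
# whose cumulative count exceeds it.
cum_sum = list(accumulate(day_counts))  # [5, 8, 15]

def global_to_local(index: int) -> tuple[str, int]:
    day_index = next(i for i, c in enumerate(cum_sum) if c > index)
    # Subtract the running total of all earlier days to get the
    # within-day index.
    adj_index = index - cum_sum[day_index - 1] if day_index != 0 else index
    return day_names[day_index], adj_index

# Index 6 lands in day_002: the first 5 samples belong to day_001.
print(global_to_local(6))  # ("day_002", 1)
```

In my actual __getitem__, the cum_sum table is rebuilt from self.lens on every call rather than computed once like this.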
The relevant utility functions are below:
def _select_and_sort_coordinates(df: pl.DataFrame) -> pl.DataFrame:
    # Ensure (x1, y1) is the min corner and (x2, y2) the max corner.
    # (In my original snippet the min/max aliases were swapped, though the
    # resulting column order, and hence the tensor, was the same.)
    return df.select(
        pl.min_horizontal(["x1", "x2"]).alias("x1"),
        pl.min_horizontal(["y1", "y2"]).alias("y1"),
        pl.max_horizontal(["x1", "x2"]).alias("x2"),
        pl.max_horizontal(["y1", "y2"]).alias("y2"),
    ).cast(pl.Float32)
def create_diff_bboxes_from_df(df1: pl.DataFrame, df2: pl.DataFrame) -> torch.Tensor:
    # The coordinate columns might not be sorted, so we have to do that ourselves
    df1, df2 = _select_and_sort_coordinates(df1), _select_and_sort_coordinates(df2)
    overall = pl.concat([df1, df2], how="vertical")
    return overall.to_torch(dtype=pl.Float32)
def create_data_from_df(df: pl.DataFrame) -> torch.Tensor:
    return df.to_torch(dtype=pl.Float32).unsqueeze(dim=0)
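For completeness, this is the sort of minimal harness I can use to watch resident memory grow while looping over samples, entirely outside polars/torch. It is Unix-only via the resource module, and the bytearray allocations are just a stand-in for the tensors my loader returns, not my actual pipeline:

```python
import resource

def peak_rss_kib() -> int:
    # ru_maxrss is reported in KiB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kib()
# Hold ~50 MB of stand-in allocations alive, mimicking per-sample tensors
# that never get freed.
blobs = [bytearray(1_000_000) for _ in range(50)]
after = peak_rss_kib()
print(f"peak RSS grew by roughly {after - before} KiB")
```

Swapping the bytearray line for a real __getitem__ call shows the same kind of monotonic growth in my case.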
Any help would be appreciated here. Thank you!