A C++ preprocessor class is implemented with pybind11. The C++ class is imported in a customized dataset (`torch.utils.data.Dataset`), and one of its functions is called in `__getitem__`. When I try to use DDP for multi-process training, the error "cannot pickle preprocessor object" occurs. It works correctly if I use one GPU without DDP (num_workers > 2, num_batch > 2).
It seems DDP needs to pickle everything contained in the dataset class in order to share the dataloader between processes (link).
I also checked the official pybind11 documentation on pickling support: link
But there are no clues about how to pickle a general C++ class that holds pointers (e.g., used to fill numpy arrays).
Am I using DDP wrongly, or is there a solution that can fix this?
Hard to say without a reproducible example, but I would guess that you have custom pybind11 objects inside your Dataset. Try defining `__getstate__` and `__setstate__` functions for them. Also, in most cases (unless it is shared memory), passing memory pointers between processes will not work; you need to serialize the data itself.
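A minimal sketch of what this could look like on the C++ side, using the `py::pickle` helper from the pybind11 pickling docs. The `Preprocessor` class here is a hypothetical stand-in for yours (its members `window_size` and `buffer` are illustrative); the point is that the state tuple must contain copies of the underlying data, never raw pointers:

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>   // enables std::vector <-> Python list conversion
#include <vector>
#include <stdexcept>

namespace py = pybind11;

// Hypothetical stand-in for your preprocessor: owns data behind a pointer
// internally, but exposes it as a copyable std::vector for serialization.
struct Preprocessor {
    int window_size;
    std::vector<float> buffer;  // serialize the data, not a pointer to it
    explicit Preprocessor(int w) : window_size(w) {}
};

PYBIND11_MODULE(preproc, m) {
    py::class_<Preprocessor>(m, "Preprocessor")
        .def(py::init<int>())
        .def(py::pickle(
            // __getstate__: convert the C++ state into picklable Python objects
            [](const Preprocessor &p) {
                return py::make_tuple(p.window_size, p.buffer);
            },
            // __setstate__: rebuild the object from that tuple in the new process
            [](py::tuple t) {
                if (t.size() != 2)
                    throw std::runtime_error("Invalid state!");
                Preprocessor p(t[0].cast<int>());
                p.buffer = t[1].cast<std::vector<float>>();
                return p;
            }));
}
```

Once the class is picklable this way, each DataLoader worker and DDP process gets its own reconstructed copy of the object, which is exactly why the pointers themselves must not appear in the state: an address from one process is meaningless in another.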