I have been struggling with an issue involving torch multiprocessing. My training pipeline is made up of three classes: a Trainer, which defines how training should be carried out, and two model classes. The first is a base_model class that defines most of the boilerplate (training loops, saving, loading, etc.), and the second is the model architecture class, which inherits from the base class. I have been trying to implement a distributed training method in the base class, but I keep getting this error:
ValueError: bad value(s) in fds_to_keep.
Edit: I have found that the problem stems from taking the model's state_dict when my architecture is built. When the call is changed to
net.state_dict(keep_vars=True), the error goes away. Why would the default state_dict cause this issue?
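For context, here is a minimal sketch (using a toy nn.Linear module rather than my actual architecture) of the difference between the two calls. By default, state_dict() returns detached copies of the parameters, while keep_vars=True returns the live Parameter objects themselves:

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)

# Default: the returned tensors are detached from autograd,
# so they are plain Tensors, not the module's Parameters
sd = net.state_dict()
print(type(sd["weight"]))             # <class 'torch.Tensor'>
print(sd["weight"].requires_grad)     # False

# keep_vars=True: the returned entries are the Parameters themselves
sd_kv = net.state_dict(keep_vars=True)
print(type(sd_kv["weight"]))          # <class 'torch.nn.parameter.Parameter'>
print(sd_kv["weight"] is net.weight)  # True
```

So the failing version holds detached tensor objects taken at build time, whereas the working version holds references to the module's own parameters.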