I am trying to run a distributed training run using DDP and PyTorch Lightning, alongside the MosaicML streaming library. The goal is to stream shards of data directly from a GCP bucket into my dataloader(s). The process works perfectly when num_workers = 0, but as soon as I set the value any higher, I get this pickling error from the GCP client:
line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
line 1040, in __init__
    w.start()
line 121, in start
    self._popen = self._Popen(self)
line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
line 284, in _Popen
    return Popen(process_obj)
line 32, in __init__
    super().__init__(process_obj)
line 19, in __init__
    self._launch(process_obj)
line 47, in _launch
    reduction.dump(process_obj, fp)
line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
line 194, in __getstate__
    raise PicklingError(
_pickle.PicklingError: Pickling client objects is explicitly not supported.
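For context, the setup is roughly the sketch below (bucket path, cache directory, batch size, and worker count are placeholders, not my exact values):

```python
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Shards are written to the bucket ahead of time (e.g. with MDSWriter)
# and streamed down into a local cache directory at training time.
dataset = StreamingDataset(
    remote='gs://my-bucket/mds-shards',   # placeholder bucket/prefix
    local='/tmp/streaming_cache',         # placeholder local cache dir
    shuffle=True,
    batch_size=32,
)

# Works fine with num_workers=0; any value > 0 produces the traceback above.
loader = DataLoader(dataset, batch_size=32, num_workers=4)
```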
From what I understand (e.g. from Torch distributed and num_workers>0 pickle error - #5 by farakiko), the issue is that with num_workers > 0, PyTorch pickles the dataset to send it to each dataloader worker process, and the GCP storage client held by the dataset cannot be pickled. Is there a known workaround for this issue?
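To illustrate what I mean: for a plain map-style dataset, I believe the usual pattern is to create the GCS client lazily inside each worker process instead of in __init__, so it never has to be pickled. Below is a purely hypothetical sketch of that pattern (class and parameter names are mine, not my actual code), and it is not obvious to me how to apply it when the download logic lives inside StreamingDataset:

```python
from google.cloud import storage
from torch.utils.data import Dataset


class GCSShardDataset(Dataset):
    """Hypothetical dataset showing the lazy-client pattern: the
    google-cloud-storage Client is constructed inside each worker process
    on first use, so it is never part of the pickled dataset state."""

    def __init__(self, bucket_name: str, shard_keys: list[str]):
        self.bucket_name = bucket_name
        self.shard_keys = shard_keys
        self._client = None  # created lazily, per process

    @property
    def client(self) -> storage.Client:
        # Each worker process builds its own client after forking/spawning.
        if self._client is None:
            self._client = storage.Client()
        return self._client

    def __len__(self) -> int:
        return len(self.shard_keys)

    def __getitem__(self, idx: int) -> bytes:
        blob = self.client.bucket(self.bucket_name).blob(self.shard_keys[idx])
        return blob.download_as_bytes()
```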