dcp.save straight to cloud storage

Is there an example somewhere of doing dcp.save straight to cloud storage, like s3 or google cloud storage?

Curious, would these help?

Not sure, but this PR may help you.

That looks very promising, thanks for the link!

We tried this. The writer seems quite fast but the reader is quite slow, so we switched back to downloading the files from gs:// to node-local /tmp storage before loading.

More specifically, each global_rank (according to device_mesh) downloads

f"__{global_rank}_0.distcp"

from gs:// to /tmp, and every node downloads a copy of .metadata.
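A minimal sketch of that staging step, for anyone who wants to reproduce it. The names `shard_name` and `stage_for_rank` are hypothetical, and `fetch` is a placeholder: it defaults to a local file copy so the sketch runs anywhere, and would need to be swapped for whatever actually downloads from gs:// in your setup.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def shard_name(global_rank: int) -> str:
    # File name pattern quoted above: one shard file per saving rank.
    return f"__{global_rank}_0.distcp"


def stage_for_rank(
    src_dir: str,
    dst_dir: str,
    global_rank: int,
    fetch_metadata: bool,
    fetch: Callable[[str, str], object] = shutil.copyfile,
) -> List[Path]:
    """Copy this rank's shard (and optionally .metadata) into dst_dir.

    `fetch(src, dst)` is an assumption: replace it with a real GCS download.
    Pass fetch_metadata=True on only one rank per node so that .metadata is
    downloaded once per node.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    names = [shard_name(global_rank)]
    if fetch_metadata:
        names.append(".metadata")
    staged = []
    for name in names:
        target = dst / name
        # Join with "/" by hand so a "gs://bucket/prefix" source stays intact.
        fetch(f"{src_dir.rstrip('/')}/{name}", str(target))
        staged.append(target)
    return staged
```

With this layout, a node hosting global ranks 0-7 ends up with `__0_0.distcp` through `__7_0.distcp` plus `.metadata` in /tmp, which matches the situation described below.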

In the process, we found that even when the DeviceMesh of the saving job is identical to that of the loading job, global rank 7 raises an error like

file "__8_0.distcp" not found

(recall that files 0-7 were downloaded to this node’s /tmp)

If I understand correctly, rank 7 shouldn’t need this file in the identical mesh case.
It should only need __7_0.distcp.

Do I understand correctly?

Checks:

  • Verified that the metadata saved by GCSDistributedWriter and the baseline are identical!
  • Not yet clear to me whether this is a GCSDistributedWriter issue or a general dcp issue.

It’s not about the DeviceMesh but about how the model is sharded/parallelized. If the model is parallelized the same way at load time as it was at save time, then each rank should only need to access two things when loading: 1) the metadata file, and 2) the files saved by that same rank.
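To make that rule concrete, here is a toy model (not the dcp implementation). The dictionary `item_to_file` is a hypothetical, simplified stand-in for the shard-to-file mapping a checkpoint's metadata records, and `files_needed` is an illustrative helper, not a dcp API.

```python
from typing import Dict, Iterable, Set


def files_needed(requested_items: Iterable[str], item_to_file: Dict[str, str]) -> Set[str]:
    """Return the checkpoint files a rank must be able to read.

    Assumption: each saved shard lives in exactly one file, and a rank reads
    the file of every shard it requests, plus the metadata file.
    """
    needed = {".metadata"}  # every rank reads the metadata file
    for item in requested_items:
        needed.add(item_to_file[item])
    return needed


# Toy checkpoint: one weight split into one shard per saving rank (16 ranks).
saved = {f"weight/shard{r}": f"__{r}_0.distcp" for r in range(16)}

# Same sharding at load time: rank 7 requests only the shard it saved.
same = files_needed(["weight/shard7"], saved)

# Different sharding: rank 7's slice now overlaps data that rank 8 saved.
different = files_needed(["weight/shard7", "weight/shard8"], saved)
```

In this model `same` contains only `.metadata` and `__7_0.distcp`, while `different` also pulls in `__8_0.distcp`. So a "file not found" for another rank's file suggests the loading job is requesting at least one item that the saving job placed in a different rank's file.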