Why does PyTorch + Ray Tune save the data twice?

In the beginners' tutorial there is this code (the comments are mine):

        # makes a tmp dir, e.g. /tmp/adfgsdf, on Unix.
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            # the tmp dir and its contents are removed
            # when the outer context exits.
            with data_path.open("wb") as fp:
                pickle.dump(checkpoint_data, fp)
            
            # loads the data again and saves it persistently?
            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            train.report(
                {"loss": val_loss / val_steps, "accuracy": correct / total},
                checkpoint=checkpoint,
            )
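
For context, the tutorial reads this checkpoint back at the start of the training function. From memory it looks roughly like this (a reconstructed sketch, not copied verbatim; `net` and `optimizer` are the model and optimizer created earlier in that function):

    # restore side: fetch the last reported checkpoint, if any
    checkpoint = train.get_checkpoint()
    if checkpoint:
        # as_directory() materializes the checkpoint contents as a local directory
        with checkpoint.as_directory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with data_path.open("rb") as fp:
                checkpoint_state = pickle.load(fp)
            start_epoch = checkpoint_state["epoch"]
            net.load_state_dict(checkpoint_state["net_state_dict"])
            optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
    else:
        start_epoch = 0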

Why does PyTorch/Ray Tune save the data twice, once with pickle.dump and then again with train.report?


Reply from ChatGPT

The key part of ChatGPT's answer (if I paste the above description into it) is:

When Checkpoint.from_directory(checkpoint_dir) is called, Ray Tune encapsulates all the contents of checkpoint_dir (including the serialized data.pkl file) into a checkpoint object. This abstraction allows Ray Tune to integrate with its distributed training and experimentation framework seamlessly.

If you only use train.report to save checkpoints, Ray Tune might not have sufficient information to reconstruct or manage the experiment properly, especially in complex training pipelines.
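
If I read the Ray docs correctly, the Checkpoint object is basically a handle to whatever is in that directory, whose contents can be materialized again later. A small sketch of that (assuming a recent Ray where Checkpoint lives in ray.train; from_directory and to_directory are the documented constructor and accessor):

    from ray.train import Checkpoint

    # wrap an existing directory in a Checkpoint object
    ckpt = Checkpoint.from_directory("/tmp/my_checkpoint_dir")
    # copy the wrapped contents back out into a local directory
    local_copy = ckpt.to_directory()
    print(local_copy)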

I’m not very convinced by ChatGPT's explanation, but it may be correct. I find it hard to see why there isn’t a way to pass the checkpoint data directly to report(...).
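
Something like this hypothetical call is what I would have expected to be able to write (not a real Ray API, purely illustrative):

    # hypothetical -- NOT actual Ray code: pass the state dict straight to report
    train.report(
        {"loss": val_loss / val_steps, "accuracy": correct / total},
        checkpoint=checkpoint_data,  # instead of tempdir + pickle + from_directory
    )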