In the beginner's tutorial there is this code (comments are mine):
```python
# Makes a temporary dir, e.g. /tmp/adfgsdf on Unix.
with tempfile.TemporaryDirectory() as checkpoint_dir:
    data_path = Path(checkpoint_dir) / "data.pkl"
    # The tmp dir and its contents are removed after
    # the enclosing context exits.
    with data_path.open("wb") as fp:
        pickle.dump(checkpoint_data, fp)
    # Loads the data again, and saves it persistently?
    checkpoint = Checkpoint.from_directory(checkpoint_dir)
    train.report(
        {"loss": val_loss / val_steps, "accuracy": correct / total},
        checkpoint=checkpoint,
    )
```
Why does PyTorch/Ray Tune save the data twice, once with pickle.dump and then again with train.report?
Reply from ChatGPT

The key part of ChatGPT's answer (if I paste in the above description) is:
When Checkpoint.from_directory(checkpoint_dir) is called, Ray Tune encapsulates all the contents of checkpoint_dir (including the serialized data.pkl file) into a checkpoint object. This abstraction allows Ray Tune to integrate seamlessly with its distributed training and experimentation framework.
If you only use train.report to save checkpoints, Ray Tune might not have sufficient information to reconstruct or manage the experiment properly, especially in complex training pipelines.
I'm not very convinced by this explanation, but it may be correct. I find it hard to see why there isn't a way to pass the checkpoint data directly to report(...).
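My current reading is that the data is not actually serialized twice: pickle.dump writes the bytes once, Checkpoint.from_directory is just a lightweight wrapper around the directory path, and train.report is the step that copies that directory to persistent storage before the temp dir is deleted. Here is a stdlib-only sketch of that division of labor; MockCheckpoint and mock_report are hypothetical stand-ins I wrote for illustration, not Ray's actual implementation:

```python
import pickle
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-in for ray.train.Checkpoint: it only records the
# directory path; it does not re-serialize the data inside it.
class MockCheckpoint:
    def __init__(self, path):
        self.path = Path(path)

    @classmethod
    def from_directory(cls, path):
        return cls(path)

# Hypothetical stand-in for train.report: copying the checkpoint
# directory to persistent storage is what actually "saves" it.
def mock_report(metrics, checkpoint, storage_root):
    dest = Path(storage_root) / "checkpoint_000000"
    shutil.copytree(checkpoint.path, dest)
    return dest

storage = tempfile.mkdtemp()
with tempfile.TemporaryDirectory() as checkpoint_dir:
    data_path = Path(checkpoint_dir) / "data.pkl"
    with data_path.open("wb") as fp:
        pickle.dump({"epoch": 3}, fp)  # first "save": serialization
    checkpoint = MockCheckpoint.from_directory(checkpoint_dir)
    # second "save": persistence to storage outside the temp dir
    dest = mock_report({"loss": 0.1}, checkpoint, storage)

# The temp dir is gone now, but the reported copy survives.
with (dest / "data.pkl").open("rb") as fp:
    print(pickle.load(fp))  # → {'epoch': 3}
```

Under this reading the temp-dir-plus-from_directory dance is a handoff protocol (a directory of files, so a checkpoint can contain arbitrarily many artifacts), which would explain why report takes a Checkpoint object rather than raw data.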