Hey there,
I’d like to leverage the `meta` device to efficiently load large models (e.g. MPT-30B) onto my GPU, without first creating the model on the CPU and then copying the weights over. I’d also like to do quantization on the fly, so this seems like a necessary first step.
My understanding is that the `meta` device is meant for exactly this purpose, but I’m not sure how to proceed.
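For context, here is the rough pattern I have in mind. This is just a minimal sketch of my current understanding: `MyModel` stands in for the real architecture, `checkpoint.pt` is a placeholder path, and I’m assuming `to_empty()` plus a state-dict load is the right way to materialize the weights:

```python
import torch
import torch.nn as nn

# Stand-in for a real architecture like MPT-30B.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1024, 1024)

# Build the module on the meta device: parameters get shapes and
# dtypes but no storage, so this is cheap even for huge models.
with torch.device("meta"):
    model = MyModel()

assert model.linear.weight.is_meta

# Materialize empty (uninitialized) storage directly on the GPU,
# then fill it from a checkpoint -- no full CPU copy of the model.
model = model.to_empty(device="cuda")
state_dict = torch.load("checkpoint.pt", map_location="cuda")
model.load_state_dict(state_dict)
```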
I did trace through the `accelerate` implementation:
https://github.com/huggingface/accelerate/blob/main/src/accelerate/big_modeling.py#L79
but it’s unclear to me why the `init_on_device` context manager needs to do all of that work.
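From what I can tell, the core trick is intercepting `nn.Module.register_parameter` so that every parameter a module registers gets swapped for a `meta` copy. Here is my stripped-down sketch of that mechanism (my own reading, not `accelerate`’s actual code; I’m omitting buffer handling, the tensor-constructor patching, and error handling):

```python
from contextlib import contextmanager

import torch
import torch.nn as nn

@contextmanager
def init_on_meta():
    # Sketch of the core trick: intercept parameter registration so
    # every parameter a module creates is re-allocated on `meta`.
    old_register_parameter = nn.Module.register_parameter

    def register_parameter(module, name, param):
        old_register_parameter(module, name, param)
        if param is not None:
            # Replace the freshly created parameter with a meta copy,
            # preserving requires_grad.
            param = module._parameters[name]
            module._parameters[name] = nn.Parameter(
                param.to("meta"), requires_grad=param.requires_grad
            )

    try:
        nn.Module.register_parameter = register_parameter
        yield
    finally:
        nn.Module.register_parameter = old_register_parameter

# Usage: any module built inside the context has meta parameters.
with init_on_meta():
    model = nn.Linear(4096, 4096)
assert model.weight.is_meta
```

Is that roughly the right mental model, or am I missing something the extra machinery is there for?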
Is there a simple, clear implementation that I could trace through to understand all the steps properly?
Thanks!