Assume I have a tensor `a` output from a network, a float tensor of shape [100, 100]. I also have a preallocated CPU buffer `a_cpu` (a `float*`). I could call `a.cpu()` and then memcpy the data into that pointer, but that would copy the data twice. Is there a way to do the equivalent of `.cpu()` that writes directly into the preallocated memory?
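One common approach (a sketch, not an authoritative answer): wrap the existing buffer in a tensor that shares its memory, then use `copy_` to perform a single device-to-host copy straight into it. In Python you can wrap a NumPy array with `torch.from_numpy`; in C++/libtorch the analogous call is `torch::from_blob(a_cpu, {100, 100})` on your raw `float*`. The names `a` and `a_cpu` below stand in for the tensors in the question; the random `a` is a placeholder for the network output.

```python
import numpy as np
import torch

# Stand-in for the network output `a` (in practice this would live on the GPU,
# e.g. a = model(x) with a.device == "cuda").
a = torch.randn(100, 100)

# Preallocated CPU buffer, playing the role of the `a_cpu` float*.
a_cpu = np.empty((100, 100), dtype=np.float32)

# Wrap the existing memory as a tensor: no allocation, shares storage with a_cpu.
target = torch.from_numpy(a_cpu)

# One copy, directly into the preallocated buffer (no intermediate .cpu() tensor).
# With a CUDA source and a pinned-memory target you can pass non_blocking=True
# to make the device-to-host copy asynchronous.
target.copy_(a)
```

Because `target` shares storage with `a_cpu`, the data lands in the preallocated buffer after `copy_` returns; for truly asynchronous copies the destination memory should be page-locked (pinned).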