Yes. Since the PyTorch binaries ship with their own CUDA dependencies, you won’t need to install a local CUDA toolkit.
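To confirm which CUDA runtime and cuDNN versions a given PyTorch binary ships with, you could query the build itself. This is a minimal sketch, and the helper name `cuda_build_info` is hypothetical. Note that the binaries still rely on an NVIDIA driver installed on the system, which is not bundled with them.

```python
import torch


def cuda_build_info() -> dict:
    """Hypothetical helper: report the CUDA-related details of the installed PyTorch binary."""
    return {
        "torch": torch.__version__,
        # CUDA runtime version the binary was built with (None for CPU-only builds)
        "cuda_runtime": torch.version.cuda,
        # Bundled cuDNN version as an int (None if cuDNN is unavailable)
        "cudnn": torch.backends.cudnn.version(),
        # True only if a usable GPU and a compatible driver were found
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in cuda_build_info().items():
        print(f"{key}: {value}")
```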
If you moved the model and data to the GPU, their operations will be performed on the GPU. Tensors that remain on the CPU will still have their operations executed on the CPU.
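A small sketch of this placement rule, where the layer sizes and tensor names are arbitrary placeholders. It falls back to the CPU when no GPU is visible, so it also runs on CPU-only machines. Mixing a CUDA tensor with a (non-scalar) CPU tensor in a single operation raises an error rather than silently moving data, which is why the output is copied back explicitly before it is combined with CPU data.

```python
import torch
import torch.nn as nn


def run_demo():
    # Use the GPU if one is visible, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Model parameters and the input batch are moved to the same device
    model = nn.Linear(16, 4).to(device)
    x = torch.randn(8, 16, device=device)
    out = model(x)  # executed on `device`

    # A tensor that was never moved stays on the CPU and its ops run there
    cpu_tensor = torch.randn(8, 4)
    cpu_result = cpu_tensor * 2

    # Explicitly copy the output back before combining it with CPU data
    combined = out.detach().cpu() + cpu_result
    return device, out, cpu_result, combined


if __name__ == "__main__":
    device, out, cpu_result, combined = run_demo()
    print(out.device, cpu_result.device, combined.device)
```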
To your profiler question: I’m not familiar enough with the built-in profiler and use Nsight Systems instead, so you might want to give it a try as well.
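A minimal sketch of preparing a script for Nsight Systems, assuming a CUDA build of PyTorch. NVTX ranges from `torch.cuda.nvtx` label regions in the timeline, and the script could then be launched with e.g. `nsys profile -t cuda,nvtx -o my_report python train.py`. The file name `train.py`, the report name, and the `nvtx_range` and `train_steps` helpers are hypothetical; `nvtx_range` becomes a no-op without a GPU so the script still runs on the CPU.

```python
import contextlib

import torch
import torch.nn as nn


def nvtx_range(name):
    """Hypothetical helper: an NVTX range when a GPU is available, a no-op otherwise."""
    if torch.cuda.is_available():
        return torch.cuda.nvtx.range(name)
    return contextlib.nullcontext()


def train_steps(num_steps=3):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(32, 32).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(64, 32, device=device)

    losses = []
    for step in range(num_steps):
        # Each iteration shows up as a named range in the Nsight Systems timeline
        with nvtx_range(f"step {step}"):
            with nvtx_range("forward"):
                loss = model(x).pow(2).mean()
            with nvtx_range("backward"):
                optimizer.zero_grad()
                loss.backward()
            with nvtx_range("optimizer"):
                optimizer.step()
        # .item() synchronizes with the GPU, so queued kernels finish here
        losses.append(loss.item())
    return losses


if __name__ == "__main__":
    print(train_steps())
```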