Profiling GPU/CPU usage on an EC2 instance when training a PyTorch Lightning model

Hi, I'm interested in profiling GPU/CPU usage while training my model. Specifically, I want to profile my PyTorch Lightning DataModule and LightningModule. I'd like to figure out which variables take up the most memory and which functions can be optimized. A rough sketch of my current setup is below the questions.

  1. I know the vanilla PyTorch profiler can monitor data-loading code. Does the same apply to a Lightning DataModule?
  2. Are there other things I can look into to track memory usage?
  3. How do I visualize the profiler activity in TensorBoard?
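
For reference, here's roughly what I'm trying. This is just a minimal sketch: I'm assuming Lightning's `PyTorchProfiler` wraps `torch.profiler` and forwards extra keyword arguments (like `profile_memory` and `on_trace_ready`) to `torch.profiler.profile`, and `MyLightningModule` / `MyDataModule` are placeholders for my own classes.

```python
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.profilers import PyTorchProfiler

# Assumption: extra kwargs are passed through to torch.profiler.profile,
# so memory tracking and the TensorBoard trace handler can be enabled here.
profiler = PyTorchProfiler(
    dirpath="profiler_logs",
    filename="perf",
    profile_memory=True,  # track tensor allocations/deallocations
    on_trace_ready=torch.profiler.tensorboard_trace_handler("profiler_logs"),
)

trainer = Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=1,
    profiler=profiler,
)

# MyLightningModule and MyDataModule are placeholders for my actual code:
# trainer.fit(MyLightningModule(), datamodule=MyDataModule())
```

Is something along these lines the right direction, and would the traces written by the TensorBoard handler show up under the PyTorch Profiler plugin when I point TensorBoard at `profiler_logs`?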