Newbie questions - torch on Google Colab

I'm training a RAVE model and have some questions.
It's my first time using the library, and my first time doing ML training at all, so forgive my ignorance.

I got a Colab Pro plan just to avoid disconnections, but I still get quite a few.
Googling around about batch size and num_workers yields a few results, but I still don't fully understand them.

    1. The first warning I'm worried about is this one. Apparently Colab complains about the number of workers. Note that I'm running the Python script via !python train_rave.py:
UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
    2. Something odd is happening with my ckpts. They seem to load fine, but I get the following warning once I resume training after a disconnect:
model_checkpoint.py:343: UserWarning: The dirpath has changed from 'runs/bulgarianvoices/rave/version_0/checkpoints' to 'runs/bulgarianvoices/rave/version_1/checkpoints', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.
  f"The dirpath has changed from {dirpath_from_ckpt!r} to {self.dirpath!r},

I just want to check with experienced users whether this is normal or something I should worry about. RAVE trains on audio with a batch size of 8 and 8 workers.

Regarding Google Drive, I also have some doubts.

    1. Are previous ckpts necessary at all? Could I just remove previous runs and keep only the last.ckpt and its corresponding run? My Google Drive is filling up quite fast otherwise.

After training for a while I also get a “not available” message on the Colab connect button, where RAM and disk usage are normally displayed. The script keeps running, but it's interesting to me that even with a Pro account there isn't much reliability in that sense.

I would still like to use Google Drive to keep the ckpts, otherwise on every disconnect I would lose all the work. But I guess having everything run from a mounted Drive folder slows the process down.
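One approach I'm considering (just a sketch, assuming the standard Colab Drive mount and that the checkpoints end up under runs/ as in my runs; the backup path is made up): keep training on the local Colab disk and only copy the latest checkpoint over to Drive once in a while.

```python
import shutil
from pathlib import Path

from google.colab import drive

# Mount Google Drive once per session (standard Colab API).
drive.mount('/content/drive')

# Hypothetical locations: local run dir on the Colab VM, backup dir on Drive.
local_ckpt = Path('runs/bulgarianvoices/rave/version_0/checkpoints/last.ckpt')
drive_backup = Path('/content/drive/MyDrive/rave_backups')
drive_backup.mkdir(parents=True, exist_ok=True)

# Copy only the latest checkpoint instead of keeping the whole runs/ tree on Drive.
if local_ckpt.exists():
    shutil.copy2(local_ckpt, drive_backup / 'last.ckpt')
```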

Any tips or references to docs in this realm would be very appreciated. I don't have experience with torch, but I'm comfortable with Python, the system side, etc.

This is the project I run in Colab: https://github.com/acids-ircam/RAVE

tqdm==4.62.3
effortless-config==0.7.0
einops==0.4.0
librosa==0.8.1
matplotlib==3.5.1
numpy==1.20.3
pytorch-lightning==1.6.1
scikit_learn==1.0.2
scipy==1.7.3
soundfile==0.10.3.post1
termcolor==1.1.0
torch==1.11.0
tensorboard==2.8.0
GPUtil==1.4.0
git+https://github.com/caillonantoine/cached_conv.git@v2.3.5#egg=cached_conv
git+https://github.com/caillonantoine/UDLS.git#egg=udls

It's really hard to see the effect of changes made to the code since the waiting period is really long, so I would love to optimize it with Colab-specific practices. I'm thinking of running the code in cells instead of via !python.
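For example, something like this in a cell (just a sketch; I haven't checked how RAVE's argument parsing behaves when run this way, and I've left out the training flags):

```python
# Run the training script inside the notebook process instead of a `!python`
# subprocess, so its objects stay importable and inspectable afterwards.
%run train_rave.py
```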

This warning is raised by the DataLoader as it suggests a recommended number of workers based on the system resources here.
Try lowering the number to the suggested value and check if it yields a speedup.
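Roughly something like this (a sketch only; the dataset below is a stand-in for whatever RAVE builds internally):

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset just to make the example self-contained;
# RAVE constructs its own audio dataset.
audio_dataset = TensorDataset(torch.randn(64, 65536))

# Cap the worker count at what the machine actually exposes
# (Colab instances often report only 2 CPUs).
num_workers = min(8, os.cpu_count() or 1)

loader = DataLoader(
    audio_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=num_workers,
)
```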

This warning seems to be raised by Lightning here, and I don't know what might be causing it. Based on the message, I guess different “versions” are stored and the latest one is used to reload the model?

Could you explain what “version” refers to? If these correspond to e.g. the epoch, you could try saving only the last or best epoch to avoid the storage issue.
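If you can reach the Trainer setup in the script, limiting what Lightning keeps could look roughly like this (a sketch against pytorch-lightning 1.6; the monitored metric name is a placeholder):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only the single best checkpoint plus last.ckpt instead of one file per save.
checkpoint_cb = ModelCheckpoint(
    monitor="validation",  # placeholder: use whatever metric the script actually logs
    save_top_k=1,
    save_last=True,
)

trainer = Trainer(callbacks=[checkpoint_cb])
```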

Thanks for the input!

This warning seems to be raised by Lightning here, and I don't know what might be causing it. Based on the message, I guess different “versions” are stored and the latest one is used to reload the model?

Yes, and sometimes, I guess because of how Colab handles disconnects, I also get this one:

UserWarning: You're resuming from a checkpoint that ended before the epoch ended. This can cause unreliable results if further training is done. Consider using an end-of-epoch checkpoint or enabling fault-tolerant training: https://pytorch-lightning.readthedocs.io/en/stable/advanced/fault_tolerant_training.html
  "You're resuming from a checkpoint that ended before the epoch ended. This can cause unreliable"

Then I'm again unsure whether I'm acting correctly.
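If I read the linked docs correctly, fault-tolerant training in Lightning 1.6 is experimental and enabled through an environment variable, so I might try something like this in a Colab cell (untested sketch):

```python
# Experimental in pytorch-lightning 1.6: allows resuming mid-epoch after an interruption.
%env PL_FAULT_TOLERANT_TRAINING=1

# The subprocess started with ! inherits the environment variable set above.
!python train_rave.py
```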

This warning is raised by the DataLoader as it suggests a recommended number of workers based on the system resources here.
Try lowering the number to the suggested value and check if it yields a speedup.

I tried changing the batch size and workers in my own fork of the project and it didn't turn out so well, but it needs more testing.

Could you explain what “version” refers to? If these correspond to e.g. the epoch, you could try saving only the last or best epoch to avoid the storage issue.

That's exactly what I was thinking I could do, and thanks to your comment I now know I can do it without fear of losing anything in the final export to .ts.
My fear was that deleting previous ckpts would somehow affect the final result when exporting to .ts.
I would love to have a good GPU and ditch Colab for good, but at the moment it's all I can work with to experiment and learn.

I will also look for resources on understanding TensorBoard better, since I don't really have a clue what I'm doing and it all looks cryptic.
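For now I just load it inside the notebook (assuming the logs live under runs/ as in my setup):

```python
# Colab ships a TensorBoard extension; point it at the Lightning log directory.
%load_ext tensorboard
%tensorboard --logdir runs
```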

Getting around 2 million steps at the moment, with the mentioned limitations and perks, but at least I'm getting somewhere, I hope.

Thanks again for the input, @ptrblck!