I am new to PyTorch, and I am looking for some PyTorch coding conventions or best practices. PyTorch is fantastic in that it allows you a lot of freedom, but it can sometimes be challenging to find your way around someone else’s code when they have a completely different way of coding with PyTorch. There might also be some best practices to ensure your code runs as fast as possible.
I am thinking about something similar to Serialization semantics, but describing more straightforward cases, such as the contexts in which one should create a separate module to define part of a model, and how such a module should be structured.
The goal is that anyone who knows the coding conventions can easily find what they are looking for in the code and extend it in a way other people will be able to understand it quickly too.
I don’t think there are documents like that (at least not official ones), but as far as Python goes, I’d recommend following the PEP 8 style, as it mostly applies to PyTorch as well. There are a few PyTorch-specific things I’d append:
don’t use autograd if it’s not necessary (use with torch.no_grad() if possible)
only push tensors to the GPU if they are actually needed there
try to avoid loops over tensor dimensions (they slow things down)
try to free graphs as soon as possible (use detach or item whenever you can) to avoid memory leaks
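A minimal sketch of these four points in one place (the model, shapes, and names are made up for illustration):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 1).to(device)

x = torch.randn(32, 10)      # keep data on CPU until it is needed...
x = x.to(device)             # ...then push only the tensors you actually use

# Avoid Python loops over tensor dimensions; use vectorized ops instead.
# slow: sum(x[i] for i in range(x.size(0)))
row_sums = x.sum(dim=0)      # fast: a single vectorized reduction

# Use torch.no_grad() when gradients are not needed (e.g. validation):
with torch.no_grad():
    preds = model(x)

# Free the graph as soon as possible: record plain Python numbers, not tensors.
loss = ((model(x) - 1.0) ** 2).mean()
loss_history = [loss.item()]  # .item() returns a float and keeps no graph alive
```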
If there’s anything coming up to my mind, I’ll just edit this post.
PEP8 (PyTorch uses flake8 for coding style itself) is a good idea, as is the general Python Zen. Don’t write Python like people who don’t like Python.
However, a couple of the PyTorch-specific items I disagree with (in the top two below). My list would be something like:
If you need torch.no_grad() anywhere other than when evaluating something that was written for training, you should ask yourself whether you’re doing it wrong.
Be mindful of loops over tensor dimensions slowing things down. It’s conventional wisdom to avoid these, but there are quite a few legitimate cases for them in PyTorch. I have a half-written section “For loop or not for loop” discussing them somewhere.
Using item and detach for things you keep around longer than the next backward is generally a good idea (e.g. when you record loss history, statistics, …), but be careful not to ruin your graph. (Targeted detach is good in nn.Module subclass code; with torch.no_grad() should be needed very rarely.)
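To illustrate the difference between safely recording values and accidentally cutting the graph, a toy example:

```python
import torch

w = torch.ones(3, requires_grad=True)
x = torch.randn(3)

# Safe: record statistics as plain numbers; the graph stays intact for backward.
y = (w * x).sum()
history = [y.item()]        # float, holds no reference to the graph
y.backward()                # still works: .item() did not touch the graph

# Pitfall: detaching a tensor you still need gradients through cuts the graph.
z = (w * x).detach().sum()  # gradients will NOT flow back to w through z
```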
If you write for re-use, the functional / Module split of PyTorch has turned out to be a good idea.
Use functional for stuff without state (unless you have a quick and dirty Sequential).
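A sketch of what that split can look like in practice (the MLP here is just an illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # stateful parts (they hold parameters) are Modules
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        # stateless parts use the functional API
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# quick-and-dirty alternative: Module versions of stateless ops in a Sequential
mlp = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
```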
Don’t use deprecated stuff .data, Tensor(...) and friends, .type (might be me), t.new_.... It’s bad!
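For reference, the current counterparts look roughly like this (a sketch):

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0])  # instead of torch.Tensor([...]), which has
                                   # dtype surprises and size/data ambiguity

detached = t.detach()              # instead of t.data, which silently bypasses
                                   # autograd's correctness checks

t64 = t.to(torch.float64)          # instead of t.type(...); .to() handles dtype
                                   # and device uniformly

zeros = torch.zeros_like(t)        # explicit factory functions instead of the
                                   # old tensor-bound constructor style
```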
Use the documented PyTorch interfaces if you can (e.g. when something from torch.nn.functional shows up in torch for internal reasons).
Benchmark your stuff if you think it’s performance critical. Don’t forget CUDA needs synchronising for valid results.
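A simple timing helper along those lines (just a sketch; the synchronize calls are the important part):

```python
import time
import torch

def benchmark(fn, *args, warmup=3, iters=10):
    """Time fn, synchronizing CUDA so GPU measurements are valid."""
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure all kernels finished before stopping it
    return (time.perf_counter() - start) / iters

x = torch.randn(256, 256)
elapsed = benchmark(torch.matmul, x, x)
```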
The JIT will speed up chains of pointwise ops a lot.
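For example, a chain of pointwise ops like this tanh-based function (a made-up example) is a good fusion candidate for the JIT:

```python
import torch

def gelu_like(x):
    # a chain of pointwise ops: mul, add, tanh, ... all elementwise
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x * x * x)))

scripted = torch.jit.script(gelu_like)  # the JIT can fuse the pointwise chain

x = torch.randn(1000)
```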
C++ will be a bit faster than plain Python, but in many cases only by ca. 10%.
I agree with you on the first point. Maybe I didn’t express my intention very well. What I meant is that you could theoretically also validate/predict without no_grad. Of course you’re right, but I just assumed people know where they need gradients and where they don’t.
Regarding the second point on your list: sure there are some legitimate cases (LSTMs and stuff), but for most cases they can be avoided. I’ll look forward to your discussion on that!
I meant to delete the first sentence (and did now), sorry. It’s probably hard to be the first.
My more controversial / situation specific things:
Personally, I tend to copy code into one giant notebook and I think most configuration things (argparse etc) are terrible.
When you have all your stuff in def main(): and I change something and get an exception, I cannot use ipython -i foo.py to inspect the variables in main.
For me it’s kind of the other way round. I absolutely prefer splitting code in many files and packages, because this way it’s easier to avoid confusion (for me).
And I usually have a configuration file defining all the hyperparameters. This comes from the fact that I heavily parallelize jobs on a cluster (grid search etc.) and don’t want to set them manually. The config files are also copied to the directory containing the weights, to keep an overview of the trained configurations.
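One minimal way to do this, with a made-up config.json and directory layout (using a temp directory here just to keep the sketch self-contained):

```python
import json
import tempfile
from pathlib import Path

# hypothetical hyperparameter file content, e.g. config.json
config_text = '{"lr": 0.001, "batch_size": 64, "epochs": 10}'
config = json.loads(config_text)  # every run reads its hyperparameters from here

# copy the config next to the weights to keep an overview of trained runs
run_dir = Path(tempfile.mkdtemp()) / "experiment_0"
run_dir.mkdir(parents=True)
(run_dir / "config.json").write_text(config_text)
```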
I don’t like ipython or notebooks at all for GPU-related stuff, since you often need to restart the kernel to free the GPU memory.
While I appreciate that people want checklists and easy-to-follow instructions, and I’m sure that your checklist is as good as any other, checklists invariably mix obvious good things with suggestions of questionable merit. For example, I never give that 10% speedup estimate without context or the opportunity to ask for context, much like Justus’ advice on avoiding loops is quite right for a lot of cases, but it’s important to know when it’s not applicable. When you provide advice in checklist form, all that context is lost. Here you took a bunch of bullet points and left out even the few qualifications I put into that overly condensed form of discussion.
And that’s the crux: good style cannot be achieved by following checklists, much like you don’t gain much wit by buying a book of famous quotes. If there is a craft component to writing code, you need to learn it, possibly by studying what people who would know (say, Soumith) wrote, and trying to understand how it works and why it was written that way.
By following a checklist approach you get the code equivalent of how development processes look when large companies decide to do “agile” and implement it just by following an “agile checklist” someone gave them.
I agree with you. Nothing would ever replace the process of learning by reading and trying to understand what more advanced people have done. However, having some simple snippets to start with the simple tasks and then being able to understand how they work, is also an excellent way to learn too.
Hey Tom,
If you want to debug your code you should try the python debugger pdb (it took me a while to start using it but it’s a game changer).
python -m pdb foo.py (or %debug inside IPython) is your best friend in these cases.
Jupyter notebooks (and many IDEs) also support it pretty well. There are probably some good tutorials around…
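A quick sketch of the typical pdb entry points (the mean function here is made up):

```python
import pdb  # imported to show the entry points below

def mean(values):
    # placing breakpoint() here (Python 3.7+) or pdb.set_trace()
    # drops you into the debugger right before the suspicious line
    return sum(values) / len(values)

try:
    mean([])                 # raises ZeroDivisionError
except ZeroDivisionError:
    # in an interactive session, pdb.post_mortem() here lets you
    # inspect the frame where the exception was raised
    pass

result = mean([1, 2, 3])
```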
I fully agree with your points. The best way to learn is by doing.
During my learning process, I often struggled because of missing documentation or seeing different ways of doing the same thing. I learned a lot about how to use PyTorch more efficiently by spending hours studying repositories made by companies such as Nvidia or Facebook.
The goal of this style guide and best practices summary is just to help others and myself to learn from this journey.
Having been in the area of deep learning for quite a while, starting with TensorFlow, I saw how many people (myself included) struggled to get started on their custom projects. Tutorials teach you how to implement a specific model to solve a task, but unfortunately they don’t always tell you why a certain workflow or coding style/pattern helps you avoid mistakes and keep the project clean.
This is also one of the main reasons I like PyTorch. Being simpler and more intuitive for Python users, it just lets you learn faster and make fewer mistakes. In TensorFlow I was confused by the different model-building strategies scattered all over the place and by when to pick a particular one.