Overwhelmed by PyTorch and NLP

I’m not sure if this is a suitable forum for seeking general help and advice. I’m new to the field of applied DL in NLP. I’ve studied the basics of DL and implemented CNNs in Keras. I’ve started a project that involves trying out various NN architectures (LSTMs, Bi-LSTM-CNN-CRFs, Transformers/BERT, etc.) on news article data.

However, I’m overwhelmed by the many architectures to understand, and especially by the lengthy PyTorch code. There aren’t many online courses that teach how to implement advanced models in PyTorch, and I just can’t make sense of how to use the GitHub repos provided with research papers like “Attention is All you Need” or BERT (Devlin et al.).

Any help related to this would be highly appreciated. Thanks

Since I’m kind of in the same boat, here is just my personal opinion: At the end of the day, you simply have to bite the bullet. Deep learning and natural language processing are both complex topics on their own, and even more so when put together. Here are some pointers based on my personal experience:

  • Understand the basics: Many modern architectures simply say that “…we use attention…” as if it were a well-established concept. And maybe it is. Still, as a newcomer it’s important to go back and see where attention comes from. It’s often too easy to copy/paste a tutorial or other existing code that does something without really knowing what. I’m certainly guilty of that.

  • Be critical/skeptical of available code: With AI/DL/ML being such hot topics, everyone seems to be jumping on the bandwagon. The “problem” is that it’s actually very easy to get started. Install PyTorch, download a dataset, copy some tutorial code, and you quickly get first results. However, I’ve seen a lot of code on GitHub that was not correct. As long as the dimensions of the tensors you pass between layers match, no errors are thrown. However, that doesn’t mean the code is correct. For example, take a multilayer LSTM. A subsequent linear layer doesn’t care if you give it the hidden state of the first (wrong) or the last (correct) layer. Another common culprit is a “blind” usage of, e.g., view() just to get the dimensions right, no matter how.

  • Even peer-reviewed work is not perfect: The “publish or perish” culture in academia is alive and well, and again, AI/DL/ML is the topic of current times. I’ve read papers that are simply not doing what they promise – that’s not just armchair reasoning; I re-implemented and evaluated them. It’s also too easy to make established network models “just a bit more complex” and publish the results again.
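
To make the second point a bit more concrete, here is a minimal sketch of both pitfalls (the multilayer LSTM hidden state and a “blind” view()). All sizes, the toy classifier, and the variable names are my own assumptions for illustration; they are not taken from any particular repo. It assumes a unidirectional nn.LSTM with batch_first=True and no packed sequences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary toy sizes (assumptions for illustration only).
batch_size, seq_len, input_size, hidden_size, num_layers = 4, 7, 5, 3, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# output: (batch, seq_len, hidden), top layer's hidden state at every time step
# h_n:    (num_layers, batch, hidden), final hidden state of every layer
output, (h_n, c_n) = lstm(x)

first_layer = h_n[0]   # final hidden state of the FIRST layer
last_layer = h_n[-1]   # final hidden state of the LAST (top) layer

# Both slices have shape (batch, hidden), so a linear layer accepts either one
# without complaint. Only the shapes are checked, not the meaning.
classifier = nn.Linear(hidden_size, 2)  # hypothetical 2-class head
logits_from_first = classifier(first_layer)
logits_from_last = classifier(last_layer)

# Sanity check: the top layer's final state equals the last time step of output.
last_matches_output = torch.allclose(last_layer, output[:, -1, :])
first_matches_output = torch.allclose(first_layer, output[:, -1, :])

# Pitfall 2: view() versus transpose() when swapping two dimensions.
# Pretend this is (seq_len=3, batch=2): rows are time steps, columns are samples.
seq_first = torch.arange(6).reshape(3, 2)
viewed = seq_first.view(2, 3)            # right shape, but mixes up the samples
transposed = seq_first.transpose(0, 1)   # right shape, each row is one sample

print(logits_from_first.shape, logits_from_last.shape)
print(last_matches_output, first_matches_output)
print(viewed)
print(transposed)
```

Both sets of logits come out with the same shape and no error is raised, which is exactly why this kind of bug goes unnoticed. The same holds for view(): it only reinterprets the underlying memory in a new shape, whereas transpose() (or permute()) actually swaps the axes.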

In short, feeling overwhelmed is natural. New architectures – or at least sold as such – are presented almost weekly.


If you are new to PyTorch, there are a lot of tutorials to understand the basics and give some models a try. Some tutorials may be useful here.