How can I make a binary classification RNN/LSTM focus only on "malicious" words to make the model more robust? (Non-negative model)

Some work makes a model robust against evasion attacks by forcing it to “focus” only on malicious features, so that adding benign features does not affect the model's output and only adding malicious content can change its prediction.

More on this:

https://towardsdatascience.com/evading-machine-learning-malware-classifiers-ce52dabdb713 (see the part on Non-Negative MalConv)

Non-Negative MalConv was constrained during training to have
non-negative weight matrices. The point of doing this is to prevent
trivial attacks like those created against MalConv. When done
properly, the non-negative weights make binary classifiers monotonic;
meaning that the addition of new content can only increase the
malicious score. This would make evading the model very difficult,
because most evasion attacks do require adding content to the file.
Fortunately for me, this implementation of Non-Negative MalConv has a
subtle but critical flaw.
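
To make the quoted idea concrete, here is a minimal sketch of a non-negative weight constraint on a toy bag-of-words linear classifier. It is not the MalConv code: the class name `NonNegLinearClassifier`, the toy data, and the choice to clamp the weights after every optimizer step are all assumptions made for illustration. The bias is left unconstrained, so it is the only place where "benign" evidence can live.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class NonNegLinearClassifier(nn.Module):
    """Hypothetical toy model: word counts -> malicious logit."""

    def __init__(self, vocab_size):
        super().__init__()
        self.linear = nn.Linear(vocab_size, 1)

    def forward(self, counts):
        # Returns a raw logit; apply sigmoid to get a probability.
        return self.linear(counts).squeeze(-1)

    def clamp_weights(self):
        # Project the weights back onto the non-negative range (bias untouched).
        with torch.no_grad():
            self.linear.weight.clamp_(min=0.0)


vocab_size = 6
model = NonNegLinearClassifier(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

# Toy data (assumption): columns 0-2 are "malicious" words, 3-5 are "benign".
x = torch.tensor([[2., 1., 0., 0., 1., 0.],
                  [0., 0., 0., 3., 2., 1.],
                  [1., 0., 2., 1., 0., 0.],
                  [0., 0., 0., 0., 1., 4.]])
y = torch.tensor([1., 0., 1., 0.])

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    model.clamp_weights()  # enforce the constraint after every update

# Pad a malicious sample with many benign words and compare the logits.
sample = torch.tensor([[2., 1., 0., 0., 0., 0.]])
padded = sample.clone()
padded[0, 3:] += 50.0
logit_before = model(sample).item()
logit_after = model(padded).item()
print(logit_before, logit_after)
```

Because every weight is non-negative and word counts are non-negative, adding words can only add non-negative terms to the logit, so `logit_after` cannot be lower than `logit_before` in this toy setting. Whether the same argument carries over to a gated recurrent model is exactly what is unclear to me.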

My question is: how can I implement this in an RNN/LSTM using PyTorch?

My model currently takes a sequence of words and predicts whether the sentence is malicious or not. Right now, adding a lot of benign words (by benign I mean words that appear in many benign sentences) is enough to evade the model.

Basically, I want my model to learn to predict based only on malicious words, meaning words that mostly appear in malicious samples, so that adding benign words that appear in many benign sentences does not change the prediction for the sentence.
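
The property described above can be written down as a small check. The sketch below is only an illustration under assumptions: `LSTMClassifier` is a hypothetical stand-in (embedding, LSTM, linear head) rather than the actual model, the token ids are made up, and `padding_effect` is a helper name invented here. It appends random benign tokens to a sentence and reports the malicious logit before and after.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class LSTMClassifier(nn.Module):
    """Hypothetical stand-in model: embedding -> LSTM -> linear head."""

    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq, embed)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (1, batch, hidden)
        return self.head(h_n[-1]).squeeze(-1)     # one malicious logit per sentence


def padding_effect(model, token_ids, benign_ids, n_pad):
    """Return (logit before, logit after) appending n_pad random benign tokens."""
    model.eval()
    pad = benign_ids[torch.randint(len(benign_ids), (n_pad,))]
    padded = torch.cat([token_ids, pad]).unsqueeze(0)
    with torch.no_grad():
        before = model(token_ids.unsqueeze(0)).item()
        after = model(padded).item()
    return before, after


# Made-up ids (assumption): 0-49 are "malicious" words, 50-99 are "benign".
model = LSTMClassifier(vocab_size=100)
sentence = torch.tensor([5, 17, 42, 8])
benign_ids = torch.arange(50, 100)

before, after = padding_effect(model, sentence, benign_ids, n_pad=20)
is_robust = after >= before  # the property I want to hold for any benign padding
print(before, after, is_robust)
```

For an unconstrained LSTM there is no reason for `is_robust` to be true in general; the goal is a model for which it holds by construction, not just on the samples that happen to be tested.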

How can I implement this? Is it possible?