Is there a common way of finding feasible word compositions?

I am working on vision but trying multi-modality right now. I need to figure out whether a given word composition occurs in everyday contexts (e.g., "ripe cat" would never occur, but "wet cat" would). I was thinking of finding bigram and trigram possibilities within some large NLP corpus but have no idea where to begin. Any related topics or paper suggestions are highly appreciated.

It sounds like you're looking for n-gram language models – e.g., for bigrams: what is the probability that the word "ripe" is followed by the word "cat"? And yes, it basically requires going through a large corpus and counting n-grams. I have two basic Jupyter notebooks here and here.
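For a concrete starting point, here is a minimal sketch of the counting step. It uses NLTK's Brown corpus and the names `bigram_prob`, `unigram_counts`, and `bigram_counts` purely for illustration; any large tokenized corpus would work the same way:

```python
# Minimal bigram-counting sketch (assumes NLTK's Brown corpus as an example).
from collections import Counter

import nltk
nltk.download("brown")
from nltk.corpus import brown

# Lowercase tokens so "Cat" and "cat" are counted together.
tokens = [w.lower() for w in brown.words()]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("wet", "cat"))   # probably small but non-zero
print(bigram_prob("ripe", "cat"))  # likely exactly 0 in this corpus
```

A probability of exactly zero here just means the pair never appeared in this particular corpus, which leads directly to the smoothing point below.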

Keep in mind that most language models perform smoothing to avoid zero probabilities. So you need to find some notion of which probability is low enough to be considered "does not occur in everyday context". For example, maybe a non-native speaker has translated "fertile cat" as "ripe cat" :).
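To make that concrete, here is a hedged sketch on top of the counts from the previous snippet, using add-one (Laplace) smoothing plus a cutoff. The threshold value is purely an assumption; you would need to calibrate it on word pairs you already know to be feasible or infeasible:

```python
# Add-one (Laplace) smoothing over the unigram/bigram counts above,
# plus a hypothetical cutoff for "does not occur in everyday context".
V = len(unigram_counts)  # vocabulary size

def smoothed_bigram_prob(w1, w2):
    """Laplace-smoothed estimate of P(w2 | w1); never exactly zero."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

THRESHOLD = 1e-7  # arbitrary assumption; calibrate on known good/bad pairs

def is_feasible(w1, w2):
    return smoothed_bigram_prob(w1, w2) >= THRESHOLD

print(is_feasible("wet", "cat"))
print(is_feasible("ripe", "cat"))
```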


Thank you! This is what I've been playing with for the last few days, but I couldn't find any large corpus that covers everything I want with good generalizability. That's what I'm struggling with right now.

Welcome to the tedious 80% of machine learning / deep learning: collecting, cleaning, processing, and preparing good data :).