How to do multi-label classification with TorchText?

wgpubs · December 26, 2017, 6:39pm

I can’t figure out how to properly set up a Field object for multi-label classification with torchtext. Here is what I have in my dataset class:

… where lbl is a one-hot-encoded (OHE) NumPy array (e.g., [0, 1, 0, 0, 1, 1, 0])

My torchtext field object is defined like this:

tt_LABEL = data.Field(sequential=False, use_vocab=False)

But when I try to package everything up into a BucketIterator and get a mini-batch, I get the following exception:

only length-1 arrays can be converted to Python scalars

The error is on line 294 of field.py:
294 arr = [numericalization_func(x) for x in arr]
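
For readers hitting the same exception, here is a minimal sketch of what is probably going on. It does not use torchtext; it only imitates the list comprehension quoted above, under the assumption that for a non-sequential field with use_vocab=False the numericalization function behaves like Python's built-in int(). The helper name `numericalize` is chosen here for illustration.

```python
# Simplified stand-in for the list comprehension at line 294 of field.py.
# Assumption: for a non-sequential, use_vocab=False label field the
# numericalization function behaves like the built-in int().
numericalization_func = int


def numericalize(arr):
    """Convert every raw label in the batch with the scalar converter."""
    return [numericalization_func(x) for x in arr]


# One scalar label per example converts cleanly.
scalar_batch = [0, 1, 1]
print(numericalize(scalar_batch))

# One label vector per example does not: int() cannot convert a sequence.
vector_batch = [[0, 1, 0, 0, 1, 1, 0], [1, 0, 0, 0, 0, 0, 1]]
try:
    numericalize(vector_batch)
except TypeError as err:
    print("TypeError:", err)
```

Plain lists are used here so the sketch has no dependencies, which means the error text differs from the one in the post. With a multi-element NumPy array in place of each list, the same int() call should raise the "only length-1 arrays can be converted to Python scalars" message. Either way, the field expects one scalar per example and is being handed a vector.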

hiromi · January 22, 2018, 9:26pm

@wgpubs, did you ever find a workaround?

wgpubs · January 27, 2018, 7:28pm

Hey @hiromi! I remember ya from fastai.

Here is the code I’m using for the toxic comp. I’d appreciate any feedback on the good and the ugly of it, and on what can be improved. Hope this helps:

https://gist.github.com/ohmeow/5b3543a5115040001fce59a105ac4269

toxic.py

import torchtext

class TextMultiLabelDataset(torchtext.data.Dataset):
    def __init__(self, df, tt_text_field, tt_label_field, txt_col, lbl_cols, **kwargs):
        # One text field plus one label field per label column
        fields = [('text', tt_text_field)]
        for lbl_col in lbl_cols:
            fields.append((lbl_col, tt_label_field))

        # A DataFrame without the label columns is treated as the test set
        is_test = lbl_cols[0] not in df.columns
        n_labels = len(lbl_cols)

        examples = []

(The gist preview is truncated here; the full file is at the link above.)
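
The visible part of the gist registers one label field per label column, so each field only ever sees a scalar. The rest of the file is not shown here, so what follows is not the gist's code. It is a minimal plain-Python sketch of that idea under stated assumptions: the label names in `LABEL_COLS` and the helpers `split_labels` and `stack_batch` are hypothetical.

```python
from typing import Dict, List, Sequence

# Illustrative label names; the post does not list the real columns.
LABEL_COLS = ["label_a", "label_b", "label_c"]


def split_labels(vector: Sequence[int], lbl_cols: Sequence[str]) -> Dict[str, int]:
    """Turn one multi-hot vector into one scalar per named label column."""
    if len(vector) != len(lbl_cols):
        raise ValueError("vector length must match the number of label columns")
    return {name: int(value) for name, value in zip(lbl_cols, vector)}


def stack_batch(examples: Sequence[Dict[str, int]],
                lbl_cols: Sequence[str]) -> List[List[int]]:
    """Rebuild a (batch_size x n_labels) target matrix from per-column scalars."""
    return [[ex[name] for name in lbl_cols] for ex in examples]


# Each example now carries scalar labels, which a scalar field can handle.
examples = [split_labels(v, LABEL_COLS) for v in ([0, 1, 1], [1, 0, 0])]

# At batch time the columns are stacked back into one multi-label target.
targets = stack_batch(examples, LABEL_COLS)
print(targets)  # [[0, 1, 1], [1, 0, 0]]
```

In a real training loop the stacked targets would typically be converted to a float tensor before being passed to a multi-label loss such as torch.nn.BCEWithLogitsLoss.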

hiromi · January 30, 2018, 2:23am

Awesome! Thanks for the example!! You’re way ahead of me.

hiromi · February 14, 2018, 4:55am

I’m currently trying to see if I can get data.TabularDataset to work kind of like this one:

My brain is too tired to keep going tonight, but I will get back to it tomorrow.

hiromi · February 21, 2018, 3:38am

@wgpubs, I’ve tried many things, and your implementation is the best and cleanest!!!