Customize torchtext dataset for translation

keyu_Chen · September 13, 2021, 9:34pm

I am new to Torchtext and have recently been trying to train a translation model on the OPUS dataset (https://opus.nlpl.eu/). However, Torchtext can only read CSV/JSON/TSV files, and the OPUS datasets are not in any of those formats. I am struggling with this and can barely find any resources on how to solve it.

I converted some .txt files to .csv and tried to build a custom dataset, but an error occurred.
Here is the code and the resulting traceback:
import spacy
from torchtext.data import Field, TabularDataset

# Load the spaCy pipelines used for tokenization
spacy_ger = spacy.load("de_core_news_sm")
spacy_eng = spacy.load("en_core_web_sm")

def tokenize_ger(text):
    return [tok.text for tok in spacy_ger.tokenizer(text)]

def tokenize_eng(text):
    return [tok.text for tok in spacy_eng.tokenizer(text)]

german = Field(tokenize=tokenize_ger, lower=True)
english = Field(tokenize=tokenize_eng, lower=True)

source_data, target_data = TabularDataset.splits(path='./',
                                                 train='src-test.csv',
                                                 test='tgt-test.csv',
                                                 format='csv',
                                                 fields=(english, german))

    source_data, target_data = TabularDataset.splits(path='./',

  File "E:\Work\anaconda\envs\con_38\lib\site-packages\torchtext\data\dataset.py", line 77, in splits
    train_data = None if train is None else cls(

  File "E:\Work\anaconda\envs\con_38\lib\site-packages\torchtext\data\dataset.py", line 271, in __init__
    examples = [make_example(line, fields) for line in reader]

  File "E:\Work\anaconda\envs\con_38\lib\site-packages\torchtext\data\dataset.py", line 271, in <listcomp>
    examples = [make_example(line, fields) for line in reader]

  File "E:\Work\anaconda\envs\con_38\lib\site-packages\torchtext\utils.py", line 130, in unicode_csv_reader
    csv.field_size_limit(sys.maxsize)

OverflowError: Python int too large to convert to C long

ptrblck · September 14, 2021, 3:36am

You could try to use this suggestion.
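
For context, the traceback ends in csv.field_size_limit(sys.maxsize). On 64-bit Windows a C long is only 32 bits wide, so sys.maxsize does not fit and Python raises the OverflowError. One commonly used workaround (not necessarily the suggestion linked above) is to shrink the requested limit until the csv module accepts it. The sketch below uses only the standard library; the function name set_largest_csv_field_limit is made up for illustration and is not part of torchtext.

```python
import csv
import sys


def set_largest_csv_field_limit():
    """Set the csv field size limit to the largest value the platform accepts.

    Hypothetical helper (not a torchtext API). Starts from sys.maxsize and
    divides by 10 each time the csv module rejects the value.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit in a C long on this platform; try a smaller one.
            limit //= 10


applied = set_largest_csv_field_limit()
print("csv field size limit is now", applied)
```

Note that, according to the traceback, the installed torchtext calls csv.field_size_limit(sys.maxsize) itself inside unicode_csv_reader, so setting the limit beforehand is probably not enough on its own. The failing call in torchtext\utils.py would need to be replaced with a loop like this one, or torchtext updated to a release that already guards the call, if one is available for your setup.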