How to define a PyTorch Dataset class that takes datasets with and without labels

Hello, I am trying to define a Dataset class that is supposed to handle datasets both with and without labels, but I am getting an error:

item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
AttributeError: 'list' object has no attribute 'items'

Code:

# Create torch dataset
class Dataset(torch.utils.data.Dataset):
	def __init__(self, encodings, labels=None):
		self.encodings = encodings
		self.labels = labels

	def __getitem__(self, idx):
		item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
		if self.labels:
			item["labels"] = torch.tensor(self.labels[idx])
			#print(item)
		return item

	def __len__(self):
		print(len(self.encodings["input_ids"]))
		return len(self.encodings["input_ids"])


# prepare data for classification

tokenizer = FlaubertTokenizer.from_pretrained(model_name)
print("Transform xml file to pandas series core...")
text, file_name = transform_xml_to_pd(file)  # transform xml file to pd

# Xtest_emb, s = get_flaubert_layer(Xtest['sent'], path_to_model_lge)  # index 2 correspond to sentences

print("Preprocess text with spacy model...")
clean_text = make_new_traindata(text['sent'])  # clean text; 0 = raw text; etc.

X = list(clean_text)
X_text_tokenized = []

for x in X:
	x_encoded = tokenizer(str(x), padding="max_length", truncation=True, max_length=512)
	X_text_tokenized.append(x_encoded)

X_data = Dataset(X_text_tokenized)

print(type(X_data))
print(X_data['input_ids'])
Error:


  File "/scriptTraitements/classifying.py", line 153, in __getitem__
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
AttributeError: 'list' object has no attribute 'items'


Any idea?

Furthermore, how can I access elements of the corpus passed to the Dataset class? When printing it, I only get this: `<__main__.Dataset object at 0x7fc93bec1df0>`

It says that you're not supposed to use `.items()` (a dict method) on a list object.

`self.encodings` is a list (of dictionaries), so it does not have an `.items()` method. Calling the tokenizer once per sentence in your loop produces one dict per example, and `X_text_tokenized` is the list of those dicts. Guessing that we need to index that list to get one item, I think you need to update

item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

to

item = {key: torch.tensor(val) for key, val in self.encodings[idx].items()}
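Here is a minimal sketch of the whole class with that fix applied, under the assumption that `encodings` is a list of per-example dicts (as your tokenization loop produces). Note two extra adjustments: `__len__` must also change, since `len(self.encodings["input_ids"])` fails on a list, and `if self.labels:` is safer written as `if self.labels is not None:` so an all-zero or empty label list isn't silently skipped. The fake encodings at the bottom are just stand-ins for the tokenizer output:

```python
import torch


class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings  # list of dicts, one per example
        self.labels = labels

    def __getitem__(self, idx):
        # Index the list first, then iterate over that one example's dict
        item = {key: torch.tensor(val) for key, val in self.encodings[idx].items()}
        if self.labels is not None:  # "is not None" so falsy label lists still work
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        # The list itself has one entry per example
        return len(self.encodings)


# Minimal fake encodings standing in for the tokenizer output
fake_encodings = [
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]},
    {"input_ids": [4, 5, 6], "attention_mask": [1, 1, 0]},
]
ds = Dataset(fake_encodings)
print(len(ds))             # 2
print(ds[0]["input_ids"])  # tensor([1, 2, 3])
```

This also answers the access question: since the class defines `__getitem__`, you index it like a list (`ds[0]`), which returns the dict of tensors for that example; printing the object itself only shows the default repr. Alternatively, you could pass the whole list of sentences to the tokenizer in one call (`tokenizer(X, padding="max_length", truncation=True, max_length=512)`); that returns a single dict-like batch encoding of lists, and your original `__getitem__` with `self.encodings.items()` would then work unchanged.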