Application of map(), then save_to_disk() leaves unchanged dataset?

rbelew · July 9, 2024, 12:14am

I must be misunderstanding something: if I do this:

ds = datasets.load_dataset(csvDir,data_files=dsFiles,download_mode="force_redownload")
ds.map(preprocessOne)
ds.save_to_disk(mapDir)

and then reload via ds2 = datasets.load_from_disk(mapDir), the new ds2 does not show the effects of the preprocessOne mapping?! What am I missing?

anantguptadbl · July 9, 2024, 4:57am

Let us take a look at what ds actually produces

This is the important part

<class 'datasets.formatting.formatting.LazyRow'>

You can get more info on this here

The root cause is that you need to leverage the return value of map() to get the updated data

import os

import pandas as pd
from datasets import load_dataset

# Write a small CSV so there is something to load as a dataset
os.makedirs("data", exist_ok=True)
data = pd.DataFrame([[1,2,3], [4,5,6]], columns=['col1', 'col2', 'col3'])
data.to_csv("data/randomData.csv", index=False)
ds = load_dataset("data", data_files="randomData.csv", download_mode="force_redownload")

def preprocessOne(rowVal):
    rowVal['col1'] = 99
    return rowVal

# map() returns a new DatasetDict; ds itself is left unchanged
dsUpdated = ds.map(preprocessOne)
print(dsUpdated['train'][0])
dsUpdated.save_to_disk("data")

rbelew · July 9, 2024, 4:00pm

thanks for your comments @anantguptadbl but I’m afraid I’m missing your point. The doc you point to is flagged as legacy, with the current doc being this: Know your dataset.

and I am “leveraging” the return value. my preprocess function is for multiple choice examples (à la SWAG, cf. Google Colab)

def preprocessOne(example):
	first_sentences = [example[Context_name] for i in range(NMultChoice)]
	second_sentences = [ f"{example[str(i+1)]}" for i in range(NMultChoice) ]
	flat_first = list(chain(first_sentences))
	flat_second = list(chain(second_sentences))
	tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
	tokDict = {k: [v[i : i + NMultChoice] for i in range(0, len(v), NMultChoice)] for k, v in tokenized_examples.items()}	        
	tokDict['label'] = example[Label_name]
	tokDict['idx'] = example[Idx_name]
	tokDict['scores'] = {f'{i}': example[f'{i+6}'] for i in range(NMultChoice)}		
	return tokDict

anantguptadbl · July 9, 2024, 4:58pm

@rbelew
Okay perfect, you are returning the value from preprocessOne.
Since the result of ds.map() was not assigned to any variable, I assumed that you were not returning anything from the function. You just need to use the updated dataset that map() returns:

ds = ds.map(preprocessOne)
ds.save_to_disk(mapDir)

rbelew · July 9, 2024, 5:27pm

ha, it was that simple! i just had to use ds2 = ds.map(preprocessOne) vs. what I had assumed as in-place ds.map(preprocessOne). thanks so much @anantguptadbl

anantguptadbl · July 9, 2024, 6:48pm

@rbelew No problemo. Please mark it as the solution

rbelew · July 9, 2024, 10:38pm

done! Now I’m having some issue with DataCollators and batches, but I’ll save that for another thread. thanks again.