Create Json Dataset

Hi, I am trying to write a class that reads json files and builds a dataset for use in a CNN.
The json files look like this (pose keypoints from images):

{
  "0": {
    "PoseKeypoints": [
      [
        2529.287109375,
        1424.733642578125,
        0.9513001441955566
      ],
      [
        2574.495849609375,
        1384.9786376953125,
        0.9392595291137695
      ],
      [ ...

I have to store these keypoints in a PyTorch tensor.
My idea is to use iterable objects in the class, but I don’t know how to do it.
Thank you!!

I have written this Dataset class so far:

data = []

class json_dataset(Dataset):
    def __init__(self, root_dir):
        self.root_dir= root_dir
    def __getitem__(self,index):
        for file in os.listdir(self.root_dir):
            if file.endswith('json'):
                json_path = os.path.join(self.root_dir, file)
                
                #json_data = pd.read_json(json_path, lines=True)
                json_data = json.load(open(json_path))
                
                for keypoints in json_data.items():
                    valores = keypoints['PoseKeypoints']
                    keypoints_normalized.append(valores)
                
                data.append(json_data)
                data = torch.FloatTensor(data)
            return(data)

What do you think?

The general loading looks alright (replace torch.FloatTensor with e.g. torch.from_numpy or torch.tensor), but the for loop looks wrong.
In the __getitem__ method you would use the index to load a single sample, while it seems you are trying to iterate over all json files and return the very first one at all indices.
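For the tensor creation, something like this would work (using the two keypoints from your snippet as a small example):

import numpy as np
import torch

# The two [x, y, confidence] keypoints from the json snippet above.
keypoints = [
    [2529.287109375, 1424.733642578125, 0.9513001441955566],
    [2574.495849609375, 1384.9786376953125, 0.9392595291137695],
]

t = torch.tensor(keypoints, dtype=torch.float32)              # copies the nested list
t2 = torch.from_numpy(np.array(keypoints, dtype=np.float32))  # shares memory with the numpy array
print(t.shape)  # torch.Size([2, 3])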


Now I have this:

class json_dataset(Dataset):
    keypoints = []

    def __init__(self, root_dir):
        self.root_dir = root_dir

    def __str__(self):
        return str(data)  # ???

    def __getitem__(self, index):
        for file in os.listdir(self.root_dir):
            if file.endswith('json'):
                json_path = os.path.join(self.root_dir, file)

                json_data = json.load(open(json_path))

                for k in json_data["0"]["PoseKeypoints"]:
                    keypoints.append(k)

                data.append(keypoints)
                data = torch.Tensor(data)
        return data

How do I do what you suggested about using the index?

Also, I want to check whether the returned data is correct, but when I print it I get an empty tensor. I tested the same logic outside of a class and there it returned the values correctly.
The __str__ function is there for this.

Thank you!

In the common use case you have a specific number of samples in the Dataset and this number of samples is returned in the Dataset.__len__ function. In the __getitem__ method you are using the index (which has values in the range [0, len(dataset)-1]) to load a single sample for this index.
E.g. if each sample is stored in a separate json file in the self.root_dir you could load the corresponding file using the index instead of iterating all files.
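Something along these lines would work as a sketch (the class name, the '.json' filtering, and the "0"/"PoseKeypoints" keys are assumptions based on your snippet):

import json
import os

import torch
from torch.utils.data import Dataset


class JsonKeypointDataset(Dataset):
    def __init__(self, root_dir):
        self.root_dir = root_dir
        # Collect the json files once; each file corresponds to exactly one index.
        self.files = sorted(f for f in os.listdir(root_dir) if f.endswith('.json'))

    def __len__(self):
        # The number of samples equals the number of json files.
        return len(self.files)

    def __getitem__(self, index):
        # index is in [0, len(self) - 1]; load only the file it refers to.
        json_path = os.path.join(self.root_dir, self.files[index])
        with open(json_path) as f:
            json_data = json.load(f)
        keypoints = json_data["0"]["PoseKeypoints"]
        return torch.tensor(keypoints, dtype=torch.float32)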

Add print statements to the __getitem__ method and check which objects are valid and where the tensor becomes an empty one.
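Or mirror the loading logic in a small standalone helper and print each step, e.g. (inspect_sample is just a made-up name):

import json

import torch


def inspect_sample(json_path):
    # Prints every intermediate object of the loading logic for one file.
    with open(json_path) as f:
        json_data = json.load(f)
    print("keys:", list(json_data.keys()))      # is "0" actually a key in this file?

    keypoints = []
    if "0" in json_data:
        for k in json_data["0"]["PoseKeypoints"]:
            keypoints.append(k)
    print("num keypoints:", len(keypoints))     # an empty list gives an empty tensor

    data = torch.tensor(keypoints)
    print("shape:", data.shape)                 # e.g. torch.Size([num_keypoints, 3])
    return data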

The __str__ function uses data, which is undefined or globally defined, so check what is being printed there.


Ok, I made some changes:

class json_dataset(Dataset):
    def __init__(self, csv_file, root_dir):
        self.annotations = pd.read_csv(csv_file)
        self.root_dir = root_dir

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        # The csv holds one json filename per row; the index selects a single row.
        json_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
        with open(json_path) as f:
            json_data = json.load(f)

        keypoints = []
        if "0" in json_data:
            for k in json_data["0"]["PoseKeypoints"]:
                keypoints.append(k)

        data = torch.tensor(keypoints)
        return data

The csv file has the names of the json files. Now it works, thank you!
But I have a question: how does the index work so that all the images are accessed during training? I also need the dataset to load the files sorted by their filenames.
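
A DataLoader requests every index from 0 to len(dataset) - 1 once per epoch (in order with shuffle=False, in a random permutation with shuffle=True), so if the rows of the csv are sorted by the filenames, the samples come back in filename order. A small usage sketch with hypothetical paths, assuming every json file holds the same number of keypoints:

from torch.utils.data import DataLoader

# Hypothetical file names, only to illustrate the iteration order.
dataset = json_dataset('annotations.csv', 'path/to/json_dir')
loader = DataLoader(dataset, batch_size=4, shuffle=False)  # shuffle=True would permute the indices

for batch in loader:
    # Per epoch the loader asks __getitem__ for every index in [0, len(dataset)-1] exactly once.
    print(batch.shape)  # e.g. [4, num_keypoints, 3]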