How to convert torch_geometric Data objects to bytes, or how to load them faster?

Kinny_Fly · April 30, 2021, 3:52am

Hello! I've recently been working on GCNs and built my own dataset of about 150,000 samples, each stored as a torch_geometric Data object. Specifically, I subclassed Data to add an extra attribute z, which holds a second target, since my model has two targets:

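from torch_geometric.data import Data

class MyData(Data):
    # defaults of None keep the class constructible without arguments,
    # which PyG may rely on when it batches or copies Data objects
    def __init__(self, x=None, y=None, z=None, edge_index=None, pos=None):
        super(MyData, self).__init__(x=x, y=y, edge_index=edge_index, pos=pos)
        self.z = z  # second target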

I saved the samples (each one named sample) one by one with torch.save:

torch.save(sample, f'new_graph_data/{image_id}_{gt_id}.pt')

As a result I have 150,000 .pt files, and loading them the usual way is EXTREMELY slow:

import os

import torch
from torch_geometric.data import Dataset

data_dir = 'F:/CODE/Pytorch/Diplomer/new_graph_data/'
# os.listdir returns names in arbitrary order; sort for a reproducible split
all_data_path = sorted(os.listdir(data_dir))
train_ = 5000 # for testing
test_ = 6000
train_data_path = all_data_path[:train_]
test_data_path = all_data_path[train_:test_]

class MyOwnDataset(Dataset):
    def __init__(self, root, mode):
        super(MyOwnDataset, self).__init__()
        if mode not in ('train', 'test'):
            raise ValueError("mode must be 'train' or 'test'")
        self.root = root
        self.mode = mode
        # pick the file list once instead of branching on every access
        self.paths = train_data_path if mode == 'train' else test_data_path

    def get(self, idx):
        # one torch.load per sample: this per-file read is the slow part
        return torch.load(os.path.join(data_dir, self.paths[idx]))

    def len(self):
        return len(self.paths)

After some research, I found that the slowdown comes from loading a huge number of small files (since I saved each sample to its own .pt file). So I decided to use LMDB to store and load my data:

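import torch
from tqdm import tqdm

# env (an open LMDB environment), all_file_list and keys are defined earlier
txn = env.begin(write=True)
tqdm_iter = tqdm(enumerate(zip(all_file_list, keys)), total=len(all_file_list), leave=False)
for idx, (path, key) in tqdm_iter:
    tqdm_iter.set_description('Write {}'.format(key))
    key_byte = key.encode('ascii')
    data = torch.load(path)
    txn.put(key_byte, data)  # fails here: data is a Data object, not bytes
txn.commit()  # without a commit the writes are discarded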

However, LMDB only accepts bytes as values, and I don't know how to convert a Data object to bytes.
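For reference, a minimal sketch of what such a conversion might look like, assuming the samples can be pickled (torch.save itself is built on pickle, so this should hold for Data objects as long as the MyData class is importable when reading). The helper names to_bytes and from_bytes, the stand-in sample, and the example key are hypothetical, and a plain dict stands in for the LMDB transaction so the snippet runs without extra packages. With torch, writing into an io.BytesIO via torch.save(data, buffer) and taking buffer.getvalue() should give an equivalent bytes value.

```python
import pickle


def to_bytes(sample):
    """Serialize a picklable sample into a bytes value for a key-value store."""
    return pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL)


def from_bytes(blob):
    """Rebuild the sample from the bytes returned by the store.

    Only unpickle data you wrote yourself: pickle can execute code on load.
    """
    return pickle.loads(blob)


# Stand-in for one graph sample (hypothetical fields, lists instead of tensors).
sample = {
    'x': [[0.1, 0.2], [0.3, 0.4]],
    'edge_index': [[0, 1], [1, 0]],
    'y': 1,
    'z': 0,
}

blob = to_bytes(sample)
key_byte = '12_3'.encode('ascii')  # hypothetical "<image_id>_<gt_id>" key

store = {}              # a dict stands in for the LMDB transaction
store[key_byte] = blob  # with LMDB: txn.put(key_byte, blob)
restored = from_bytes(store[key_byte])  # with LMDB: from_bytes(txn.get(key_byte))
```

In the write loop this would replace the failing call with txn.put(key_byte, to_bytes(data)), and the dataset's get method would read the value back with from_bytes(txn.get(key_byte)).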
Can you give me some advice?
Or do you know other efficient ways to load torch_geometric Data? Any idea that helps me load faster is welcome.
Thank you so much!
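As one possible alternative to a key-value store, the samples could be packed into a small number of shard files, so that loading 150,000 samples opens a few hundred files instead of 150,000. The sketch below is only an illustration under assumptions: write_shards, read_sample, the shard size, and the file naming are all hypothetical, plain dicts stand in for Data objects, and pickle is used so the snippet runs with the standard library alone.

```python
import os
import pickle
import tempfile


def write_shards(samples, out_dir, shard_size):
    """Write samples in groups of shard_size, one pickle file per group."""
    paths = []
    for start in range(0, len(samples), shard_size):
        path = os.path.join(out_dir, f'shard_{start // shard_size:05d}.pkl')
        with open(path, 'wb') as f:
            pickle.dump(samples[start:start + shard_size], f,
                        protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths


def read_sample(paths, shard_size, idx, cache):
    """Return sample idx, loading its shard at most once per cache."""
    shard_id, offset = divmod(idx, shard_size)
    if shard_id not in cache:
        with open(paths[shard_id], 'rb') as f:
            cache[shard_id] = pickle.load(f)
    return cache[shard_id][offset]


# Stand-in samples; real code would pass the Data objects instead.
samples = [{'id': i, 'x': [i, i + 1]} for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_shards(samples, tmp, shard_size=4)
    cache = {}
    restored = [read_sample(paths, 4, i, cache) for i in range(len(samples))]
```

A dataset's get method could call read_sample(paths, shard_size, idx, cache). The cache here grows without bound, so with shuffled access over a large dataset it would need a size limit. If the whole dataset fits in memory, PyG's InMemoryDataset, which collates all graphs into a single processed file, may be the simpler route.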