Hi,
My model is a transformer and I have a single GPU, so naturally I have CUDA memory constraints. I've saved Faster R-CNN features in a TSV file that is 40 GB in size. When I start training, all the data from the features file is loaded into CPU RAM and I run out of memory. I've set num_workers=1, by the way. Another option is to obtain features from Faster R-CNN in real time, but the transformer and Faster R-CNN can't both fit in my GPU RAM. Is there an efficient way of obtaining features from the TSV file?
My code looks like:
import base64
import csv

import numpy as np
from tqdm import tqdm

def load_tsv(fname):
    data = []
    with open(fname) as f:
        # FIELDNAMES is defined elsewhere in my script
        reader = csv.DictReader(f, FIELDNAMES, delimiter="\t")
        for i, item in tqdm(enumerate(reader)):
            for key in ['img_h', 'img_w', 'num_boxes']:
                item[key] = int(item[key])
            boxes = item['num_boxes']
            decode_config = [
                ('objects_id', (boxes, ), np.int64),
                ('objects_conf', (boxes, ), np.float32),
                ('attrs_id', (boxes, ), np.int64),
                ('attrs_conf', (boxes, ), np.float32),
                ('boxes', (boxes, 4), np.float32),
                ('features', (boxes, -1), np.float32),
            ]
            for key, shape, dtype in decode_config:
                # decode the base64 blob into a read-only numpy array
                item[key] = np.frombuffer(base64.b64decode(item[key]), dtype=dtype)
                item[key] = item[key].reshape(shape)
                item[key].setflags(write=False)
            data.append(item)
    return data
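One alternative I'm considering, instead of materializing the whole file: make one cheap pass over the TSV to record the byte offset of each line, then seek and decode a single row on demand. Something like this sketch (the `FIELDNAMES` list and the `LazyTSVFeatures` class name are my own assumptions, not from any library) could then back a PyTorch Dataset's `__getitem__` so only one row at a time ever sits in RAM:

```python
import base64

import numpy as np

# Assumed column order; adjust to match the actual TSV schema.
FIELDNAMES = ["img_id", "img_h", "img_w", "num_boxes",
              "objects_id", "objects_conf", "attrs_id", "attrs_conf",
              "boxes", "features"]

class LazyTSVFeatures:
    """Index line offsets once; decode one row per lookup."""

    def __init__(self, fname):
        self.fname = fname
        self.offsets = []
        # Single pass in binary mode: len(line) counts bytes exactly,
        # so accumulating it gives each line's starting offset.
        with open(fname, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, i):
        # Seek straight to row i and read just that line.
        with open(self.fname, "rb") as f:
            f.seek(self.offsets[i])
            line = f.readline().decode("utf-8")
        # base64 text contains no tabs, so a plain split is safe.
        item = dict(zip(FIELDNAMES, line.rstrip("\n").split("\t")))
        for key in ("img_h", "img_w", "num_boxes"):
            item[key] = int(item[key])
        boxes = item["num_boxes"]
        for key, shape, dtype in [
            ("objects_id", (boxes,), np.int64),
            ("objects_conf", (boxes,), np.float32),
            ("attrs_id", (boxes,), np.int64),
            ("attrs_conf", (boxes,), np.float32),
            ("boxes", (boxes, 4), np.float32),
            ("features", (boxes, -1), np.float32),
        ]:
            item[key] = np.frombuffer(
                base64.b64decode(item[key]), dtype=dtype).reshape(shape)
        return item
```

With this, the DataLoader workers would each decode rows independently instead of sharing one giant in-memory list; reopening the file per lookup also avoids sharing a file handle across workers.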