Memory issue in NVIDIA RetinaNet

Hi,

I’m trying to train on my own data. I generated a COCO-like JSON with my training and validation sets and am trying to run it with default argument values, except --images, which I’m excluding since my data comes from multiple repos.
Something is loading too many batches into memory and I get:

RuntimeError: CUDA out of memory. Tried to allocate 348.35 GiB (GPU 0; 7.93 GiB total capacity; 591.43 MiB already allocated; 6.54 GiB free; 26.57 MiB cached)

How do I configure it to adjust the number of batches?

Thanks

Could you post a link to the repo you are using?
Most likely there is a flag like --batch_size or -b that lets you adjust the batch size.
Alternatively, have a look at the DataLoader initialization in the code and change the batch_size argument there.
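
For reference, in a typical PyTorch training script the relevant part usually looks something like this (train_dataset is just a placeholder name, not something from your repo):

from torch.utils.data import DataLoader

# batch_size controls how many images are stacked per iteration;
# lowering it is the usual first fix for a CUDA OOM error
loader = DataLoader(train_dataset, batch_size=2, shuffle=True, num_workers=4)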

Ok.
I have a config.py file with:

annotations = "/input/train/annotations.json"
val_annotations = "/input/valid/annotations.json"
backbone = "ResNet50FPN"
classes = 3
model = "retinanet_rn50fpn.pth"
fine_tune = "/media/Data/ObjectDetector/resnet50-19c8e357.pth"
iters = 100
val_iters = 10
lr = 0.0001
resize = 516
batch = 1
max_size = 516

Then in UseExample.py:
import retinanet.config_train as config
import retinanet.main as main
args = [
    "train",
    config.model,
    "--backbone", config.backbone,
    "--annotations", config.annotations,
    # "--val-annotations", config.val_annotations,
    "--classes", str(config.classes),
    "--resize", str(config.resize),
    "--batch", str(config.batch),
    "--max-size", str(config.max_size),
    "--iters", str(config.iters),
    "--lr", str(config.lr),
]
main.main(args)

Do you see this OOM error only when using your custom dataset, or also with the original one?
If this error is raised only for your custom dataset, are you using the same image resolution, or are you working with larger images? In the latter case, could you resize your custom images to the same size and try running the code again?
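
If resizing offline helps, a rough sketch with Pillow could look like this (the paths are placeholders, and note that if your annotations store pixel coordinates, the boxes would need the same scaling):

from PIL import Image

img = Image.open("example.jpg")
img.thumbnail((516, 516))  # shrinks in place, keeping aspect ratio
img.save("example_resized.jpg")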

Since the batch size is already set to 1, you cannot lower it further and will need to save memory in another part of your training.
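
One option with your config.py would be to shrink the input resolution, since activation memory grows roughly quadratically with image size; for example (384 is just an illustrative value):

resize = 384
max_size = 384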

My mistake:

My COCO-like JSON metadata file was messed up. Now it works wonderfully.
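
In case anyone else runs into this: a quick sanity check is to load the annotations file with pycocotools before training (assuming pycocotools is installed; the path below is just from my setup):

from pycocotools.coco import COCO

# Fails loudly if the JSON is malformed or missing required COCO keys
coco = COCO("/input/train/annotations.json")
print(len(coco.getImgIds()), "images,", len(coco.getCatIds()), "categories")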

Sorry for bothering you.