Does batch size have any effect on divergence of the training algorithm?

I want to implement YOLO (You Only Look Once) in PyTorch. I wrote its code and set the batch size to 64, but when I ran the algorithm, the cost always increased. When I set the batch size to 32, the cost decreased over the long term. Could you please tell me whether this is logical or not?

A smaller batch size gives the gradients enough noise to jump out of narrow valleys. That being said, 64 is not that big a size. Are you looking at the loss per image or the total loss (which would be higher for a larger batch size)?
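To illustrate the point about per-image vs. total loss, here is a minimal pure-Python sketch (not the poster's code): a summed loss grows with batch size even when the model is doing equally well per image, while a mean loss stays comparable across batch sizes.

```python
# Illustrative only: per-image losses of 0.5 for every image in the batch.
def total_loss(per_image_losses):
    # Summed over the batch, like reduction="sum" in PyTorch losses.
    return sum(per_image_losses)

def mean_loss(per_image_losses):
    # Averaged over the batch, like reduction="mean" (PyTorch's default).
    return sum(per_image_losses) / len(per_image_losses)

batch32 = [0.5] * 32
batch64 = [0.5] * 64

print(total_loss(batch32), total_loss(batch64))  # 16.0 vs 32.0: sum scales with batch size
print(mean_loss(batch32), mean_loss(batch64))    # 0.5 vs 0.5: mean is comparable
```

So a larger total loss at batch size 64 does not by itself mean the training is worse; comparing the mean loss per image is the fair comparison.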

Thanks for your response @Mika_S! I found that one of my code lines had a problem: it was producing a NaN value. I would like to know, have you ever read the YOLO paper?
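For anyone hitting the same issue: a hypothetical sketch of how a NaN-producing line can be caught early by checking intermediate values (the helper name `check_finite` is mine, not from the thread). In actual PyTorch code, `torch.isnan(tensor).any()` or `torch.autograd.set_detect_anomaly(True)` serve the same purpose.

```python
import math

def check_finite(name, values):
    # Raise as soon as a NaN or infinity appears, naming the offending value,
    # so the bad line is localized instead of silently poisoning the loss.
    for v in values:
        if math.isnan(v) or math.isinf(v):
            raise ValueError(f"{name} contains a non-finite value: {v!r}")

loss_terms = [0.3, 0.7, float("nan")]
try:
    check_finite("loss_terms", loss_terms)
except ValueError as e:
    print(e)  # reports the NaN in loss_terms
```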

I read the YOLO paper about a year back. But shoot me questions and I will try to answer as best as I can :).

I have a question about its cost function implementation, because the reference source code is written in C and I am a newbie in C. Do you have any Python implementation of the cost? I have implemented it myself, but I think the YOLO authors use some undocumented tricks to get the best results. Would it be possible to collaborate on a Python implementation of it?
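As a starting point for that collaboration, here is a hedged pure-Python sketch of the YOLOv1 cost terms from the paper for a single grid cell, assuming one predicted box per cell; the variable names and `cell_loss` helper are illustrative, not taken from the darknet source. `LAMBDA_COORD = 5` and `LAMBDA_NOOBJ = 0.5` are the weights given in the paper.

```python
import math

LAMBDA_COORD = 5.0   # up-weights localization error (paper's lambda_coord)
LAMBDA_NOOBJ = 0.5   # down-weights confidence error in empty cells

def cell_loss(pred, target, has_object):
    """Sum-squared-error terms for one cell.

    pred/target: dicts with keys x, y, w, h, conf, and classes (list of
    class probabilities). All are illustrative placeholders.
    """
    if not has_object:
        # Empty cell: only the no-object confidence term applies,
        # pushing predicted confidence toward zero.
        return LAMBDA_NOOBJ * (pred["conf"] - 0.0) ** 2

    coord = (pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2
    # Square roots of width/height damp the penalty on large boxes
    # relative to small ones, as in the paper.
    size = (math.sqrt(pred["w"]) - math.sqrt(target["w"])) ** 2 \
         + (math.sqrt(pred["h"]) - math.sqrt(target["h"])) ** 2
    conf = (pred["conf"] - target["conf"]) ** 2
    cls = sum((p - t) ** 2 for p, t in zip(pred["classes"], target["classes"]))
    return LAMBDA_COORD * (coord + size) + conf + cls
```

The full loss also requires picking, per cell, the predictor with the highest IoU against the ground truth as the "responsible" box; that assignment step is what the C code handles and is omitted here.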

I started a post in the Google group of darknet (YOLO's base framework) link, but did not get any answers.

Sorry for the late reply. Unfortunately, I do not have an implementation of the cost in Python.

Have you looked at this: