The difference is that if you use the detached tensor to compute, say, a loss, the gradients of that loss are propagated backward only up to the point of detachment. If you use the non-detached tensor instead, the gradients keep flowing further back through the graph.

So in this example, the gradients of z1 flow back to both m1 and m2, while the gradients of z2 only reach m2. The reason is that z2 is computed from h2, which is detached from the computation graph.
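
Here is a minimal sketch of that behavior. The setup is my own guess at the example: I'm assuming m1 and m2 are small `nn.Linear` layers, h1 is the output of m1, and h2 is `h1.detach()`. The layer sizes and the input `x` are made up purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two modules in the example.
m1 = nn.Linear(4, 3)
m2 = nn.Linear(3, 2)
x = torch.randn(5, 4)

h1 = m1(x)          # still attached to the graph, so it leads back to m1
h2 = h1.detach()    # same values, but cut off from the graph

z1 = m2(h1)         # depends on both m1 and m2
z2 = m2(h2)         # depends on m2 only, as far as autograd is concerned

# Backward through z2 first: gradients stop at the detachment point.
z2.sum().backward()
m1_grad_after_z2 = m1.weight.grad           # None, nothing reached m1
m2_grad_after_z2 = m2.weight.grad.clone()   # m2 did receive a gradient

# Backward through z1: gradients flow through m2 and on into m1.
z1.sum().backward()
m1_grad_after_z1 = m1.weight.grad           # now populated

print("m1 grad after z2.backward():", m1_grad_after_z2)
print("m1 grad populated after z1.backward():", m1_grad_after_z1 is not None)
```

After `z2.sum().backward()` the parameters of m1 have no gradient at all, while m2 does. Only the backward pass through z1 reaches m1.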