I have a network with input X of size m x 2 and output Y of size m x 1. How can I compute the elementwise derivative of the output with respect to the input, i.e. a matrix dX of size m x 2
where dX_i,1 = dY_i / dX_i,1 and dX_i,2 = dY_i / dX_i,2?

The problem with using the jacobian is that it produces matrices that are too large, and I run into memory problems.
For example, if the input is 10x2 and the output is 10x1, the jacobian is 10x20, and I only need the diagonal elements of the two 10x10 matrices stacked next to each other.

Yeah, so I don’t think there is any way to directly generate just the diagonal unless you have special properties you know about (like "only the i-th input influences the i-th output", which is what we have in minibatches, except with batch norm).
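If that property does hold for your network (each row of Y depends only on the same row of X, so no batch norm or other cross-sample coupling), one common workaround is to differentiate the sum of the outputs: all the off-diagonal terms are zero, so the gradient of Y.sum() with respect to X is exactly the m x 2 matrix you want. Here is a minimal sketch, assuming PyTorch; `net`, the layer sizes, and `m` are made up for illustration and are not from your setup. The full jacobian is only computed at the end as a sanity check on a small example.

```python
import torch

torch.manual_seed(0)

m = 10  # hypothetical batch size
# Hypothetical per-sample network: no batch norm, so row i of Y depends only on row i of X.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 16),
    torch.nn.Tanh(),
    torch.nn.Linear(16, 1),
)

X = torch.randn(m, 2, requires_grad=True)
Y = net(X)  # shape (m, 1)

# Because dY_k / dX_i is zero for k != i, the gradient of the sum
# reduces to dY_i / dX_i,j in row i, column j.
(dX,) = torch.autograd.grad(Y.sum(), X)  # shape (m, 2)

# Sanity check against the full jacobian (only feasible for small m).
J = torch.autograd.functional.jacobian(net, X)  # shape (m, 1, m, 2)
idx = torch.arange(m)
dX_ref = J[idx, 0, idx, :]  # keep only the "diagonal" blocks, shape (m, 2)
print(torch.allclose(dX, dX_ref, atol=1e-6))
```

This needs one backward pass and memory proportional to X, rather than the full m x 2m jacobian. If the rows do interact (for example with batch norm in training mode), the summed gradient mixes contributions from other rows and no longer equals the diagonal.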