How to compute cosine similarity for a mini batch of samples for 2d data

I have an input X of dimension BSD where B is batch_size, S is sample_size, and D is embedding_dimension. I want to compute cosine similarity for X with itself i.e., samples in batch(all pairs). Cosine_similarity is done for 1D or 2D. How to compute efficiently for mini-batch of samples (for all possible pairs). The expected output of dimension 2 * B * S * S. Please help to solve this issue. Thanks in advance.