Libtorch low latency problem

I use std::memcpy to copy images from a high-speed camera into a buffer. The overall idea is that I copy and buffer camera scanlines into a std::vector; when capturing is finished I construct a libtorch tensor from it. This tensor is then used further on in the software for image processing.

This works fine: the time needed to copy the contents to the buffer is typically ~1 millisecond:

torch::from_blob((char*)image->buf, { res.spatial, res.spectral }, torch::TensorOptions().dtype(torch::kUInt8)).unsqueeze(0).clone()

…and the result is correct in terms of memory layout.

However: sometimes the copy takes 40-80 milliseconds, which is 40-80 times (!) longer than expected.

  1. Why is this, and how do I solve it?
  2. Does anyone see anything in the code that could be improved (for speed)?
std::vector<torch::Tensor> result;
result.reserve(nbrOfLines);

camera->startCameraStream();

for (auto lineIndex = 0; lineIndex < nbrOfLines; lineIndex++)
{
    struct image_buffer* image;

    // Busy-wait until the camera delivers the next scanline.
    auto success = camera->getImage(&image);
    while (!success)
        success = camera->getImage(&image);

    result.push_back(torch::from_blob((char*)image->buf, { res.spatial, res.spectral },
                                      torch::TensorOptions().dtype(torch::kUInt8))
                         .unsqueeze(0)
                         .clone());

    // Return the image buffer so it can be filled again...
    camera->returnImage(&image);
}

auto framegrab = torch::cat(result, 0);

I’d suggest you try two things:

  1. Run the code with the no_grad guard on, so that there’s even less overhead overall: Typedef torch::NoGradGuard — PyTorch main documentation

  2. Run the code with torch::set_num_threads(1); just in case multithreading is somehow causing some kind of cache thrashing.

The other common reason that you’d see a big stall like that is that your memory allocator needs to defragment and realloc. That’s more of an “above PyTorch” problem, but you can switch your program’s allocator to jemalloc, which is often faster.
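For what it’s worth, on Linux jemalloc can be injected without recompiling via LD_PRELOAD (the library path is distribution-dependent, and `my_capture_app` is a placeholder for your binary); on Windows you would typically have to link the allocator into the binary instead:

```shell
# Linux only: preload jemalloc for a single run (path varies by distro).
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my_capture_app
```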

Thanks for the advice @smth!

Good tip: I’ll implement the NoGradGuard and torch::set_num_threads(1); tomorrow. There was also an interesting suggestion from a colleague of mine to do the following:

Preallocate a large tensor ‘torch::Tensor framegrab = torch::zeros({#lines, #columns, #wavelengths});’ before the actual grabbing (before the per-line loop), and then fill it up with scanlines.
Something in the order of:

memcpy(framegrab[lineindex, :, :].data(), image_buffer)

The problem here is that I don’t know how to copy the 2D scanlines (cols x wavelengths) into the 3D tensor by using e.g. memcpy.

I also think (actually, I’m quite sure) that you’ve identified the memory allocator as the root cause of the problem, but I have no clue how to switch the allocator to jemalloc. The code runs on Windows 10 with the Visual Studio compiler, C++17, and libtorch.

You can actually do what your colleague suggested much more efficiently than with memcpy by using copy_. And yes, it’s probably better than a result vector + torch::cat.

Something like:

torch::Tensor result_tensor = torch::from_blob((char*)image->buf, { res.spatial, res.spectral }, torch::TensorOptions().dtype(torch::kUInt8));
framegrab[lineindex].copy_(result_tensor);

It’s usually faster than memcpy because we use SIMD and multiple threads to do the copy in parallel (memcpy by default usually runs on a single thread).