I use std::memcpy to copy an image, delivered by a high-speed camera, into a buffer. The overall idea is that I copy and buffer the camera scanlines into a std::vector; when the capture is finished I construct a libtorch tensor from this. That tensor is then used further down in the software for image processing.
This works fine: the time needed to copy the contents to the buffer is typically ~1 millisecond, and the result is correct in terms of memory layout.
However, sometimes the copy takes 40-80 milliseconds, which is 40-80 times (!) longer than expected.
Why is this, and how do I solve it? Does anyone see anything in the code below that could be improved for speed?
std::vector<torch::Tensor> result;
result.reserve(nbrOfLines);
camera->startCameraStream();
auto tensor = torch::empty({res.spatial, res.spectral}, torch::kUInt8);  // note: not used below

for (auto lineIndex = 0; lineIndex < nbrOfLines; lineIndex++)
{
    // Poll until the camera hands over the next scanline.
    struct image_buffer* image;
    auto success = camera->getImage(&image);
    while (!success)
        success = camera->getImage(&image);

    // Wrap the camera buffer without copying, then clone it into an owned tensor.
    result.push_back(torch::from_blob((char*)image->buf,
                                      {res.spatial, res.spectral},
                                      torch::TensorOptions().dtype(torch::kUInt8))
                         .unsqueeze(0)
                         .clone());

    // Return the buffer so it can be filled again...
    camera->returnImage(&image);
}

auto framegrab = torch::cat(result, 0);
Run the code with torch::set_num_threads(1); just in case multithreading is somehow causing some kind of cache thrashing.
The other common reason you'd see a big stall like that is that your memory allocator needs to defragment and reallocate. That's more of an "above PyTorch" problem, but you can switch your program's allocator to jemalloc, which is often faster.
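For what it's worth, the threading hint is a single call made once at startup, before the grab loop; a minimal sketch (the jemalloc swap is a build/link-time change and not shown here):

#include <torch/torch.h>

// Somewhere during program start-up, before camera->startCameraStream():
torch::set_num_threads(1);   // pin libtorch ops to one thread while debugging the stalls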
Good tip: I'll implement the NoGradGuard and torch::set_num_threads(1); tomorrow. There was also an interesting suggestion from a colleague of mine to do the following:
Preallocate a huge tensor, torch::Tensor framegrab = torch::zeros({#lines, #columns, #wavelengths}), before the actual grabbing, and then fill it up with scanlines (one per line index).
Something in the order of: for each line index, copy the 2D scanline (columns × wavelengths) into slice lineIndex of the 3D tensor. The problem here is that I don't know how to do that copy with, e.g., memcpy.
I also think (actually, I'm quite sure) that you have correctly traced the cause of the problem to the memory allocator, but I have no clue how to switch the allocator to jemalloc. The code runs on Windows 10 with the Visual Studio compiler, C++17, and libtorch.
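Regarding the copy into the preallocated 3D tensor: a rough, untested sketch of the memcpy variant, reusing res, nbrOfLines, camera and image_buffer from the snippet above, and assuming each scanline buffer holds exactly res.spatial * res.spectral uint8 values (a freshly allocated tensor is contiguous, so row lineIndex starts at a fixed byte offset):

#include <cstring>   // std::memcpy

// Preallocate the whole cube once, before grabbing starts.
auto framegrab = torch::zeros({nbrOfLines, res.spatial, res.spectral}, torch::kUInt8);
const size_t lineBytes = static_cast<size_t>(res.spatial) * res.spectral;   // one scanline, uint8
uint8_t* dst = framegrab.data_ptr<uint8_t>();

camera->startCameraStream();
for (int64_t lineIndex = 0; lineIndex < nbrOfLines; lineIndex++)
{
    struct image_buffer* image;
    while (!camera->getImage(&image))
        ;   // poll until the next scanline is ready

    // Copy the 2D scanline straight into slice [lineIndex, :, :] of the 3D tensor.
    std::memcpy(dst + lineIndex * lineBytes, image->buf, lineBytes);

    camera->returnImage(&image);
}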
You can actually do what your colleague suggested much more efficiently than with memcpy by using copy_. And yes, it's probably better than a result vector + torch::cat.
It's usually faster than memcpy because we use SIMD and multiple threads to do the copy in parallel (a plain memcpy runs on a single thread).
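A sketch of how that could look, again reusing res, nbrOfLines and camera from the original snippet (untested; from_blob and copy_ are the only libtorch calls involved):

// Preallocate the full frame once.
auto framegrab = torch::zeros({nbrOfLines, res.spatial, res.spectral}, torch::kUInt8);

camera->startCameraStream();
for (int64_t lineIndex = 0; lineIndex < nbrOfLines; lineIndex++)
{
    struct image_buffer* image;
    while (!camera->getImage(&image))
        ;   // poll until the next scanline is ready

    // View the camera buffer without copying, then let libtorch copy it in place
    // into the preallocated row (this is the vectorized/parallel copy mentioned above).
    auto scanline = torch::from_blob((char*)image->buf,
                                     {res.spatial, res.spectral},
                                     torch::kUInt8);
    framegrab[lineIndex].copy_(scanline);

    camera->returnImage(&image);
}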