Why does num_workers>0 still speed up loading when batch_size=1?

Recently in my project, I needed to dig deeper into how DataLoader works.
I know that setting num_workers>0 can make loading data into RAM faster,
BUT with batch_size=1, why does the loading speed still rank num_workers=4 > num_workers=2 > num_workers=0?

The dataset's __getitem__ is commonly defined as:

def __getitem__(self, index):
    path = get_datafile(index)      # resolve the file path for this index
    data = pil_loader(path)         # read and decode the image with PIL
    return data, get_target(index)

Are the workers caching data into RAM when num_workers>0, and how do they know which indices to cache?
Or is a single sample (batch_size=1) loaded by several workers?

Each worker loads a complete batch, i.e. batch_size samples.
The indices used are specified by the sampler or, if no sampler is defined, by the shuffle option of the DataLoader, which will internally create a SequentialSampler or a RandomSampler.
So with batch_size=1, each "batch" is a single sample, and several workers can each prefetch their own upcoming batches in parallel, which is why more workers still help.
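To make that concrete, here is a pure-Python sketch of the mechanism. The helper names (`batch_indices`, `assign_round_robin`, `load_sample`) are illustrative, not the real DataLoader internals, and the real workers are separate processes rather than the threads used here for simplicity:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# --- Part 1: how sampler indices reach workers (illustrative sketch) ---

def batch_indices(sampler, batch_size):
    # group the sampler's indices into batches, like an internal BatchSampler
    it = iter(sampler)
    while batch := list(islice(it, batch_size)):
        yield batch

def assign_round_robin(batches, num_workers):
    # whole batches are handed out to the workers in turn, so even with
    # batch_size=1 each worker gets its own (1-sample) batches to prefetch
    assignment = {w: [] for w in range(num_workers)}
    for i, batch in enumerate(batches):
        assignment[i % num_workers].append(batch)
    return assignment

assignment = assign_round_robin(batch_indices(range(8), batch_size=1), num_workers=4)
print(assignment)  # {0: [[0], [4]], 1: [[1], [5]], 2: [[2], [6]], 3: [[3], [7]]}

# --- Part 2: why that parallel prefetching helps ---

def load_sample(index):
    time.sleep(0.05)  # stand-in for the disk read + PIL decode in __getitem__
    return index

t0 = time.perf_counter()
[load_sample(i) for i in range(8)]      # like num_workers=0: main process loads everything
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:  # like 4 workers prefetching
    list(pool.map(load_sample, range(8)))
parallel = time.perf_counter() - t0

print(f"serial: {serial:.2f}s, 4 workers: {parallel:.2f}s")
```

With 8 one-sample batches, each of the 4 workers prefetches two of them, and the waiting-on-I/O time overlaps instead of accumulating, so the 4-worker run finishes well ahead of the serial one.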

How did you profile the code?