How do I deal with having multiple images per subject for just one of my inputs, while having only one image per subject for the other inputs?

Hi all, I am new to PyTorch and deep learning in general and I am developing a classification model based on a study I found.

The data is initially in signal form, but the study I am following first converts the signals into images (of size 3x150x150) and feeds those into the model, so I did the same. The model takes three image inputs: blood pressure (bp), heart rate (hr), and the ECG and EMG data combined in one image (ecg_emg).

I have 24 hours of data. Since BP and HR were collected only once every 20-30 minutes, it was easy to convert each of those signals into a single 3x150x150 image. However, the ecg_emg sampling rate is 1000 Hz, so I cannot fit that entire signal into one 3x150x150 image.

So I have given the ecg_emg data the shape (num_subjects, graphs_per_subject, 3, 150, 150), whereas the BP and HR data are (num_subjects, 3, 150, 150).

This leads to memory problems, as graphs_per_subject is more than 1000, and even a batch size of 4 causes a CUDA out-of-memory error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.60 GiB. GPU 0 has a total capacity of 79.15 GiB of which 8.32 GiB is free. Including non-PyTorch memory, this process has 70.81 GiB memory in use. Of the allocated memory 70.22 GiB is allocated by PyTorch, and 114.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.5 documentation)

In the scenario above I had 988 images per subject, so each batch was a tensor of shape (4, 988, 3, 150, 150).
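
Just for scale, a batch of that shape is already about a gigabyte of input data on its own (assuming float32), before counting any activations:

num_elements = 4 * 988 * 3 * 150 * 150   # one batch of 4 subjects with 988 images each
print(num_elements * 4 / 1e9)            # ~1.07 GB in float32, inputs alone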

Is this a common problem, and how do people usually approach it? I am of course open to other ideas as well, as I have been stuck on this for a few weeks now. I am torn between heavily downsampling the ecg_emg data or using only 1-10 ecg_emg images per subject, which causes significant loss of information, and using lots of images per subject, which crashes due to memory.

I am also using a DataLoader in my code, but I still get the same error:

# Concatenate the flattened modalities along the feature dimension
data_combined = torch.cat((streaming_data_bp_flat, streaming_data_hr_flat,
                           streaming_data_ecg_emg_flat,
                           structured_data.view(structured_data.size(0), -1)), dim=1)

# 'labels' is already the correct shape [81]
dataset = TensorDataset(data_combined, labels)

# Split dataset into train, validation, and test sets
train_size = int(0.6 * len(dataset))
val_size = int(0.2 * len(dataset))
test_size = len(dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

# Create DataLoaders
batch_size = 4
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=batch_size, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=batch_size, num_workers=4)

Training the model:

for epoch in range(n_epochs):
    model.train()
    train_loss = 0.0
    for combined_data, labels in train_loader:
        # Dynamically split the flattened data into BP, HR, and ECG/EMG
        # Size of each modality after flattening
        bp_size = 67500  # BP: 150 * 150 * 3
        hr_size = 67500  # HR: 150 * 150 * 3
        ecg_emg_size = 67500 * num_ecg_emg_images  # ECG/EMG: num_ecg_emg_images flattened images of 3 * 150 * 150

        # Extract BP, HR, ECG/EMG, and structured data
        bp = combined_data[:, :bp_size]
        hr = combined_data[:, bp_size:bp_size + hr_size]
        ecg_emg = combined_data[:, bp_size + hr_size:bp_size + hr_size + ecg_emg_size]
        
        structured_data = combined_data[:, -70:]  # Extract structured data (last 70 columns)

        # Reshape BP, HR, and ECG/EMG to [batch_size, 3, 150, 150]
        bp = bp.view(bp.size(0), 3, 150, 150)
        hr = hr.view(hr.size(0), 3, 150, 150)
        # Note: unlike BP and HR, ecg_emg stays 5-D, keeping the original per-subject image dimension
        ecg_emg = ecg_emg.view(ecg_emg.size(0), num_ecg_emg_images, 3, 150, 150)

        print("Checking data loader: Shape of ecg_emg: ", ecg_emg.shape)
        
        # Reshape structured data to [batch_size, 1, 70]
        structured_data = structured_data.view(structured_data.size(0), 1, 70)
            
        # Move data to the appropriate device
        bp, hr, ecg_emg, structured_data, labels = (
            bp.to(device),
            hr.to(device),
            ecg_emg.to(device),
            structured_data.to(device),
            labels.to(device),
        )
            
        optimizer.zero_grad()
    
        # Pass the data through the FeatureFusionModel
        outputs = model(bp, hr, ecg_emg, structured_data)
            
        loss = criterion(outputs.squeeze(), labels)
        loss.backward()
        optimizer.step()
            
        train_loss += loss.item()
    
        # Drop references first so the memory can actually be released
        del bp, hr, ecg_emg, structured_data
        gc.collect()  # Trigger garbage collection to clean up unused objects
        torch.cuda.empty_cache()  # Release cached GPU memory back to the driver

Thank you!

Hi Jay!

Your main problem is that 24 hours sampled at 1000 Hz is a lot. You will need to
deal with this head on in order to avoid memory problems.

This is perverse. Not only does converting a “signal” into an image needlessly
increase the size of the data, but it also requires your network to learn that the
images it is being given are images of graphs of “signals” before it can start learning
what those signals mean.

I assume that the 3 means that you are converting your “signals” into three-channel,
RGB images. This makes your input data an additional three times larger, seemingly
without any benefit.

24 hours of 30-minute data would be 48 data points – easily manageable.

But again, even though there is no memory challenge in converting 48 data points
into an image, there is no benefit to doing so and doing so will actually make it
harder for your network to learn what the input data mean.

Leaving the image aside – which you shouldn’t be doing anyway – your core problem
is the very large size of your ECG / EMG data. Assuming that you represent the
value of one data point of an ECG signal as a four-byte float, a single ECG trace
at 1000 Hz over 24 hours is about 345 MBytes. With multiple traces and the additional
memory needed for backpropagation, it will be very challenging to process such data
in GPU memory.
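
For concreteness, the arithmetic behind that estimate (one four-byte float per sample):

samples_per_day = 1000 * 60 * 60 * 24    # 86,400,000 samples at 1000 Hz
print(samples_per_day * 4 / 1e6)         # ~345.6 MB per trace in float32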

Making some speculative assumptions about your use case, perhaps you can sample
at a much lower rate than 1000 Hz. Your heart beats about once per second. Would
it work to sample at, say, 10 Hz? That would give you something like ten points per
heartbeat. Would you actually be losing any relevant information if you sampled at
such a resolution?
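
As a minimal sketch of such downsampling – treating the trace as a 1-D float tensor and using average pooling purely as one illustrative choice – something like this takes 1000 Hz down to 10 Hz:

import torch
import torch.nn.functional as F

ecg = torch.randn(1, 1, 1000 * 60 * 60)   # dummy one-hour trace at 1000 Hz, shape (N, C, L)
factor = 100                              # 1000 Hz -> 10 Hz
ecg_10hz = F.avg_pool1d(ecg, kernel_size=factor, stride=factor)
print(ecg_10hz.shape)                     # torch.Size([1, 1, 36000])

(In practice you would typically low-pass filter before decimating – for example with scipy.signal.decimate – to avoid aliasing.)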

(You might also be able to bin the data. For example, you could take 10-Hz bins
(that is, bins of length one tenth of a second) and process, say, four numbers per
bin, such as the high, low, average, and final values in each tenth-of-a-second bin.
The rationale here – maybe or maybe not relevant to your use case – is that you
may need to sample at a high rate in order not to miss a transient high or low value.
By looking at the high and low in each bin, you wouldn’t miss such values. Even if
you needed to capture such a transient high value, would you really need to know
the precise millisecond at which it occurred?)
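
A sketch of that kind of binning, assuming the trace divides evenly into tenth-of-a-second bins (100 samples each at 1000 Hz):

import torch

fs = 1000                        # sampling rate in Hz
ecg = torch.randn(fs * 60 * 60)  # dummy one-hour trace
bins = ecg.view(-1, fs // 10)    # one row per tenth-of-a-second bin (100 samples)

# four summary values per bin: high, low, average, final
binned = torch.stack(
    (bins.max(dim=1).values,
     bins.min(dim=1).values,
     bins.mean(dim=1),
     bins[:, -1]),
    dim=1,
)
print(binned.shape)              # torch.Size([36000, 4])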

I’m imagining that your classifier is supposed to predict something like “healthy” vs.
“had a heart attack” vs. “about to die.” Do you really need to look at an entire day’s
data to perform such a classification, or is your use case more looking for interesting
events that play out over, say, half an hour to an hour? In such a case, perhaps you
could run your network (both for training and inference) on half-hour sections of data.
What aspect of your use case requires an entire day’s worth of data to be seen by
the network all at the same time?
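
If half-hour sections did turn out to be enough, slicing them out of a day is cheap. A sketch, assuming a 1-D trace already downsampled to, say, 10 Hz:

import torch

fs = 10                               # Hz, after downsampling
day = torch.randn(fs * 60 * 60 * 24)  # dummy 24-hour trace
window = fs * 60 * 30                 # 30 minutes = 18,000 samples
sections = day.view(-1, window)       # 48 non-overlapping half-hour sections
print(sections.shape)                 # torch.Size([48, 18000])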

When I think about the real meaningful content of your use case, it doesn’t strike
me as being particularly memory intensive. If you work with the raw “signal” values,
rather than images, and address somehow the 1000-Hz sampling rate (which seems
like overkill to me), you should be able to train your network without excessive memory
demands.
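
Just to make the raw-signal route concrete – this is not from your paper, and the channel counts, kernel sizes, and input length below are arbitrary placeholders – a network over raw values could be a small 1-D convolutional model along these lines:

import torch
import torch.nn as nn

class SignalNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time dimension
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):              # x: (batch, channels, length)
        return self.classifier(self.features(x).squeeze(-1))

model = SignalNet()
out = model(torch.randn(4, 1, 18000))  # e.g. a batch of half-hour sections at 10 Hz
print(out.shape)                       # torch.Size([4, 2])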

Best.

K. Frank


Hi K. Frank! Thank you for your reply, and I am sorry for the late response (I thought I would get an email if someone replied).

When I first started my research, I wanted to work with signals rather than images to predict stroke. I found a dataset that has ECG and EMG signals, and only one research paper that uses this data to perform stroke classification. However, that particular study transforms the signals into images. It is not something I had thought about before, but the paper is clearly doing it, and it gives little other information about how the data was handled.

I would link the paper here, because I would appreciate help interpreting its methodology, but I am not sure whether that is appropriate (I am new to forums). The workaround I found is that I can use at most 135 images per subject for the 81 subjects before crashing my GPU, so I downsampled the ECG/EMG data to 1.5 Hz, which lets me fit the 24 hours of data into 135 images.

I think the paper's authors may have assumed that images capture better features, and the VGG16 model (which the paper uses) is very good at image classification.

Basically, I just wanted to reproduce the study's results, but the missing steps make everything more confusing. I will look into binning if it is relevant to my case. Thank you again for the tips and for reassuring me that converting these signals to images is an uncommon thing (I really do not understand why one would do that either).

Thank you,
Jalal