I got the DataParallel() example for multi-GPU working successfully for my model.
However, I happen to have a 64-core Xeon Phi CPU, and I can’t stand looking at it sitting idle. How can I saturate the CPU by assigning it some work? In other words, can I split the workload across GPU0, GPU1, and the Phi’s hardware threads (CPU0-255), and sum the gradients on the CPU?
I don’t know very much about Xeon Phi, but I believe Intel ships an MKL build that parallelizes BLAS calls across the whole chip, so it behaves roughly like a single device. I would run some experiments comparing the speed of each part of your network on each device, then use model parallelism to place each submodule on the device where it runs best. Alternatively, you could subclass or modify DataParallel to allow the CPU (Phi) to be one of the included devices.
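The model-parallel approach above might look something like this minimal sketch. The module names and layer sizes are illustrative assumptions, not from the original post: one submodule is kept on the CPU (where MKL parallelizes the matrix math), the other is moved to a GPU, and activations are shipped between them in `forward`:

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Hypothetical split: first stage on CPU, second stage on GPU."""

    def __init__(self):
        super().__init__()
        self.cpu_part = nn.Linear(1024, 1024)   # stays on CPU; MKL parallelizes the GEMM
        self.gpu_part = nn.Linear(1024, 10)
        if torch.cuda.is_available():
            self.gpu_part = self.gpu_part.cuda()  # only this stage lives on the GPU

    def forward(self, x):
        h = self.cpu_part(x)          # computed on CPU
        if torch.cuda.is_available():
            h = h.cuda()              # ship activations across to the GPU stage
        return self.gpu_part(h)

model = SplitModel()
out = model(torch.randn(8, 1024))
```

With timings per submodule in hand, you would assign each stage to whichever device it runs fastest on; the transfer in `forward` is the price you pay for crossing devices.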
CPUs don’t have device IDs; there is just one kind of CPU tensor, and operations on them are implemented in TH and farmed out to MKL, which then has its own strategy for parallelizing over the Phi.
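In other words, you don’t address CPU threads the way you address GPUs; you just let the BLAS backend use its thread pool, which you can inspect and cap from PyTorch. A small sketch (the thread count of 4 is an arbitrary example):

```python
import torch

# CPU tensors carry no device index; parallelism comes from the
# backend's intra-op thread pool (MKL/OpenMP).
torch.set_num_threads(4)          # cap the intra-op thread count
n = torch.get_num_threads()       # read back the current setting

a = torch.randn(512, 512)
b = torch.randn(512, 512)
c = a @ b                         # this GEMM is parallelized by MKL across CPU threads
```

On a Phi, MKL decides how to spread that GEMM over the chip's cores; there is no per-thread device to target from the PyTorch side.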