I am using PyTorch with the ‘fastai’ library that supports the deep learning course of the same name. My setup is Ubuntu 16.04, a Titan V, and Anaconda with Python 3.6.
Here is what puzzles me:
When I run the following code with no other jobs running, it is significantly slower than when the GPU is busy with another process. (Specifically, it is under heavy load running crypto mining software.) I have repeated the trials numerous times to make sure that no pre-computing or caching was skewing the results, and I have tested this off and on over several weeks with the same outcome. I have used nvidia-smi to verify which jobs are running on the GPU. Here are the times:

with-no-load (deep learning only): 42 seconds
with-load (deep learning AND crypto mining): 17 seconds
I’ve searched around on this forum and others and haven’t found an explanation for what might be taking place. I hope I haven’t made a dumb mistake in my observations.
There are a few possible explanations: (1) the CUDA driver is not loaded in persistence mode, so when another process is running you benefit from the driver already being initialized (but this should only account for a few seconds of difference); (2) the crypto mining load is keeping your GPU’s clocks boosted to their maximum, which makes your model run faster. You can test for this by setting the application clocks to the maximum values yourself with nvidia-smi -ac.
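For example, something along these lines (a sketch, assuming the card is GPU index 0; the clock pair must be one the card actually supports, which you can list with nvidia-smi -q -d SUPPORTED_CLOCKS, and both commands need root):

    sudo nvidia-smi -pm 1                # persistence mode: keep the driver loaded between jobs
    sudo nvidia-smi -i 0 -ac 850,1912    # pin application clocks (memory MHz, graphics MHz)

The 850,1912 pair is only an example; substitute the Max Clocks values that nvidia-smi -q reports for your card, memory clock first and graphics clock second.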
Edward - Thanks very much for your response. I appreciate the advice to search through the various options provided by nvidia-smi. Before trying to change settings, I used nvidia-smi -q to see if I could detect differences. I really couldn’t find anything, but I’m not all that familiar with CUDA and GPU settings. In the hope that you might notice something, here are the results of running with-no-load and with-load. (The former running deep learning only and the latter running both deep learning and crypto mining.)
My GPU is a Titan V.
nvidia-smi -i 0 -q
==============NVSMI LOG==============
Timestamp : Sun Jan 28 19:02:35 2018
Driver Version : 387.34
Attached GPUs : 2
GPU 00000000:01:00.0
Product Name : Graphics Device
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324917146221
GPU UUID : GPU-a4aaecfb-979f-4ff7-c1b9-96acbc39b5fc
Minor Number : 0
VBIOS Version : 88.00.36.00.01
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : 900-1G500-2500-000
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1D8110DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x121810DE
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 2000 KB/s
Rx Throughput : 4000 KB/s
Fan Speed : 35 %
Performance State : P2
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 12057 MiB
Used : 1946 MiB
Free : 10111 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 9 MiB
Free : 247 MiB
Compute Mode : Default
Utilization
Gpu : 2 %
Memory : 2 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 49 C
GPU Shutdown Temp : 100 C
GPU Slowdown Temp : 97 C
GPU Max Operating Temp : 91 C
Memory Current Temp : 45 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 36.18 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1200 MHz
SM : 1200 MHz
Memory : 850 MHz
Video : 1080 MHz
Applications Clocks
Graphics : 1200 MHz
Memory : 850 MHz
Default Applications Clocks
Graphics : 1200 MHz
Memory : 850 MHz
Max Clocks
Graphics : 1912 MHz
SM : 1912 MHz
Memory : 850 MHz
Video : 1717 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 6246
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 169 MiB
Process ID : 26771
Type : C
Name : /home/cdaniels/anaconda3/envs/fastai/bin/python
Used GPU Memory : 1764 MiB
diff (between *with-no-load* and *with-load*)
4c4
< Timestamp : Sun Jan 28 19:02:35 2018
---
> Timestamp : Sun Jan 28 19:03:33 2018
54,55c54,55
< Tx Throughput : 2000 KB/s
< Rx Throughput : 4000 KB/s
---
> Tx Throughput : 56000 KB/s
> Rx Throughput : 109000 KB/s
69,70c69,70
< Used : 1946 MiB
< Free : 10111 MiB
---
> Used : 4714 MiB
> Free : 7343 MiB
73,74c73,74
< Used : 9 MiB
< Free : 247 MiB
---
> Used : 13 MiB
> Free : 243 MiB
77,78c77,78
< Gpu : 2 %
< Memory : 2 %
---
> Gpu : 100 %
> Memory : 100 %
132c132
< GPU Current Temp : 49 C
---
> GPU Current Temp : 53 C
136c136
< Memory Current Temp : 45 C
---
> Memory Current Temp : 60 C
140c140
< Power Draw : 36.18 W
---
> Power Draw : 110.65 W
175a176,179
> Process ID : 26975
> Type : C
> Name : ./ethminer
> Used GPU Memory : 2752 MiB
I tried these commands, and nvidia-smi confirmed that the changes took effect. There was, however, no change in performance:
with-no-load (deep learning only): 42 seconds
with-load (deep learning AND crypto mining): 17 seconds
Is it possible that the GPU needs to be under a minimum load before it will shift into a higher clock speed?
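One way I can think of to check this is to watch the performance state and clocks once a second while the model runs, with and without the miner active (this is just a monitoring query; it changes nothing):

    nvidia-smi -i 0 --query-gpu=pstate,clocks.sm,clocks.mem,utilization.gpu --format=csv -l 1

If the SM clock sits well below the 1912 MHz maximum during the deep-learning-only run but climbs when the miner is going, that would point at clock boosting rather than anything in the software stack.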
I keep rerunning these trials and keep getting the same results. Why wouldn’t I get better performance with less running on the GPU? It seems weird that I have to spin up a seemingly unrelated, GPU-intensive process (crypto mining) before getting peak performance from my deep learning models.
There must be something in the crypto mining code that boosts performance, or something in the fast.ai library that is limiting it. Is there standard PyTorch benchmarking code I might use to narrow this down?
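I don’t know of an official benchmark, but a minimal timing script along these lines (my own sketch, not anything from the fastai library) should separate raw GPU throughput from the library: if this plain matmul loop is also faster while the miner runs, the effect is at the driver/clock level and not in fast.ai or my model code.

    import time
    import torch

    # Two largish matrices on the GPU; matmul keeps the SMs busy
    # with almost no host-side overhead.
    a = torch.randn(4096, 4096).cuda()
    b = torch.randn(4096, 4096).cuda()

    # Warm-up so one-time CUDA initialization is not timed.
    for _ in range(10):
        c = torch.matmul(a, b)
    torch.cuda.synchronize()

    # Timed loop; synchronize() is needed because CUDA kernel
    # launches are asynchronous.
    start = time.time()
    for _ in range(100):
        c = torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print('100 matmuls: %.3f s (%.1f ms each)' % (elapsed, 10.0 * elapsed))

Running this once by itself and once alongside the miner should show whether the 42 s vs. 17 s gap reproduces outside of fastai.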