TCP bandwidth is strange

I use the send and recv methods provided in the distributed module, and I get some strange results when I use them to test bandwidth.
The PyTorch versions are 0.3.1 and 0.4.0, the system is Ubuntu 16.04, and the Ethernet is 40G.

First, I use send and recv directly in a script:

import time
import torch
import torch.distributed as dist
import torchvision.models as models
from torch._utils import _flatten_dense_tensors

model = models.alexnet()
size = sum([p.numel() for p in model.parameters()])
# args.rank comes from argparse
dist.init_process_group(rank=args.rank, backend='tcp', init_method='tcp://*******:23456', world_size=2)

param = [p.data for p in model.parameters()]
c = _flatten_dense_tensors(param)

def sendrecv(size, c):
    # data = torch.FloatTensor(int(size))
    data = torch.FloatTensor(c)
    start_time = time.time()
    n = 1
    for i in range(n):
        if dist.get_rank() == 0:
            dist.send(data, dst=1)
            dist.recv(data, src=1)
        else:
            dist.recv(data, src=0)
            dist.send(data, dst=0)
    t = time.time() - start_time
    print('time', t)
    print('bandwidth is:', 4.0 * size * n / t / 1024 / 1024 / 1024, 'GB/s')

for i in range(10):
    sendrecv(size, c)

I ran the script many times with different data.
With data = torch.FloatTensor(int(size)), the bandwidth can reach about 10G, while with data = torch.FloatTensor(c) it only reaches about 1G. Why? I didn't find any data compression for send and recv in DataChannelTCP.cpp.
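One thing worth ruling out first is that the two tensors differ in length, dtype, or memory layout rather than in transfer speed. A quick local check (a sketch that needs no distributed setup; here `c` is just a small stand-in for the real flattened parameter tensor):

```python
import torch

# Stand-in for the flattened AlexNet parameters from the script above;
# the real `c` has ~61M elements, but the layout checks are the same.
c = torch.randn(1000)
size = c.numel()

a = torch.FloatTensor(int(size))  # uninitialized buffer of the same length
b = torch.FloatTensor(c)          # constructed from an existing tensor, as in the script

# Both tensors present the same number of bytes to the TCP channel,
# so the channel itself should not see any difference between them.
for t in (a, b):
    print(t.numel(), t.element_size(), t.is_contiguous())
```

If the printed triples match, the byte counts sent over the wire are identical and the bandwidth gap must come from somewhere else.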

Then I added some timing code to DataChannelTCP.cpp and ChannelUtils.hpp and built PyTorch from source:

// send data (bytes)
  std::time_t start_time, end_time;
  start_time = std::time(NULL);
  for (int i = 0; i < 10; i++) {
      send_bytes<std::uint8_t>(
          socket,
          reinterpret_cast<const std::uint8_t*>(data.data_ptr()),
          tensor_bytes);
  }
  end_time = std::time(NULL);
  std::cout << "sizeof(std::uint8_t): " << sizeof(std::uint8_t) << std::endl;
  std::cout << "size: " << tensor_bytes << std::endl;
  std::cout << "time: " << end_time - start_time << std::endl;
  std::cout << "bandwidth: " << sizeof(std::uint8_t) * tensor_bytes * 10.0 / 1024 / 1024 / 1024 / (end_time - start_time) << " GB/s\n";


while (bytes_to_send > 0) {
    ssize_t bytes_sent;
    SYSCHECK(bytes_sent = ::send(socket, current_bytes, bytes_to_send, flags))
    std::cout << "bytes_sent: " << bytes_sent << std::endl;
    if (bytes_sent == 0)
      throw std::system_error(ECONNRESET, std::system_category());

    bytes_to_send -= bytes_sent;
    current_bytes += bytes_sent;
}
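Small, varying bytes_sent values in this loop are normal TCP behaviour: each ::send call copies at most roughly a socket-buffer's worth of data. A quick way to see the kernel's default send-buffer size (a plain-Python sketch, independent of PyTorch):

```python
import socket

# Query the kernel's default TCP send-buffer size (SO_SNDBUF).
# A small buffer means many ::send calls per tensor, but by itself
# it cannot explain a large bandwidth gap between two equal-sized tensors.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print('default SO_SNDBUF:', sndbuf, 'bytes')
s.close()
```

On Linux the effective window also grows via autotuning, so the value printed here is only the starting point, not a hard cap.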

I get similar results to the first test, and I didn't find any data compression method in DataChannelTCP.cpp or ChannelUtils.hpp. Here are some of the results.
data = torch.FloatTensor(c)

time 2.312767267227173
bandwidth is: 0.09841818919928345 GB/s
time 1.777893304824829
bandwidth is: 0.12802701144223064 GB/s
time 4.207159757614136
bandwidth is: 0.05410262019832476 GB/s
time 3.301678419113159
bandwidth is: 0.06894019876745375 GB/s
time 3.5570905208587646
bandwidth is: 0.06399004049661386 GB/s
time 2.7732274532318115
bandwidth is: 0.08207706375278236 GB/s
time 2.344968795776367
bandwidth is: 0.09706669312182223 GB/s
time 2.197126626968384
bandwidth is: 0.10359820125339961 GB/s
time 2.823824882507324
bandwidth is: 0.08060640299966734 GB/s
time 1.8318305015563965
bandwidth is: 0.12425732964184186 GB/s

data = torch.FloatTensor(int(size))

time 0.6392817497253418
bandwidth is: 0.35605328413906173 GB/s
time 1.0521376132965088
bandwidth is: 0.216338968974515 GB/s
time 0.5532073974609375
bandwidth is: 0.41145213806716313 GB/s
time 0.11163854598999023
bandwidth is: 2.038886877837718 GB/s
time 0.10956907272338867
bandwidth is: 2.0773961193822 GB/s
time 0.1103818416595459
bandwidth is: 2.0620997354068793 GB/s
time 0.10953903198242188
bandwidth is: 2.0779658388472924 GB/s
time 0.11667037010192871
bandwidth is: 1.9509526393120684 GB/s
time 0.11393547058105469
bandwidth is: 1.997783177785218 GB/s
time 0.11133456230163574
bandwidth is: 2.0444537776435796 GB/s
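For reference, a back-of-envelope calculation of what these numbers mean against the 40G link (ignoring protocol overhead; `best` and `worst` are taken from the measured runs above):

```python
# 40 Gbit/s Ethernet, converted to gigabytes per second
link_gbps = 40.0
peak_gb_per_s = link_gbps / 8  # 5.0 GB/s theoretical maximum

best = 2.08    # best measured run, GB/s (the FloatTensor(int(size)) case)
worst = 0.054  # worst measured run, GB/s (the FloatTensor(c) case)

print('best utilisation: %.0f%%' % (100 * best / peak_gb_per_s))
print('worst utilisation: %.1f%%' % (100 * worst / peak_gb_per_s))
```

So even the best case uses well under half of the link, and the flattened-tensor case uses only about 1% of it.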

Can anyone give some suggestions or help? I don't know why the bandwidth utilisation is so low.

I really appreciate any help.

@apaszke Can you give some suggestions?

Has anyone faced the same issue?