Understanding the calculation of memory bandwidth

I came across this blog post and couldn't wrap my head around this calculation.

For simple operators, it's feasible to reason about your memory bandwidth directly. For example, an A100 has 1.5 terabytes/second of global memory bandwidth, and can perform 19.5 teraflops/second of compute. So, if you're using 32 bit floats (i.e. 4 bytes), you can load in 400 billion numbers in the same time that the GPU can perform 20 trillion operations.

How are the values of 400 billion numbers and 20 trillion operations calculated here?

The "400 billion numbers" figure follows directly from the bandwidth: 1.5 terabytes/second divided by 4 bytes per float is 375 billion numbers per second, which the blog rounds up to roughly 400 billion. The corresponding "20 trillion operations" is then obtained by multiplying the compute rate of 19.5 teraflops/second by the time it takes to transfer that data.
Here is a quick example:

mem_bandwidth = 1.5e12   # bytes/second (1.5 TB/s, decimal units)
compute = 19.5e12        # FLOPs/second (19.5 teraflops)
data = 400e9 * 4         # 400 billion fp32 numbers at 4 bytes each
transfer_time = data / mem_bandwidth       # ~1.07 seconds
ops_in_transfer_time = transfer_time * compute
print(ops_in_transfer_time / 1e12)
# ≈ 20.8
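For completeness, the 400-billion figure itself can be checked from the bandwidth alone. A quick sketch, using the same decimal units as the blog (1 TB = 1e12 bytes):

```python
bytes_per_float32 = 4
mem_bandwidth = 1.5e12  # bytes/second of A100 global memory bandwidth

# How many 32-bit floats can be loaded per second at this bandwidth?
numbers_per_second = mem_bandwidth / bytes_per_float32
print(numbers_per_second / 1e9)
# 375.0 — i.e. about 400 billion fp32 numbers per second
```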