# How to generalize fast matrix multiplication on GPU using Numba

### Problem Description

Lately I've been trying to get into programming for GPUs in Python using the Numba library. I have been reading up on it on their website using the tutorial there and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for the fast matrix multiplication a bit (which is of the form A*B=C). When testing I noticed that matrices with dimensions that are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.

I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code reads out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on `ty + i * TPB` and `tx + i * TPB`, but these did not work).

```python
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

#%%

x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
```

I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.

### Recommended Answer

1. This can't possibly be a correct range check:

```python
if x >= C.shape[0] and y >= C.shape[1]:
```

In order to decide that a particular thread in the grid should not do any loading activity, we require either that `x` is out of range or that `y` is out of range. The `and` should have been an `or`.
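Plugging in concrete numbers makes the bug visible. A quick pure-Python check (my own illustration, using a hypothetical 4x4 `C` and a thread that is out of range in one coordinate):

```python
# Hypothetical example: C is 4x4, and thread (x, y) = (5, 2) lies outside C.
x, y = 5, 2
rows, cols = 4, 4

# The posted test only quits when BOTH indices are out of range,
# so this thread slips through and later indexes row 5 of a 4-row matrix.
quits_with_and = (x >= rows) and (y >= cols)
print(quits_with_and)   # False: the thread wrongly keeps running

# With `or`, either out-of-range index disqualifies the thread.
quits_with_or = (x >= rows) or (y >= cols)
print(quits_with_or)    # True
```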

2. It is illegal to use `cuda.syncthreads()` in conditional code, if all the threads in the block cannot participate in that statement. The previous `return` statement in item 1 above (even if corrected from `and` to `or`) pretty much guarantees this illegal behavior for problem sizes not whole-number-divisible by the threadblock size.

Therefore, to fix these issues, we cannot use just a simple `return` statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory, if the computed global load indices (for `A` or `B`) are in-bounds (the shared indices are in-bounds, by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for `C`.
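The guarded-load idea can be prototyped on the CPU. Here is a plain NumPy/Python sketch (my own illustration, not CUDA code) that zero-fills any "tile" load past the matrix edge and only writes in-bounds outputs:

```python
import math
import numpy as np

def tiled_matmul_cpu(A, B, TPB=3):
    """Guarded tiled matmul sketch: out-of-range 'loads' become zeros,
    so partial tiles contribute nothing to the dot product."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    ntiles = math.ceil(k / TPB)   # tiles needed to cover the K dimension
    for x in range(m):
        for y in range(n):
            tmp = 0.0
            for i in range(ntiles):
                for j in range(TPB):
                    kk = i * TPB + j
                    # Guarded load: zero-fill past the matrix edge, mirroring
                    # sA[tx, ty] = 0 / sB[tx, ty] = 0 plus the if-tests in the kernel.
                    a = A[x, kk] if kk < k else 0.0
                    b = B[kk, y] if kk < k else 0.0
                    tmp += a * b
            C[x, y] = tmp
    return C

# 4x4 case with TPB=3 (not evenly divisible), as in the question:
A = np.arange(16.0).reshape(4, 4)
B = np.ones((4, 4))
print(np.allclose(tiled_matmul_cpu(A, B, TPB=3), A @ B))   # True
```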

The following code has those items fixed. It seems to work correctly for your given test case:

```
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = 0
        sB[tx, ty] = 0
        if x < A.shape[0] and (ty + i * TPB) < A.shape[1]:
            sA[tx, ty] = A[x, ty + i * TPB]
        if y < B.shape[1] and (tx + i * TPB) < B.shape[0]:
            sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp

#%%

x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$
```

(Note the use of `and` here in the bounds tests is correct. Testing whether a set of indices are in-bound is different in a boolean sense compared to testing whether a set of indices is out-of-bounds. In the in-bounds test, we require both to be in-bounds. In the out-of-bounds test, either index out-of-bounds is disqualifying).
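The two forms are exact logical complements (De Morgan's laws), which a quick exhaustive check over a hypothetical 4x4 bound confirms:

```python
rows, cols = 4, 4

# An index pair is in-bounds only if BOTH coordinates are in range (and);
# it is out-of-bounds if EITHER coordinate is out of range (or).
for x in range(8):
    for y in range(8):
        in_bounds = (x < rows) and (y < cols)
        out_of_bounds = (x >= rows) or (y >= cols)
        assert in_bounds != out_of_bounds   # exact complements
print("complement check passed")
```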

I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems: each block launches TPB x TPB threads, and 33 x 33 = 1089 exceeds the CUDA limit of 1024 threads per block. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)

As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:
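The grid-sizing rule can be checked in isolation before the full listing. A small host-side sketch with the same 5x23 and 23x7 shapes, but a hypothetical TPB of 4 so that several blocks are actually needed:

```python
import math

A_shape, B_shape = (5, 23), (23, 7)   # C would be (5, 7)
TPB = 4                               # illustrative tile size, not the real 32

# The grid must be big enough to load every element of BOTH inputs,
# even though only the (5, 7) output is ever written.
grid_y_max = max(A_shape[0], B_shape[0])   # max(5, 23) = 23
grid_x_max = max(A_shape[1], B_shape[1])   # max(23, 7) = 23

blockspergrid = (math.ceil(grid_x_max / TPB), math.ceil(grid_y_max / TPB))
print(blockspergrid)   # (6, 6): sized by the inputs, not by the (5, 7) output
```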

```
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

#%%

x_h = np.arange(115).reshape([5, 23])
y_h = np.ones([23, 7])
z_h = np.zeros([5, 7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

# TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0], y_h.shape[0])
grid_x_max = max(x_h.shape[1], y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$
```

I've also reordered the sense of `x` and `y` (and usage of `tx` and `ty`) to fix a performance issue in the above code. The same performance issue was present in the original posted doc code as well.
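The performance issue is memory coalescing: `tx` varies fastest across the threads of a warp, so it should index the fastest-varying (last) dimension of the row-major arrays. A host-side NumPy stride check (my illustration) shows the layout this relies on:

```python
import numpy as np

# Row-major (C-order) layout, the default for NumPy and Numba device arrays:
# incrementing the LAST index moves to the adjacent element in memory.
a = np.zeros((8, 8), dtype=np.float32)
row_stride, col_stride = a.strides

print(col_stride)   # 4: adjacent columns are one float32 (4 bytes) apart
print(row_stride)   # 32: adjacent rows are a whole 8-element row apart
```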

Again, no claims of defect free. Furthermore I'm sure "more optimal" code could be arrived at. However optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using `cupy` here for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.

Finally, the OP's code (and, it seems, the doc example) also had a performance issue around the setup of the `tmp` variable. Changing that to a proper 32-bit float variable makes an important performance difference.
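The effect can be seen on the host side: a Python float literal is a 64-bit double, so `tmp = 0.` forces the whole dot-product accumulation into double precision, while `float32(0.)` keeps it single precision. A NumPy analogue (a sketch of the promotion behavior, not the kernel itself):

```python
import numpy as np

values = np.ones(4, dtype=np.float32)

# Like `tmp = 0.` in the kernel: a Python float is a 64-bit double,
# so every addition is carried out (and stored) in double precision.
tmp = 0.
for v in values:
    tmp = tmp + float(v)
print(type(tmp).__name__)   # float (i.e. a 64-bit double)

# Like `tmp = float32(0.)` in the fixed kernel: the accumulation stays 32-bit.
tmp32 = np.float32(0.)
for v in values:
    tmp32 = tmp32 + v
print(tmp32.dtype)   # float32
```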
