# Why can't I get the right sum of a 1D array with numba (cuda python)?

### Problem description

I am trying to use CUDA Python with numba. The code below calculates the sum of a 1D array, but I don't know how to get a single result value rather than three values.

python3.5 with numba + CUDA8.0

``````
import os,sys,time
import pandas as pd
import numpy as np
from numba import cuda, float32

os.environ['NUMBAPRO_NVVM']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = (1,1)
tpb = (1,3)

@cuda.jit
def calcu_sum(D,T):
    ty = cuda.threadIdx.y   # index of this thread within the block
    bh = cuda.blockDim.y    # number of threads in the block (3 here)
    index_i = ty
    L = len(D)
    su = 0
    # each thread sums every bh-th element, starting at its own index
    while index_i<L:
        su +=D[index_i]
        index_i +=bh
    print('su:',su)
    T[0,0]=su
    print('T:',T[0,0])

D = np.array([ 0.42487645,0.41607881,0.42027071,0.43751907,0.43512794,0.43656972,
               0.43940639,0.43864551,0.43447691,0.43120232], dtype=np.float32)
T = np.empty([1,1])
print('D: ',D)

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dT= cuda.to_device(T, stream)
    calcu_sum[bpg, tpb, stream](dD,dT)
``````

``````
D:  [ 0.42487645  0.41607881  0.42027071  0.43751907  0.43512794  0.43656972
0.43940639  0.43864551  0.43447691  0.43120232]
su:  1.733004
su:  1.289852
su:  1.291317
T: 1.733004
T: 1.289852
T: 1.291317
``````

Why do I get "1.733004 1.289852 1.291317" rather than the single output "4.31417383"? Note that 1.733004 + 1.289852 + 1.291317 = 4.314173.

I'm new to numba and have read the numba documentation, but I don't know how to do this. Can someone give advice?

### Recommended answer

The reason you don't get the sum you expect is because you haven't written code to produce that sum.

The basic CUDA programming model (whether you use CUDA C, Fortran or Python as your language) is that you write kernel code which is executed by each thread. You have written code for each thread to read and sum part of the input array. You have not written any code for those threads to share and sum their individual partial sums into a final sum.
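To see where the three values come from, here is a rough host-side sketch in plain Python (no GPU, no numba) that mimics the per-thread loop from the question. The function name `simulate_partial_sums` is made up for illustration, and the outer loop over `ty` only stands in for the three threads, which on a real device run concurrently:

```python
# Host-side simulation of the question's kernel: each "thread" walks the
# array with a stride equal to the block size and keeps its own private sum.
D = [0.42487645, 0.41607881, 0.42027071, 0.43751907, 0.43512794,
     0.43656972, 0.43940639, 0.43864551, 0.43447691, 0.43120232]

def simulate_partial_sums(data, num_threads):
    """Return the value each simulated thread ends up holding in `su`."""
    partials = []
    for ty in range(num_threads):       # ty stands in for cuda.threadIdx.y
        su = 0.0
        index_i = ty
        while index_i < len(data):
            su += data[index_i]
            index_i += num_threads      # num_threads stands in for cuda.blockDim.y
        partials.append(su)
    return partials

partials = simulate_partial_sums(D, 3)
for su in partials:
    print('su: %.6f' % su)

# The step the kernel never performs: combining the per-thread sums.
print('total: %.6f' % sum(partials))
```

Each simulated thread ends up with one partial sum, which is what the kernel prints. The final `sum(partials)` line is the combining step that the kernel in the question has no equivalent of.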

There is an extremely well described algorithm for doing this: it is called a parallel reduction. You can find an introduction to the algorithm in a PDF which ships with the examples in every version of the CUDA toolkit, or download a presentation about it here. You can also read about a more modern version of the algorithm, which uses newer features of CUDA (warp shuffle instructions and atomic transactions), here.
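As a rough illustration of the idea, the sketch below simulates a tree-style reduction sequentially in plain Python. It is not GPU code and it is not taken from the material mentioned above. `tree_reduce` and `sbuf` are hypothetical names, and padding to a power of two is just one simple way to handle arbitrary lengths:

```python
def tree_reduce(values):
    """Sum `values` step by step, the way a shared-memory tree reduction would."""
    # Pad to a power of two so the active range can be halved at every step.
    n = 1
    while n < len(values):
        n *= 2
    sbuf = list(values) + [0.0] * (n - len(values))  # stands in for shared memory

    stride = n // 2
    while stride > 0:
        # On a GPU, each of these additions would be done by a separate thread
        # (those with tid < stride), and a barrier such as cuda.syncthreads()
        # in Numba would separate one step from the next.
        for tid in range(stride):
            sbuf[tid] += sbuf[tid + stride]
        stride //= 2
    return sbuf[0]

# The three per-thread sums printed in the question.
partials = [1.733004, 1.289852, 1.291317]
print(tree_reduce(partials))
```

The point of the halving pattern is that N values are combined in about log2(N) steps instead of N, at the cost of needing a synchronization point between steps.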

After you have studied the reduction algorithm, you will need to adapt the standard CUDA C kernel code into the Numba Python kernel dialect. At the bare minimum, something like this:

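``````
tpb = (1,3)

@cuda.jit
def calcu_sum(D,T):

    ty = cuda.threadIdx.y   # index of this thread within the block
    bh = cuda.blockDim.y    # number of threads in the block
    index_i = ty
    # one shared-memory slot per thread for its partial sum
    sbuf = cuda.shared.array(tpb, float32)

    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i +=bh

    print('su:',su)

    sbuf[0,ty] = su
    # wait until every thread has stored its partial sum
    cuda.syncthreads()

    # a single thread adds up the partial sums
    if ty == 0:
        T[0,0] = 0
        for i in range(0, bh):
            T[0,0] += sbuf[0,i]
        print('T:',T[0,0])
``````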

will probably do what you want, although it is still a long way from an optimal parallel shared memory reduction, as you will see when you read the material I provided links to.
