Benchmarking (python vs. c++ using BLAS) and (numpy)

I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionalities. Since performance is an issue, I did some benchmarking and would like to know whether the approach I took is legitimate.

I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:

  1. Numpy, making use only of the functionality of dot.
  2. Python, calling the BLAS functionalities through a shared object.
  3. C++, calling the BLAS functionalities through a shared object.

Scenario

I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5, and the matrices m1 and m2 are set up like this:

m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)

1. Numpy

The code used looks like this:

tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))
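
Pieced together, the numpy leg of the benchmark might look like the following self-contained sketch. The loop over i, the 20 repetitions, and the rNumpy list mirror the description above; everything else is filled in as an assumption rather than taken from the original script:

import timeit
import numpy

rNumpy = []
for i in range(5, 505, 5):
    # Fresh random float32 matrices for every dimension, as described above.
    m1 = numpy.random.rand(i, i).astype(numpy.float32)
    m2 = numpy.random.rand(i, i).astype(numpy.float32)

    tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
    # 20 repetitions of a single numpy.dot call each, as in the snippet above.
    rNumpy.append((i, tNumpy.repeat(20, 1)))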

2. Python, calling BLAS through a shared object

With the function

import ctypes
from ctypes import byref, c_char, c_int, c_float

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):

    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

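    # Fortran BLAS routines take every argument by reference: byref() for the
    # scalars, raw data pointers for the matrices.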
    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n), 
            byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n), 
            m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero), 
            r.ctypes.data_as(ctypes.c_void_p), byref(n))

the test code looks like this:

r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))
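
As an aside, when SciPy is available the same sgemm routine can be reached without manual ctypes marshalling through scipy.linalg.blas. This is only a sketch of an alternative, not part of the original benchmark, and it assumes SciPy is installed:

import numpy
from scipy.linalg import blas

i = 500
m1 = numpy.random.rand(i, i).astype(numpy.float32)
m2 = numpy.random.rand(i, i).astype(numpy.float32)

# sgemm computes alpha * m1 * m2; alpha = 1.0 gives the plain matrix product.
r = blas.sgemm(1.0, m1, m2)

# Agrees with numpy.dot up to float32 rounding.
assert numpy.allclose(r, numpy.dot(m1, m2), rtol=1e-4)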

3. C++, calling BLAS through a shared object

Now the C++ code is naturally a little longer, so I reduce the information to a minimum. I load the function with

void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");
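// Note: Func is still a void* here; the step omitted for brevity casts it to a
// function pointer (called f below) with sgemm_'s Fortran calling signature.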

I measure the time with gettimeofday like this:

gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);

where j is the index of a loop that runs 20 times. I calculate the elapsed time with

double CalcTime(timeval start, timeval end)
{
    // Returns the elapsed time in seconds (microsecond resolution).
    double factor = 1000000;
    return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}

Results

The result is shown in the plot below:

Questions

  1. Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
  2. Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
  3. Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Download

The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)

Solution

I've run your benchmark. There is no difference between C++ and numpy on my machine.

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?

It seems fair, since there is no difference in the results.

Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.

No.

Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Make sure that numpy uses an optimized version of the BLAS/LAPACK libraries on your system.
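
A quick way to check which BLAS/LAPACK implementation numpy was built against is numpy's own configuration report (the exact output format varies between numpy versions):

import numpy

# Prints the BLAS/LAPACK libraries numpy is linked against
# (e.g. OpenBLAS, MKL, ATLAS, or the unoptimized reference implementation).
numpy.show_config()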
