为什么我的程序在循环 8192 个元素时很慢?

Why is my program slow when looping over exactly 8192 elements?(为什么我的程序在循环 8192 个元素时很慢?)
本文介绍了为什么我的程序在循环 8192 个元素时很慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着跟版网的小编来一起学习吧!

问题描述

这是有关程序的摘录.矩阵 img[][] 的大小为 SIZE×SIZE,初始化为:

Here is the extract from the program in question. The matrix img[][] has the size SIZE×SIZE, and is initialized at:

img[j][i] = 2 * j + i

然后,你创建一个矩阵res[][],这里的每个字段都是img矩阵中围绕它的9个字段的平均值.为简单起见,边框保留为 0.

Then, you make a matrix res[][], and each field in here is made to be the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}

这就是程序的全部内容.为了完整起见,这里是之前的内容.后面没有代码.如您所见,这只是初始化.

That's all there's to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++) 
    for(j=0;j<SIZE;j++) 
        img[j][i] = (2*j+i)%8196;

基本上,当 SIZE 是 2048 的倍数时,这个程序会很慢,例如执行时间:

Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

编译器是 GCC.据我所知,这是因为内存管理,但我对这个主题并不太了解,这就是我在这里问的原因.

The compiler is GCC. From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

还有如何解决这个问题会很好,但如果有人能解释这些执行时间,我就已经很高兴了.

Also how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.

我已经知道 malloc/free,但问题不在于使用的内存量,而只是执行时间,所以我不知道这会有什么帮助.

I already know of malloc/free, but the problem is not amount of memory used, it's merely execution time, so I don't know how that would help.

推荐答案

不同之处是由以下相关问题的同一个超对齐问题引起的:

The difference is caused by the same super-alignment issue from the following related questions:

  • 为什么转置 512x512 矩阵比转置 513x513 矩阵慢得多?
  • 矩阵乘法:矩阵大小差异小,时序差异大

但这只是因为代码还有一个问题.

But that's only because there's one other problem with the code.

从原始循环开始:

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}

首先注意两个内部循环是微不足道的.它们可以展开如下:

First notice that the two inner loops are trivial. They can be unrolled as follows:

for(i=1;i<SIZE-1;i++) {
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

这样就留下了我们感兴趣的两个外循环.

So that leaves the two outer-loops that we're interested in.

现在我们可以看到这个问题的问题是相同的:为什么循环的顺序会在迭代二维时影响性能数组?

Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?

您正在按列而不是按行迭代矩阵.

You are iterating the matrix column-wise instead of row-wise.

要解决这个问题,您应该将两个循环互换.

To solve this problem, you should interchange the two loops.

for(j=1;j<SIZE-1;j++) {
    for(i=1;i<SIZE-1;i++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

这完全消除了所有非顺序访问,因此您不再会在大的 2 次幂上随机减速.

This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.

酷睿 i7 920 @ 3.5 GHz

原始代码:

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds

交换外循环:

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds

这篇关于为什么我的程序在循环 8192 个元素时很慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!

本站部分内容来源互联网,如果有图片或者内容侵犯您的权益请联系我们删除!

相关文档推荐

Why does C++ compilation take so long?(为什么 C++ 编译需要这么长时间?)
C++ performance challenge: integer to std::string conversion(C++ 性能挑战:整数到 std::string 的转换)
Fast textfile reading in c++(在 C++ 中快速读取文本文件)
Is it better to use std::memcpy() or std::copy() in terms to performance?(就性能而言,使用 std::memcpy() 或 std::copy() 更好吗?)
Does the C++ standard mandate poor performance for iostreams, or am I just dealing with a poor implementation?(C++ 标准是否要求 iostreams 性能不佳,或者我只是在处理一个糟糕的实现?)
Is there any advantage of using map over unordered_map in case of trivial keys?(在关键的情况下,使用 map 比 unordered_map 有什么优势吗?)