
building the code in the bug report as a 64-bit binary, plus various system information: http://gist.github.com/483494

and

testing harness, plus scripts to build and run it: http://gist.github.com/483524 -- yes, i was too lazy to write a makefile.

you will still need some command-line fu to split the results into separate files so you can load them into whatever maths program you want.
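
For the splitting step, a minimal Python sketch could look like the following. The output format of the harness is not shown in this thread, so this assumes each result line starts with a test label (for example `test1`) followed by the measurement; `split_by_label` and `write_groups` are hypothetical helper names, not part of the harness.

```python
from collections import defaultdict
from pathlib import Path


def split_by_label(lines):
    """Group result lines by their first whitespace-separated token.

    ASSUMPTION: every result line looks like "<label> <value...>".
    Adjust the parsing if the real harness prints something else.
    """
    groups = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # skip blank lines and lines with a label but no data
        label, rest = parts
        groups[label].append(rest.strip())
    return dict(groups)


def write_groups(groups, out_dir):
    """Write each group to <out_dir>/<label>.txt, one measurement per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for label, values in groups.items():
        (out / f"{label}.txt").write_text("\n".join(values) + "\n")
```

Feeding it the captured harness output (say, `open("results.txt")`, a hypothetical file name) and calling `write_groups(groups, "out")` would leave one file per test that most maths programs can load directly.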



Thanks for the information.

Your microbenchmark appears to be alignment sensitive. With your assembly code on my machine (a 2.66 GHz Core 2 Quad), running 250 tests, I get:

  test1 usecs: avg 1.16759e+06 stddev 13917.6
  test2 usecs: avg 1.31382e+06 stddev 405.725
which are similar results to yours.

But if you add .align 8 right before the definition of test2 in the assembly file (i.e. make it 8-byte aligned, just like test1), I get the following numbers:

  test1 usecs: avg 1.15972e+06 stddev 17004
  test2 usecs: avg 1.1264e+06 stddev 754.44
so the code that "doesn't use frame pointers" is actually slightly faster, as you might expect.
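
To put the two sets of averages side by side, here is a small Python sketch that computes how far test2 is from test1 in each run. The averages are copied from the numbers quoted above; `relative_difference` is just an illustrative helper name.

```python
def relative_difference(baseline, other):
    """Fractional change of `other` relative to `baseline`."""
    return (other - baseline) / baseline


# Averages in microseconds, copied from the two runs quoted above.
unaligned = {"test1": 1.16759e6, "test2": 1.31382e6}
aligned8 = {"test1": 1.15972e6, "test2": 1.1264e6}

for name, run in (("test2 as posted", unaligned), ("test2 with .align 8", aligned8)):
    diff = relative_difference(run["test1"], run["test2"])
    print(f"{name}: {diff:+.1%} vs test1")
```

That works out to test2 being about 12.5% slower than test1 as posted, and about 2.9% faster once it is 8-byte aligned.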

Additionally, if I simply modify your testcase to use 16-byte alignment, rather than 8-byte alignment, I get the following numbers:

  test1 usecs: avg 1.15895e+06 stddev 15764.7
  test2 usecs: avg 1.12657e+06 stddev 941.606
I think aligning both test functions to 8 bytes at least makes things fair, but you can see that minor changes in alignment can cause big changes in the timings.

You can see the assembly sources I used: http://gist.github.com/483840

FWIW, the code that uses movs rather than pushes and pops ought to be faster since (generally speaking for larger prologues and epilogues) you can execute a series of movs in parallel, whereas your pushes and pops are serialized, since they're all updating a common resource (the stack pointer). Empirical testing on benchmarks like SPEC2k has borne this out, both on x86 and x86-64. (You ought to be able to see this effect with gcc, depending on what cpu you use for the -mtune switch.) As you noted, this strategy carries a size penalty, since movs are somewhat larger than pushes and pops.

I'll also note that on my machine, with gcc saying it's:

  @nightcrawler:~$ gcc --version
  gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
I get identical assembly when compiling the testcase from the PR with and without -fomit-frame-pointer. (I should have noted the gcc version I was using, just as you did. My bad.) Furthermore, for:

  @nightcrawler:~$ gcc-4.3 --version
  gcc-4.3 (Ubuntu 4.3.4-10ubuntu1) 4.3.4
I also get identical assembly. On one of the servers at work, with:

  @nightcrawler:~$ ssh henry7 gcc --version
  gcc (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
I get identical assembly. Finally, also at work, with:

  @nightcrawler:~$ ssh henry7 /usr/local/tools/gcc-4.3.3/bin/i686-pc-linux-gnu-gcc --version
  i686-pc-linux-gnu-gcc (Sourcery G++ 4.3-83) 4.3.2
which is a somewhat patched version of GCC circa 4.3.2, I get identical assembly. So with four different flavors of GCC, there's no difference on the testcase in the PR with and without -fomit-frame-pointer. I'd be willing to bet that there are no differences with 4.5.x and mainline GCC as well. It looks like Debian may just have a peculiar set of patches to its version of GCC.
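
If anyone wants to repeat the with/without comparison on their own compiler, a rough Python sketch is below. It assumes `gcc` is on the PATH and Python 3.7 or later; `compile_to_asm` and `asm_diff` are hypothetical helper names, and any other flags needed to match the PR (optimization level and so on) would have to be passed in `extra_flags`.

```python
import difflib
import subprocess


def compile_to_asm(source, extra_flags=()):
    """Return the assembly gcc emits for `source` (gcc -S, written to stdout)."""
    cmd = ["gcc", "-S", "-o", "-", *extra_flags, source]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def asm_diff(asm_a, asm_b):
    """Return unified-diff lines between two listings.

    Blank lines and trailing whitespace are ignored; an empty list
    means the two listings are identical.
    """
    a = [line.rstrip() for line in asm_a.splitlines() if line.strip()]
    b = [line.rstrip() for line in asm_b.splitlines() if line.strip()]
    return list(difflib.unified_diff(a, b, "without", "with", lineterm=""))
```

With the testcase saved as, say, `testcase.c` (a placeholder name), `asm_diff(compile_to_asm("testcase.c"), compile_to_asm("testcase.c", ["-fomit-frame-pointer"]))` returning an empty list corresponds to the "identical assembly" result described above.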

EDIT: formatting fixes.



