Testing harness, plus scripts to build and run it: http://gist.github.com/483524 (yes, I was too lazy to write a Makefile).
You will still need some command-line fu to split the results into separate files so you can load them into whatever maths program you want.
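In case the command-line fu is the sticking point, here is a rough sketch of the splitting step in Python. It assumes each result line looks like `<label> <value>` (for example `test1 1234`); that format is my guess, not something the harness is known to print, so adjust the parsing to whatever the harness really emits. The function names are made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def group_by_label(lines):
    """Group '<label> <value>' lines by label, keeping input order.

    The '<label> <value>' layout is an assumption about the harness output.
    """
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        groups[parts[0]].append(parts[1])
    return dict(groups)


def write_groups(groups, out_dir):
    """Write one '<label>.txt' file per label, one value per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for label, values in groups.items():
        path = out / f"{label}.txt"
        path.write_text("\n".join(values) + "\n")
        paths[label] = path
    return paths
```

Feeding it `sys.stdin` (with the harness output piped in) and an output directory gives one plain column of numbers per test, which most maths programs will load directly.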
But if I add .align 8 right before the definition of test2 in the assembly file (i.e. make it 8-byte aligned, just like test1), I get the following numbers:
FWIW, the code that uses movs rather than pushes and pops ought to be faster: generally speaking, for larger prologues and epilogues, a series of movs can execute in parallel, whereas pushes and pops are serialized because they all update a common resource (the stack pointer). Empirical testing on benchmarks like SPEC2k has borne this out, both on x86 and x86-64. (You ought to be able to see this effect with gcc, depending on which cpu you pass to the -mtune switch.) As you noted, this strategy carries a size penalty, since movs are somewhat larger than pushes and pops.
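To make the serialization argument concrete, here is a toy dependency model in Python. It is only a sketch of the reasoning, not a performance prediction: every instruction is assumed to take one step, only read-after-write dependencies on registers are counted, and it ignores issue width, store ports, and the stack-tracking hardware that many x86 cores use to cheapen push/pop. All names here are invented for the example.

```python
def critical_path(instrs):
    """Longest chain of true (read-after-write) dependencies, in steps.

    Each instruction is a (reads, writes) pair of register-name sets and
    is assumed to take one step. Stores to distinct stack slots are
    treated as independent of each other.
    """
    ready = {}  # register -> step at which its latest value is available
    longest = 0
    for reads, writes in instrs:
        start = max((ready.get(r, 0) for r in reads), default=0)
        finish = start + 1
        for w in writes:
            ready[w] = finish
        longest = max(longest, finish)
    return longest


def push_prologue(regs):
    # push %reg: reads rsp and reg, writes the decremented rsp
    return [({"rsp", r}, {"rsp"}) for r in regs]


def mov_prologue(regs):
    # sub $N, %rsp once, then mov %reg, k(%rsp) for each saved register
    adjust = [({"rsp"}, {"rsp"})]
    saves = [({"rsp", r}, set()) for r in regs]
    return adjust + saves


saved = ["rbx", "r12", "r13", "r14", "r15"]
print(critical_path(push_prologue(saved)))  # 5
print(critical_path(mov_prologue(saved)))   # 2
```

In this model the push chain grows with the number of saved registers, while the mov version stays at two steps (one stack adjustment, then all the stores at once), which is the effect described above.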
I'll also note that on my machine, with gcc saying it's:
I get identical assembly for compiling the testcase from the PR with and without -fomit-frame-pointer (I should have noted the gcc version I was using, just as you did. My bad.) Furthermore, for:
which is a somewhat patched version of GCC circa 4.3.2, I get identical assembly. So with four different flavors of GCC, there's no difference on the testcase in the PR with and without -fomit-frame-pointer. I'd be willing to bet that there are no differences with 4.5.x and mainline GCC either. It looks like Debian may just have a peculiar set of patches in its version of GCC.
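For anyone who wants to repeat the comparison with their own GCC, something along these lines should do it. This is a sketch with assumptions: the source file name you pass in and the `-O2` base flag are placeholders (the flags used for the PR testcase are not stated here), and the helper names are made up. It relies on `gcc -S ... -o -` writing the assembly to stdout.

```python
import difflib
import subprocess


def compile_to_asm(source, flags, cc="gcc"):
    """Return the assembly the compiler emits for `source` with `flags`."""
    cmd = [cc, "-S", *flags, source, "-o", "-"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def asm_diff(asm_a, asm_b):
    """Unified diff of two assembly listings; an empty list means identical."""
    return list(difflib.unified_diff(asm_a.splitlines(), asm_b.splitlines(),
                                     "without", "with", lineterm=""))


def frame_pointer_diff(source, base_flags=("-O2",), cc="gcc"):
    """Diff the output with and without -fomit-frame-pointer.

    base_flags defaults to -O2 purely as a placeholder assumption.
    """
    without = compile_to_asm(source, list(base_flags), cc)
    with_flag = compile_to_asm(source, [*base_flags, "-fomit-frame-pointer"], cc)
    return asm_diff(without, with_flag)
```

An empty result from `frame_pointer_diff` on the testcase corresponds to the "identical assembly" outcome reported above; passing a different `cc` lets you check several GCC builds in turn.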
and