c - gcc simple arithmetics loop performance -
c - gcc simple arithmetics loop performance -
the problem: 1 evidently line of code speeds programme twice.
this rather hard formulate original problem, comes bounds check elimination algorithm. so, simple test can't understand.
one evidently line of code leads speedup programme twice.
there next source:
#include <stdlib.h> #include <stdio.h> int main(void) { long = 0, = 0, x = 0; int = 200000000; int *values = malloc(sizeof(int)*up); (i = 0; < ; ++i) { values[i]=i % 2; } (i = 0; < ; ++i) { x = (a & i); #ifdef fast x = 0; #endif += values[x]; } printf ("a=%ld\n", a); homecoming 0; }/*main*/
in illustration value of 'a' 0. line x = 0; extra.
but, (look -- no optimizations!) $gcc -o0 -o short short.c && time ./short a=0 real 0m2.808s user 0m2.196s sys 0m0.596s $gcc -o0 -dfast -o short short.c && time ./short a=0 real 0m1.869s user 0m1.260s sys 0m0.608s and, reproducible on many compilers / optimization options , programme variations.
moreover, result in same assembler code except of putting stupid 0 register! e.g.:
gcc -s -o0 -dfast short.c && mv short.s shortfast.s gcc -s -o0 short.c && mv short.s shortslow.s diff shortfast.s shortslow.s 55d54 < movq $0, -24(%rbp)
and, bit after -- same effect on (all able test) other compilers / languages (including java jit). thing shared -- x86-64 architecture. tested on both intel , amd processors...
short answer: storing 0 eliminates read-after-write dependency in 1 of loops.
details:
i thought interesting question, , although focused on o0 optimization level, same speedup seen @ o3 well. looking @ o0 makes easier focus on processor doing optimize code rather compiler, because noted resulting assembly code differs 1 instruction.
the assembly code loop of involvement shown below
movq $0, -32(%rbp) jmp .l4 .l5: movq -32(%rbp), %rax movq -24(%rbp), %rdx andq %rdx, %rax movq %rax, -16(%rbp) movq $0, -16(%rbp) ;; instruction in fast not slow movq -16(%rbp), %rax leaq 0(,%rax,4), %rdx movq -8(%rbp), %rax addq %rdx, %rax movl (%rax), %eax cltq addq %rax, -24(%rbp) addq $1, -32(%rbp) .l4: movl -36(%rbp), %eax cltq cmpq -32(%rbp), %rax jg .l5
running perf stat
on scheme next results:
results slow code
performance counter stats './slow_o0': 1827.438670 task-clock # 0.999 cpus utilized 155 context-switches # 0.085 k/sec 1 cpu-migrations # 0.001 k/sec 195,448 page-faults # 0.107 m/sec 6,675,246,466 cycles # 3.653 ghz 4,391,690,661 stalled-cycles-frontend # 65.79% frontend cycles idle 1,609,321,845 stalled-cycles-backend # 24.11% backend cycles idle 7,157,837,211 instructions # 1.07 insns per cycle # 0.61 stalled cycles per insn 490,110,757 branches # 268.195 m/sec 178,287 branch-misses # 0.04% of branches 1.829712061 seconds time elapsed
results fast code
performance counter stats './fast_o0': 1109.451910 task-clock # 0.998 cpus utilized 95 context-switches # 0.086 k/sec 1 cpu-migrations # 0.001 k/sec 195,448 page-faults # 0.176 m/sec 4,067,613,078 cycles # 3.666 ghz 1,784,131,209 stalled-cycles-frontend # 43.86% frontend cycles idle 438,447,105 stalled-cycles-backend # 10.78% backend cycles idle 7,356,892,998 instructions # 1.81 insns per cycle # 0.24 stalled cycles per insn 489,945,197 branches # 441.610 m/sec 176,136 branch-misses # 0.04% of branches 1.111398442 seconds time elapsed
so can see though "fast" code executes more instructions has fewer stalls. when out-of-order cpu (like x64 architectures) executing code keeps track of dependencies between instructions. waiting instruction can bypassed instruction if operands ready.
in illustration critical point instruction sequence:
andq %rdx, %rax movq %rax, -16(%rbp) movq $0, -16(%rbp) ;; instruction in fast not slow movq -16(%rbp), %rax leaq 0(,%rax,4), %rdx movq -8(%rbp), %rax
in fast code movq -8(%rbp), %rax
instruction result movq $0, -16(%rbp)
forwarded , able execute sooner. whereas slower version have wait movq %rax, -16(%rbp)
has more dependencies between loop iterations.
note without knowing more specific microarchitecture analysis simplistic. suspect underlying cause dependency , performing store of 0 (the movq $0, -16(%rbp)
instruction) allows cpu perform more aggressive speculation when executing code sequence.
c gcc
Comments
Post a Comment