c - gcc simple arithmetics loop performance -



c - gcc simple arithmetics loop performance -

the problem: 1 evidently line of code speeds programme twice.

this rather hard formulate original problem, comes bounds check elimination algorithm. so, simple test can't understand.

one evidently line of code leads speedup programme twice.

there next source:

#include <stdlib.h> #include <stdio.h> int main(void) { long = 0, = 0, x = 0; int = 200000000; int *values = malloc(sizeof(int)*up); (i = 0; < ; ++i) { values[i]=i % 2; } (i = 0; < ; ++i) { x = (a & i); #ifdef fast x = 0; #endif += values[x]; } printf ("a=%ld\n", a); homecoming 0; }/*main*/

in illustration value of 'a' 0. line x = 0; extra.

but, (look -- no optimizations!) $gcc -o0 -o short short.c && time ./short a=0 real 0m2.808s user 0m2.196s sys 0m0.596s $gcc -o0 -dfast -o short short.c && time ./short a=0 real 0m1.869s user 0m1.260s sys 0m0.608s and, reproducible on many compilers / optimization options , programme variations.

moreover, result in same assembler code except of putting stupid 0 register! e.g.:

gcc -s -o0 -dfast short.c && mv short.s shortfast.s gcc -s -o0 short.c && mv short.s shortslow.s diff shortfast.s shortslow.s 55d54 < movq $0, -24(%rbp)

and, bit after -- same effect on (all able test) other compilers / languages (including java jit). thing shared -- x86-64 architecture. tested on both intel , amd processors...

short answer: storing 0 eliminates read-after-write dependency in 1 of loops.

details:

i thought interesting question, , although focused on o0 optimization level, same speedup seen @ o3 well. looking @ o0 makes easier focus on processor doing optimize code rather compiler, because noted resulting assembly code differs 1 instruction.

the assembly code loop of involvement shown below

movq $0, -32(%rbp) jmp .l4 .l5: movq -32(%rbp), %rax movq -24(%rbp), %rdx andq %rdx, %rax movq %rax, -16(%rbp) movq $0, -16(%rbp) ;; instruction in fast not slow movq -16(%rbp), %rax leaq 0(,%rax,4), %rdx movq -8(%rbp), %rax addq %rdx, %rax movl (%rax), %eax cltq addq %rax, -24(%rbp) addq $1, -32(%rbp) .l4: movl -36(%rbp), %eax cltq cmpq -32(%rbp), %rax jg .l5

running perf stat on scheme next results:

results slow code

performance counter stats './slow_o0': 1827.438670 task-clock # 0.999 cpus utilized 155 context-switches # 0.085 k/sec 1 cpu-migrations # 0.001 k/sec 195,448 page-faults # 0.107 m/sec 6,675,246,466 cycles # 3.653 ghz 4,391,690,661 stalled-cycles-frontend # 65.79% frontend cycles idle 1,609,321,845 stalled-cycles-backend # 24.11% backend cycles idle 7,157,837,211 instructions # 1.07 insns per cycle # 0.61 stalled cycles per insn 490,110,757 branches # 268.195 m/sec 178,287 branch-misses # 0.04% of branches 1.829712061 seconds time elapsed

results fast code

performance counter stats './fast_o0': 1109.451910 task-clock # 0.998 cpus utilized 95 context-switches # 0.086 k/sec 1 cpu-migrations # 0.001 k/sec 195,448 page-faults # 0.176 m/sec 4,067,613,078 cycles # 3.666 ghz 1,784,131,209 stalled-cycles-frontend # 43.86% frontend cycles idle 438,447,105 stalled-cycles-backend # 10.78% backend cycles idle 7,356,892,998 instructions # 1.81 insns per cycle # 0.24 stalled cycles per insn 489,945,197 branches # 441.610 m/sec 176,136 branch-misses # 0.04% of branches 1.111398442 seconds time elapsed

so can see though "fast" code executes more instructions has fewer stalls. when out-of-order cpu (like x64 architectures) executing code keeps track of dependencies between instructions. waiting instruction can bypassed instruction if operands ready.

in illustration critical point instruction sequence:

andq %rdx, %rax movq %rax, -16(%rbp) movq $0, -16(%rbp) ;; instruction in fast not slow movq -16(%rbp), %rax leaq 0(,%rax,4), %rdx movq -8(%rbp), %rax

in fast code movq -8(%rbp), %rax instruction result movq $0, -16(%rbp) forwarded , able execute sooner. whereas slower version have wait movq %rax, -16(%rbp) has more dependencies between loop iterations.

note without knowing more specific microarchitecture analysis simplistic. suspect underlying cause dependency , performing store of 0 (the movq $0, -16(%rbp) instruction) allows cpu perform more aggressive speculation when executing code sequence.

c gcc

Comments

Popular posts from this blog

Delphi change the assembly code of a running process -

json - Hibernate and Jackson (java.lang.IllegalStateException: Cannot call sendError() after the response has been committed) -

C++ 11 "class" keyword -