c - gcc simple arithmetics loop performance -

the problem: 1 evidently line of code speeds programme twice.

this rather hard formulate original problem, comes bounds check elimination algorithm. so, simple test can't understand.

one evidently line of code leads speedup programme twice.

there next source:

#include <stdlib.h> #include <stdio.h>  int main(void) {    long = 0, = 0, x = 0;    int  = 200000000;     int *values = malloc(sizeof(int)*up);     (i = 0; < ; ++i)    {         values[i]=i % 2;    }    (i = 0; <  ; ++i)    {       x  =  (a & i); #ifdef fast       x = 0; #endif       += values[x];    }    printf ("a=%ld\n", a);     homecoming 0; }/*main*/

in illustration value of 'a' 0. line x = 0; extra.

but, (look -- no optimizations!) $gcc -o0 -o short short.c && time ./short a=0 real 0m2.808s user 0m2.196s sys 0m0.596s $gcc -o0 -dfast -o short short.c && time ./short a=0 real 0m1.869s user 0m1.260s sys 0m0.608s and, reproducible on many compilers / optimization options , programme variations.

moreover, result in same assembler code except of putting stupid 0 register! e.g.:

gcc -s -o0 -dfast short.c && mv short.s shortfast.s gcc -s -o0 short.c && mv short.s shortslow.s diff shortfast.s shortslow.s 55d54 < movq $0, -24(%rbp)

and, bit after -- same effect on (all able test) other compilers / languages (including java jit). thing shared -- x86-64 architecture. tested on both intel , amd processors...

short answer: storing 0 eliminates read-after-write dependency in 1 of loops.

details:

i thought interesting question, , although focused on o0 optimization level, same speedup seen @ o3 well. looking @ o0 makes easier focus on processor doing optimize code rather compiler, because noted resulting assembly code differs 1 instruction.

the assembly code loop of involvement shown below

  movq  $0, -32(%rbp)   jmp .l4 .l5:        movq  -32(%rbp), %rax   movq  -24(%rbp), %rdx   andq  %rdx, %rax        movq  %rax, -16(%rbp)   movq  $0, -16(%rbp)     ;; instruction in fast not slow   movq  -16(%rbp), %rax   leaq  0(,%rax,4), %rdx   movq  -8(%rbp), %rax     addq  %rdx, %rax         movl  (%rax), %eax       cltq                     addq  %rax, -24(%rbp)    addq  $1, -32(%rbp)  .l4:                       movl  -36(%rbp), %eax    cltq                     cmpq  -32(%rbp), %rax    jg  .l5

running perf stat on scheme next results:

results slow code

performance counter stats './slow_o0':     1827.438670 task-clock                #    0.999 cpus utilized                      155 context-switches          #    0.085 k/sec                                1 cpu-migrations            #    0.001 k/sec                          195,448 page-faults               #    0.107 m/sec                    6,675,246,466 cycles                    #    3.653 ghz                      4,391,690,661 stalled-cycles-frontend   #   65.79% frontend cycles idle     1,609,321,845 stalled-cycles-backend    #   24.11% backend  cycles idle     7,157,837,211 instructions              #    1.07  insns per cycle                                                  #    0.61  stalled cycles per insn    490,110,757 branches                  #  268.195 m/sec                          178,287 branch-misses             #    0.04% of branches             1.829712061 seconds time elapsed

results fast code

 performance counter stats './fast_o0':     1109.451910 task-clock                #    0.998 cpus utilized                       95 context-switches          #    0.086 k/sec                                1 cpu-migrations            #    0.001 k/sec                          195,448 page-faults               #    0.176 m/sec                    4,067,613,078 cycles                    #    3.666 ghz                      1,784,131,209 stalled-cycles-frontend   #   43.86% frontend cycles idle       438,447,105 stalled-cycles-backend    #   10.78% backend  cycles idle     7,356,892,998 instructions              #    1.81  insns per cycle                                                  #    0.24  stalled cycles per insn    489,945,197 branches                  #  441.610 m/sec                          176,136 branch-misses             #    0.04% of branches             1.111398442 seconds time elapsed

so can see though "fast" code executes more instructions has fewer stalls. when out-of-order cpu (like x64 architectures) executing code keeps track of dependencies between instructions. waiting instruction can bypassed instruction if operands ready.

in illustration critical point instruction sequence:

  andq  %rdx, %rax   movq  %rax, -16(%rbp)   movq  $0, -16(%rbp)     ;; instruction in fast not slow   movq  -16(%rbp), %rax     leaq  0(,%rax,4), %rdx   movq  -8(%rbp), %rax

in fast code movq -8(%rbp), %rax instruction result movq $0, -16(%rbp) forwarded , able execute sooner. whereas slower version have wait movq %rax, -16(%rbp) has more dependencies between loop iterations.

note without knowing more specific microarchitecture analysis simplistic. suspect underlying cause dependency , performing store of 0 (the movq $0, -16(%rbp) instruction) allows cpu perform more aggressive speculation when executing code sequence.

c gcc

Search This Blog

Jaimee

c - gcc simple arithmetics loop performance -

Comments

Post a Comment

Popular posts from this blog

c - Compilation of a code: unkown type name string -

java - Bypassing "final local variable defined in an enclosing type" -

json - Hibernate and Jackson (java.lang.IllegalStateException: Cannot call sendError() after the response has been committed) -