c - Compiler Optimizations effect on FLOPs and L2/L3 Cache Miss Rate using PAPI -




So we've been given an assignment to compile some code (which we're supposed to treat as a black box) using different Intel compiler optimization flags (-O1, -O3) and vectorization flags (-xHost, -no-vec), and to observe the changes in:

Execution time
Floating point operations (FPOs)
L2 and L3 cache miss rate

After performing these optimizations, we noticed a drop in execution time, which was expected, considering all the changes the compiler makes to the code for the sake of efficiency. However, we also noticed a drop in the number of FPOs, and while we understand that it's a good thing, we're not sure why it happened. Also, we noticed (and cannot explain) an increase in the L2 cache miss rate (it rose as the optimization level increased), with no significant increase in cache accesses, and no changes at the L3 level.

Using no vectorization or optimization at all produced the best result in terms of L2 cache miss rate, and we were wondering if you guys could give us some insight, as well as supporting documentation, literature, and resources that we can use to deepen our knowledge on this topic.

Thank you.

Edit: the compiler options used are:

-O0 -no-vec (highest execution time, lowest L2 cache misses)
-O1 (less execution time, higher L2 cache misses)
-O3 (even less execution time, higher L2 cache misses)
-xHost (execution time of the same order as -O3, highest L2 cache misses)

Update:

While there was a small decrease in overall L2 cache accesses, there was a massive increase in actual misses.

With -O0 -no-vec:

Wall clock time in usecs: 13957075

L2 cache misses: 207460564
L2 cache accesses: 1476540355
L2 cache miss rate: 0.140504
L3 cache misses: 24841999
L3 cache accesses: 207460564
L3 cache miss rate: 0.119743

With -xHost:

Wall clock time in usecs: 4465243

L2 cache misses: 547305377
L2 cache accesses: 1051949467
L2 cache miss rate: 0.520277
L3 cache misses: 86919153
L3 cache accesses: 547305377
L3 cache miss rate: 0.158813

On the reduced number of floating-point ops: with optimization, the compiler may hoist common calculations out of loops, fuse constants, pre-calculate expressions, and so on.

On the increased cache miss rate: if the compiler uses vectorization and loads a full vector width worth of data each time, it will use far fewer loads from memory in total. But every time it accesses a cache line in a way the predictor didn't anticipate, it still causes a cache miss. Taken together, you have fewer loads but the same number of cache lines touched, so the rate of misses can be higher.

c intel compiler-optimization cpu-cache flops
