Speaking of profiling, I wrote an inline profiler for multi threaded Linux apps called ThreadTracer. It's quite good, as it records both wall clock time and cpu time. And on top of that, it also keeps track of pre-empted threads and voluntary context switches.
It's a fine tool, making use of Google Chrome's tracing capability to view the measurements. But for the really low level, micro-sized measurements, we need an additional tool. And this tool is the linux perf tool.
Instead of using perf directly, I found that using the ocperf.py wrapper from PMU-Tools to be a much better option. It has proven to be more reliable for me. You can make it sample your program to see where the cycles are spent. I use the following command line:
$ ocperf.py record -e cpu-cycles:pp --call-graph=dwarf ./bench $ ocperf.py report -g graph,0.25,caller
In addition to the perf wrapper, it also comes with a great overall analysis tool called toplvl.py which gives you a quick insight into potential issues. Start at level 1 (-l1) and drill your way down to more specific issues, using:
$ toplev.py --long-desc -l1 ./bench