This was the case a few years back when the fastest pools were implemented with recursive data structures (e.g. linked lists for the freelists in gperftools).
In the new tcmalloc (and, I think, hoard?) the fastest pools are essentially slabs with bump allocation, so the fastest (and by far, the most common) calls are a grand total of 15 or so instructions, without many cache misses (size class lookups tend to stay in the cache). Call overhead can be a substantial chunk of that.
First of all, let’s be clear: for a lot of algorithms adding call overhead around them is free because the cpu predicts the call, predicts the return, and uses register renaming to handle arguments and return.
But that’s not the whole story. Malloc perf is about what happens on free and what happens when the program later accesses the memory that malloc gave it.
When you factor all that in, it doesn’t matter how many instructions the malloc has. It matters whether those instructions form a bad dependency chain, whether they miss cache, whether the memory we return is in the “best” place, and how much work happens in free (by the same metrics - dependency chain length and misses, not total instruction count or whether there’s a call).
> First of all, let’s be clear: for a lot of algorithms adding call overhead around them is free because the cpu predicts the call, predicts the return, and uses register renaming to handle arguments and return.
This is only thinking at the hardware layer. There are also effects at the software layer.
Function calls not known to LTO prevent all sorts of compiler optimizations as well. In addition, as Jeff says on this thread, not being able to inline free means you get no benefit from sized-delete, which is a substantial improvement on some workloads. (I'd cite the exact number, but I'm not sure Google ever released it.)
LTO I will grant you. I don't think you need to meaningfully inline the body of delete to get sized-delete to work effectively. You do want to statically link your sized operator delete body, though, obviously. The last time I tried it, inlining the actual body was difficult to make effective.
Agree, I shouldn't have been loose wrt the distinction between inlining and LTO. I think I also ran that experiment with inlining sized-deletes and came to the same (surprising, to me) conclusion.
Let’s be clear: even without LTO, the compiler knows what malloc and free do at a semantic level.
I buy the sized delete win, I’ve seen that too, in some niche cases. But that’s a far cry from the claim that not inlining malloc is bananas. If that’s all you got then this sounds like a stunning retreat from the thesis of Jeff’s post. You’re just saying that you’ve got a free that benefits from size specialization, which has nothing to do with whether malloc got inlined.
> even without LTO, the compiler knows what malloc and free do at a semantic level.
int *a = ...;
int b = *a;
malloc(42);
b = *a;
printf("%d\n", b);
If malloc is an opaque function, can the compiler nonetheless eliminate the second `b = *a` load as dead because it "knows what malloc does"? Certainly not, right?
The compiler knows the semantics of malloc. Malloc as it appears in the source code doesn't even need to map 1:1 to calls to the library function; the compiler can remove and coalesce calls at will.
Not OP, but I do (a fair amount of) HPDE/TT with an m240i, converted to the cup car spec.
Running costs for me are ~$500 / 2-hour day on 100-TW tires. These cars aren't exactly light, so they eat through consumables. Tires are the largest part, then repairs, then brakes. This doesn't count fuel.
If I had to guess for OP in PWC, 50-100% more, mostly driven by slick costs.
Both static and dynamic histograms can be pretty important, actually. Dynamic ones for performance, static ones -- usually for correctness (imagine developing a tool just like QEMU, which needs to emulate each instruction type).
Performance counters by themselves aren't granular enough for an exact histogram. But you can use them (especially the LBR[1] and the fancy new PT[2]) to reconstruct an approximate control-flow graph, and with a bit of post-processing it's easy to get per-instruction call frequencies.
A long time ago, I wrote a paper on x86 trace compression that needed a dynamic histogram like the one you mentioned. As expected, the CDF rises very very fast [3, Fig. 5] -- you can cover a very large fraction of execution with a very small number of instructions.
I beg to disagree. For instance, it's been running for years sampling Google's fleet [1] (the paper mentions oprofile, but that has been replaced quite a while ago).
also Xed does not seem to be open source. It ships with a bunch of headers that say "Intel Open Source License" at the top but there is a binary library instead of source files.
yes, I don't need its decoding capability and I don't think its syntax is pretty but if I ever decide to write my own library I'll probably wrap xed in some C++ sugar.
I'm not sure I understand your requirements but might DynASM be useful? It's one component of the JIT library behind LuaJIT but many people use it for run-time code generation completely outside a JIT setting.
DynASM looks cool but its preprocessing step and fancy C integration definitely place it outside the description "Just an instruction encoder with a moderately pretty syntax."
I guess it's not fair to say that XED and DynASM are "much more complicated than the problems they're trying to solve." They are much more complicated than the problem I'm trying to solve. But I am surprised that there is no minimal X86 encoder with nice C++ syntax out there.
My go-to practitioner:
Jean-luc Doumont: http://www.principiae.be/X0302.php
I highly recommend his talks (he tours universities in the US quite often) -- they are full of small practical nuggets on data presentation.
Wow! Great reference and lots of material to study. He too would be perfect on a new presentation standardization working group. I have to say that since this post went live, I've gotten so many great references; it just shows that a lot of very smart people have thought about this problem - it would be fantastic to connect them all somehow.
Yeah, cpu idleness is pretty fascinating and, as it turns out, quite important. Shameless plug: I just published a paper [1] showing that if you mess up how deeply you go to sleep in cpuidle, you can lose up to ~15% in latency in datacenter services like websearch.
This reminds me: if you're stuck on a Linux kernel before 3.9 and you're using Sandy Bridge hardware or newer, turn off C1E in BIOS. The "C1E" setting controls a misfeature in which the CPU will look at the OS's request for a light sleep and say "Ha! You must have asked for light sleep by accident! I'm going to go all the way to sleep instead!" This causes surprisingly severe performance problems (we've seen milliseconds of unexplained random hiccups).
Linux 3.9 and higher is immune: it just forcibly disables this feature.
Hah. I guess that's a major benchmark win - decreases power usage in common cases, and few CPU benchmarks test wake-from-sleep latency either directly or indirectly.
OP here. Totally agree. TIL about RAII.
That particular codebase doesn't use C++ exceptions, but that's still not a reason for a roll-your-own solution when something just as efficient is in the standard library.
Ok, I should've been more specific -- it doesn't catch any exceptions; it lets them bring down the program so the project maintainer can check what they did wrong.
For an intro read, Hennessy and Patterson's "Computer Organization and Design" is good. Their other book, "Computer Architecture: A Quantitative Approach" is the de facto bible, but it's slightly more advanced, and not a good first read. Another good intro book is "Computer Systems: A Programmer's Perspective" - especially for software people.