This was the case a few years back when the fastest pools were implemented with recursive data structures (e.g. linked lists for the freelists in gperftools).
In the new tcmalloc (and, I think, hoard?) the fastest pools are essentially slabs with bump allocation, so the fastest (and by far, the most common) calls are a grand total of 15 or so instructions, without many cache misses (size class lookups tend to stay in the cache). Call overhead can be a substantial chunk of that.
First of all, let’s be clear: for a lot of algorithms adding call overhead around them is free because the cpu predicts the call, predicts the return, and uses register renaming to handle arguments and return.
But that’s not the whole story. Malloc perf is about what happens on free and what happens when the program later accesses the memory that malloc gave it.
When you factor all that in, it doesn’t matter how many instructions the malloc has. It matters whether those instructions form a bad dependency chain, whether they miss cache, whether the memory we return is in the “best” place, and how much work happens in free (by the same metrics - dependency chain length and misses, not total instruction count or whether there’s a call).
> First of all, let’s be clear: for a lot of algorithms adding call overhead around them is free because the cpu predicts the call, predicts the return, and uses register renaming to handle arguments and return.
This is only thinking at the hardware layer. There are also effects at the software layer.
Function calls not known to LTO prevent all sorts of compiler optimizations as well. In addition, as Jeff says on this thread, not being able to inline free means you get no benefit from sized-delete, which is a substantial improvement on some workloads. (I'd cite the exact number, but I'm not sure Google ever released it.)
LTO I will grant you. I don't think you need to meaningfully inline the body of delete to get sized-delete to work effectively. You do want to statically link your sized operator delete body, though, obviously. The last time I tried it, inlining the actual body was difficult to make effective.
Agree, I shouldn't have been loose wrt the distinction between inlining and LTO. I think I also ran that experiment with inlining sized-deletes and came to the same (surprising, to me) conclusion.
Let’s be clear: even without LTO, the compiler knows what malloc and free do at a semantic level.
I buy the sized delete win, I’ve seen that too, in some niche cases. But that’s a far cry from the claim that not inlining malloc is bananas. If that’s all you got then this sounds like a stunning retreat from the thesis of Jeff’s post. You’re just saying that you’ve got a free that benefits from size specialization, which has nothing to do with whether malloc got inlined.
> even without LTO, the compiler knows what malloc and free do at a semantic level.
int *a = ...;
int b = *a;
malloc(42);
b = *a;
printf("%d\n", b);
If malloc is an opaque function, can the compiler nonetheless eliminate the second `b = *a` load as dead because it "knows what malloc does"? Certainly not, right?
The compiler knows the semantics of malloc. Malloc as it appears in the source code doesn't even need to map 1:1 to calls to the library function; the compiler can remove and coalesce calls at will.
Not OP, but I do (a fair amount of) HPDE/TT with an m240i, converted to the cup car spec.
Running costs for me are ~$500 / 2-hour day on 100-TW tires. These cars aren't exactly light, so they eat through consumables. Tires are the largest part, then repairs, then brakes. This doesn't count fuel.
If I had to guess for OP in PWC, 50-100% more, mostly driven by slick costs.
Both static and dynamic histograms can be pretty important, actually. Dynamic ones for performance, static ones -- usually for correctness (imagine developing a tool just like QEMU, which needs to emulate each instruction type).
Performance counters by themselves aren't granular enough for an exact histogram. But you can use them (especially the LBR[1] and the fancy new PT[2]) to reconstruct an approximate control-flow graph, and with a bit of post-processing it's easy to get per-instruction call frequencies.
A long time ago, I wrote a paper on x86 trace compression that needed a dynamic histogram like the one you mentioned. As expected, the CDF rises very very fast [3, Fig. 5] -- you can cover a very large fraction of execution with a very small number of instructions.
I beg to disagree. For instance, it's been running for years sampling Google's fleet [1] (the paper mentions oprofile, but that has been replaced quite a while ago).
also Xed does not seem to be open source. It ships with a bunch of headers that say "Intel Open Source License" at the top but there is a binary library instead of source files.
yes, I don't need its decoding capability and I don't think its syntax is pretty but if I ever decide to write my own library I'll probably wrap xed in some C++ sugar.
I'm not sure I understand your requirements but might DynASM be useful? It's one component of the JIT library behind LuaJIT but many people use it for run-time code generation completely outside a JIT setting.
DynASM looks cool but its preprocessing step and fancy C integration definitely place it outside the description "Just an instruction encoder with a moderately pretty syntax."
I guess it's not fair to say that XED and DynASM are "much more complicated than the problems they're trying to solve." They are much more complicated than the problem I'm trying to solve. But I am surprised that there is no minimal X86 encoder with nice C++ syntax out there.
My go-to practitioner:
Jean-luc Doumont: http://www.principiae.be/X0302.php
I highly recommend his talks (he tours universities in the US quite often) -- they are full of small practical nuggets on data presentation.
Wow! Great reference and lots of material to study. He too would be perfect on a new presentation standardization working group. I have to say that since this post went live, I've gotten so many great references; it just shows that a lot of very smart people have thought about this problem - it would be fantastic to connect them all somehow.
Yeah, cpu idleness is pretty fascinating and, as it turns out, quite important. Shameless plug: I just published a paper [1] showing that if you mess up how deeply you go to sleep in cpuidle, you can lose up to ~15% in latency in datacenter services like websearch.
This reminds me: if you're stuck on a Linux kernel before 3.9 and you're using Sandy Bridge hardware or newer, turn off C1E in BIOS. The "C1E" setting controls a misfeature in which the CPU will look at the OS's request for a light sleep and say "Ha! You must have asked for light sleep by accident! I'm going to go all the way to sleep instead!" This causes surprisingly severe performance problems (we've seen milliseconds of unexplained random hiccups).
Linux 3.9 and higher is immune: it just forcibly disables this feature.
Hah. I guess that's a major benchmark win - decreases power usage in common cases, and few CPU benchmarks test wake-from-sleep latency either directly or indirectly.
OP here. Totally agree. TIL about RAII.
That particular codebase doesn't use C++ exceptions, but that's still not a reason for a roll-your-own solution when something just as efficient is in the standard library.
Ok, I should've been more specific -- it doesn't catch any exceptions; it lets them bring down the program so the project maintainer can check what they did wrong.
For an intro read, Hennessy and Patterson's "Computer Organization and Design" is good. Their other book, "Computer Architecture: A Quantitative Approach" is the de facto bible, but it's slightly more advanced, and not a good first read. Another good intro book is "Computer Systems: A Programmer's Perspective" - especially for software people.