When Optimising Code, Measure

dspillett · on Jan 16, 2024

Also measure many times, and make sure any difference you see isn't accounted for by variation in the test system (CPU frequency scaling due to temperature, your test being run partly (or entirely) on low-power cores if your CPU has such a distinction, network contention if you aren't working locally or over a link dedicated to the tests, etc.).

Also where IO is at all involved remember to test both cold and warm starts, and make sure that if your new version seems faster in the results that it actually is rather than a previous run having “warmed up” something. Make sure any prep you do before each measurement doesn't inadvertently change the results – keep prep and actual the tests separate (and cool down between if trying to analyse cold-run performance). On this latter point I once had a heck of a time convincing someone that his new index hadn't sped up a SQL query by an order of magnitude – he was correctly running from cold each time but was adding the index and running the query immediately (without dropping cached pages between), the affected table was small enough to fit in memory and adding the index caused a full scan so when the actual test was done the system was no longer cold.

thom · on Jan 16, 2024

Worth pointing out that this is actually what Knuth meant by premature optimisation. Nothing to do with optimising too early in a project or some sense of YAGNI, but rather optimising before you know where the actual time is spent.

norir · on Jan 16, 2024

I don't think the original paper by Knuth (https://dl.acm.org/doi/10.1145/356635.356640) supports this interpretation. In the paper, Knuth describes how as one develops as a programmer one learns how to write in a style that is generally reasonably efficient. In particular, you should take note of hot loops, which are often quite obvious, and then write in the most readable style outside of the hot loops without too much thought about efficiency. By accepting some inefficiency, but writing in a generally efficient style, you can move much more quickly than if you are constantly bogged down in the details trying to squeeze out every possible cpu cycle.

What people tend to miss in my view is that he is advocating writing in an efficient style, which may not be obvious to an inexperienced developer. For example, writing a linear algebra routine in which you traverse column wise in an inner loop when the data is laid out row wise is going to be very slow and a simple change of algorithm or restructuring of the data can lead to an enormous speed up. This should be done immediately if discovered. But a skilled programmer will also be unlikely to traverse in the wrong order anyway because they already know the data layout and have thought about the order of traversal as part of their thinking process. In other words, the "optimization" of traversing in the fast way is something they would just do naturally and has no influence on the development time.

Conversely, they may also suspect that with some clever bit-twiddling hacks, they may be able to squeeze another 5-10% performance out of the routine. Unless you know that the code is being distributed at scale, this is almost certainly a waste of time. It will be hard to write as well as brittle and error prone. It deflects your attention from more important problems. It is evil both because of the opportunity cost of chasing these small efficiencies as well as the increased complexity that they often introduce for minimal gain.

thom · on Jan 16, 2024

I'm pasting the famous passage below, and I don't know how to interpret it as anything other than a call to correctly identify the performance critical part of one's code before spending significant time on optimization. If you can intuit that, more power to you! If you can't, you must measure, and Knuth gives the suggestion that your compiler should do this automatically for you:

"There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools or seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off."

I certainly agree with the interpretation that the _cost_ is maintainability of prematurely optimised code and this is the evil he's describing. But this is also all in the context of efficiency as a virtue in itself - i.e. something you should be seeking, but seeking in the correct places.

fl0ki · on Jan 16, 2024

My problem is that people use this as an excuse not to consider optimization at all initially, because they think that once they have a working version, they can "just" profile and optimize the slowest few functions to get ideal performance. From the point of view of that singular task, yes, they achieved an optimization objective. From the point of view of overall performance vs what was possible given certain hardware, they could still be several orders of magnitude away from optimal.

A profiler can't tell you that an entire operation would become completely unnecessary if you maintain a certain invariant elsewhere, and this can go as far out as you let it, e.g. entirely changing your in-memory data model to prepare it for the operations you know you'll be performing in the critical path.

I know because I've been fortunate to work in environments full of seasoned experts writing world-scale software in C++ and Rust, that I was still able to optimize 100x-1Mx by zooming out and solving the same problem from a new perspective, something a profiler cannot help with.

And I'm _nothing_ compared to the people that come around and make it another 10x faster with SIMD or some cutting edge computer science research I'm not across. (Remember when NFA regex was a breakthrough and now looks like a joke compared to DFA?). Even a profiler hinting that a hot function might benefit from SIMD can't tell you how you should arrange your overall data structures to fully benefit from SIMD, or cache locality for that matter.

This isn't an attack on what you said, and what I'm saying is bordering on incoherent because there's so much more to the performance optimization iceberg than the "profiling" tip that most engineers are permanently stuck on. But no, every time aggressive holistic optimization conversations come up, more than half of the audience will dismiss it as "premature" or insist on just profiling and decline to learn anything. Then people like me have to inherit and rewrite their projects because they've ground our global scale to a halt.

_a_a_a_ · on Jan 16, 2024

> they could still be several orders of magnitude away from optimal

I have a fair bit of experience in optimisation and this is ridiculously... optimistic, shall we say. The only way you can get literally orders of magnitude speed up is if some pillock put literally orders of magnitude of slowdown in originally, typically by combining a couple of O(n^2) operations, and even then only seen that a few times and it's easy to fix.

fl0ki · on Jan 16, 2024

I've seen far worse crimes than that :)

Example: Writing what is effectively a graph algorithm where every edge/node lookup is a string key in a hash map. Transforming the string keys into dense integer offsets once-off costs nearly nothing, and reduces every single lookup into a direct memory offset. That one optimization alone was an over 30x speedup, and you can imagine that if they wrote it like this then there was a lot of other low hanging fruit.

These people passed one of the most famous computer science hiring bars in the industry, and this code passed review.

When I interview people with a puzzle version of a problem like this, 95% of candidates still use string keys instead of reducing them to integers, so this is shockingly common and not at all an edge case. It's just one of dozens of common performance antipatterns I see in code written by people with TC over $500k USD.

It's also why I call bullshit when people online claim that computer science interviewing is gatekeeping and they don't need algorithms in their day job. This is false both ways; it's not gatekeeping successfully enough because nonsense like this makes it to global production and costs tens of millions, and anyone who wants to succeed in this job is expected to know better.

HALtheWise · on Jan 16, 2024

I'm curious how this interacts with language selection. In particular, it looks like most of the programming languages in widespread use in 1974 were all relatively efficient compiled and statically typed languages (Fortran, cobol, C) with the notable exception of Basic. Would Knuth have considered choosing a performant language as "optimization" or just as part of "writing in an efficient style as any good programmer should do"?

The distinction is a lot more important today with the ~100x speed and power consumption differences between Ruby/Python/etc and C++/Java/Go/Rust/etc.

thom · on Jan 16, 2024

Knuth makes this argument more clearly elsewhere, in Computer Programming as an Art:

"Another important aspect of program quality is the efficiency with which the computer's resources are actually being used. I am sorry to say that many people nowadays are condemning program efficiency, telling us that it is in bad taste. The reason for this is that we are now experiencing a reaction from the time when efficiency was the only reputable criterion of goodness, and programmers in the past have tended to be so preoccupied with efficiency that they have produced needlessly complicated code; the result of this unnecessary complexity has been that net efficiency has gone down, due to difficulties of debugging and maintenance. The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."

If you have a holistic definition of efficiency, then it might make sense to use a performant language for the performance critical parts of a wider system, but more productive languages elsewhere. You might make a similar argument about languages that add some productivity overhead to achieve stronger guarantees of safety or correctness.

jraph · on Jan 16, 2024

I didn't know this, thanks for pointing it out.

I think this has been a costly misinterpretation of "premature optimisation". This phrase has been taken as justification for writing inefficient code, understanding it as "performance doesn't matter [for what we do]".

It almost always matters, and I believe it is possible to write both efficiently and in a readable way. Unfortunately, inefficient code is well established, in habits and even in (sometimes mandatory) coding styles. This inefficiency also often translates to hard to follow code because of too much abstraction.

Maybe we could counter this misinterpretation of "premature optimisation" with "premature abstraction".

ReleaseCandidat · on Jan 16, 2024

> This phrase has been taken as justification for writing inefficient code, understanding it as "performance doesn't matter [for what we do]".

There is no contradiction. Writing inefficient code is perfectly fine until you _know_ (this means you have measurements that show that) it is _too_ inefficient and has to be optimised.

How do you measure the "optimisation" without "inefficient code" as baseline to begin with?

vlovich123 · on Jan 16, 2024

Depends what you mean by inefficient. The other thing Knuth meant is that you should focus your efficiency energy first on overall design efficiency. Algorithmic and other efficiencies are irrelevant if your design is bad and a bad design cannot be fixed by looking at the profiler and fixing hotspots

jraph · on Jan 16, 2024

> Writing inefficient code is perfectly fine until you _know_ (this means you have measurements that show that) it is _too_ inefficient and has to be optimised.

What about inefficient code only if you _know_ that it does not matter?

I'm not talking about optimizing. I'm talking about avoiding inefficiencies in the first place.

This is what I'm talking about: we are in "inefficient by default" mode. That leads to requiring powerful hardware to display the most lambda websites.

edflsafoiewq · on Jan 16, 2024

In that case, how do you know you're measuring the "optimization" and not "how awful the original code was"?

reitzensteinm · on Jan 16, 2024

If you're measuring performance and haven't seen this talk by Emery Berger, I highly recommend it.

It's mind blowing, but also terrifying when you realize how many mistakes you've been making up until this point.

https://youtu.be/7g1Acy5eGbE

tialaramex · on Jan 16, 2024

I've watched another version of this talk (maybe at Strange Loop), but I'm pretty sure I hadn't seen this hangup on measuring weird C++ artefacts against Python before which comes before the interesting stuff and for me was a big disappointment.

Comparing the size of C++ int (which will be the 32-bit signed integer because hey why not) against Python's integers (which are just fully arbitrary big integers) is unfair.

But it doesn't stop there, next it measures std::list<int> which is a doubly linked list, a container that C++ keeps around partly because it is infested with the sort of programmers who've heard this data structure is sometimes a good choice (it is) and therefore reach for it first -- exactly the sort of bad optimization we're warning against. Python doesn't even have a doubly linked list.

Then it compares std::map<int,int> against Python's dict, but std::map is a red-black tree of linked lists, Python's dict is a hashtable, because of course it is. So these aren't anywhere close to comparing like with like either.

screye · on Jan 16, 2024

Not everyday you see Emery be the top comment on HN.

I was lucky to be in his courses in grad-school. It's the most fun I've had in an American classroom, all while discussing serious systems stuff. A gem of a professor.

osigurdson · on Jan 16, 2024

If you don't have a performance culture your app will not be performant. The idea that performance problems are easy to reproduce and a profiler will always point to a single, easy to fix problem is naive.

sergioisidoro · on Jan 16, 2024

I could think of a nice corollary for this: If it's not worth measuring, it's not worth optimising.

stinos · on Jan 16, 2024

How to decide whether it's worth measuring without measuring?

OscarDC · on Jan 16, 2024

It may be that the entire performance including the logic is satisfactory.

Or it may be that the conditions to reach that logic are too rare to make the time to measure and optimize worthwhile.

In any case, I understand it as: if you're not bothered about measuring the performance of something, it may be that you don't need to optimize it.

WJW · on Jan 16, 2024

> It may be that the entire performance including the logic is satisfactory.

In which case you have already measured it, even if not very precisely.

> Or it may be that the conditions to reach that logic are too rare to make the time to measure and optimize worthwhile.

This too cannot be determined without measuring how often it happens.

OscarDC · on Jan 16, 2024

By what I understood of your reply, everything is a measure and I fail to see how something cannot be measured. Or perhaps I wasn't clear in what I meant initially?

> In which case you have already measured it, even if not very precisely.

If you're doing a CLI tool for example and it responds soon enough for it to be a bother, you won't want to measure a sub-part of its logic. You may be blindly "measuring" the time the whole command took with your eyes and thoughts, but you're not measuring in any way the time taken by the corresponding logic in the whole thing.

> This too cannot be determined without measuring how often it happens.

If the CLI tool logic sub-part is in a specific combination of flags and conditions that you for now didn't see a use case for or even sometimes possibility yet, you may also not measure it, without needing to have to "measure how often it happens". For example you may want to skip some optimizations in what you think will be rarely encountered error handling code, even if you actually never measured it. Even if it does happen more often than you thought, low performance may still be acceptable in some unexpected code (e.g. after a typo in a CLI tool flags).

bluGill · on Jan 16, 2024

If nobody notices then it isn't worth measuring.

osigurdson · on Jan 16, 2024

Wait until your boss tells you something is a problem - this is how a lot of people think.

lysecret · on Jan 16, 2024

Haha this is amazing.

StarlaAtNight · on Jan 16, 2024

These online tech spaces always assume that saving on compute, storage, and memory are the ultimate targets for optimization.

I’ve worked (mostly on business/data teams) in companies for 10 years, and have to say, saving on technical debt and human inefficiency w/ a codebase typically far outweighs the computational metrics

How might we incorporate that in a metric-like way?

lysecret · on Jan 16, 2024

I agree. However simple to understand code with high human efficiency is nearly impossible to measure. Honestly, maybe ChatGPT could give you the best evaluation/measure of that.

Shish2k · on Jan 16, 2024

“write code clear enough that an LLM can look at it and create an accurate summary of the high-level goals” gives me similar vibes to “explain like I’m 5”, which seems like it could be an interesting metric (and my gut tells me that things will go horribly wrong as soon as somebody turns this metric into a target :D)

marcosdumay · on Jan 16, 2024

It looks like a metric of the kind of lines of code.

Hugely useful for measuring a lot of things. But obviously bad for optimizing even if it's not a formal target.

adammarples · on Jan 16, 2024

It's easy to measure. You sit a dev down in front of it and time how long it takes them to understand it. That's not just an estimate that's the key value itself.

karles · on Jan 16, 2024

Well, human processes can be visulised and their completion time can be estimated, as can possible breakdowns.

Eliminating breakdowns thus eliminate "waste" in required effort, coordination etc.

So I don't see how human processes differs from technical ones that much...

vlovich123 · on Jan 16, 2024

What techniques do people use to profile code where the core path takes <<1ms? A sampling profiler is max 1khz and thus aliasing artifacts will hide the true source of hotspots I find and valgrind is too slow. The other problem I have is that diffuse hotspots are hard to root cause when running the optimized version (eg hard to spot bounds checking). I’m not confident that simply running the code under test longer helps with the aliasing artifacts (probably because if the core loop is running in 1us then I have to run for 1000x longer to remove the sampling bias due to Nyquist if that’s even the right strategy?)

phkahler · on Jan 16, 2024

>> What techniques do people use to profile code where the core path takes <<1ms?

Make a loop that runs the code 5000 times and time that. What do you mean by "diffuse hot spots"? If a piece of code taking 1ms is called infrequently then optimizing it is not going to have a meaningful impact on your overall execution time.

You can measure the time for longer running higher level functions that do a lot. If you then optimize some small function that's a leaf in the call graph, you can see the impact on the larger function - and it probably won't be much.

One thing that can happen when nobody cares about performance is that every part of the code gets little inefficiencies and fixing any one of them has very little impact, but fixing ALL of them can be really significant. In that case you should start by profiling or timing the higher level functions and finding ones that have relatively small call trees that you can work on. Once you optimize one piece, the effects of optimizing others are a bit more significant because they are a larger part of overall execution time.

vlovich123 · on Jan 16, 2024

Diffuse could be something like bounds checking in Rust. It’s not code I’ve written nor will any profiler call out the bounds check - they’ll at best point to the high level collection method but each one is a speed bump. Or the compiler generating inefficient code - spelunking through assembly hoping to spot the problem isn’t an effective use of time and profilers I’ve tried (valgrind and hotspot) have been unhelpful in giving me the info.

The advice you’ve provided is correct philosophy but not actually actionable - I’m having trouble finding the inefficiencies.

artimaeis · on Jan 16, 2024

Profiling != benchmarking. Profiling is great to find _where_ you should setup benchmarks in your code. If you've got code in a path that consistently takes <1ms and isn't called enough times to significantly add to the profiling of the application, would it really warrant benchmark comparisons?

As others have said, benchmark frameworks are great. Many systems have support for high-resolution perf counters these days. Just ran a quick test on my local system, it's able to make a timer accurate to 100ns.

Benchmark frameworks let you hook into your code in such a way that you can easily run a single fragment 10k-10m times and produce an analysis of the result that includes mean time, std deviation, and memory allocations. They usually let you setup a side by side comparison so you can directly compare 2+ methods.

If you use a language with a JIT then good frameworks for those include a warmup period to get the best-JIT version of the code in place before collecting data.

vlovich123 · on Jan 16, 2024

Yes I’ve done all that. But I’m still at being able to do the operation 20M times/s (ie 50ns) whereas a prototype I tested could do 100M (10ns) so I’m having trouble spotting the speed bumps that are in the way.

alright2565 · on Jan 16, 2024

I'm confused here. Does this core path only run once per second? I've had no problems profiling blocks of code that short, because it sounds like every inner loop to me.

If you're looking to optimize a few hundred instructions, then focus on reading and understanding the code manually at the instruction level. A profiler will never help you there, CPUs are too complex:

1. I don't trust that the profiler has equal probability of pausing on any specific instruction

2. CPUs do no execute instructions in sequence, but in parallel. So you will need to understand data dependencies & maybe pipeline structures for your target processor. If you're getting really serious, you'll want something like https://uops.info/table.html

vlovich123 · on Jan 16, 2024

So I’m toying with my own database implementation and I’m getting ~20M writes per second. The write operation is a critical path that takes 1/20Mhz (ie 50ns). A sampling profiler will only capture max 1000hz (and even that is pushing it). And a hot spot here is 5-10ns probably. And similarly it could very well be memory transfer costs vs cpu which I’ve found hard to distinguish.

I’ve had luck with valgrind but just reading individual instructions has been uninformative as the code path is too complex for that kind of analysis.

Veserv · on Jan 16, 2024

You can use function entry/exit (FEE) tracing.

A optimized implementation should only incur overhead around one hardware timestamp per function call which is generally around 10-100 ns. That is sufficient for benchmarking quantitative differences at the 1000 ns level and even clear qualitative differences at the 100 ns level. The bias is also fairly stable so you can generally even subtract it out with a little bit of care. And really, the microarchitectural state starts dominating at that point anyways, so anything more fine-grained than that starts depending on making sure the surrounding code did not mess up your L2 or branch predictor and such.

vlovich123 · on Jan 16, 2024

Do you have any link to something that does it? I’ve not been able to find anything. And the things I’m trying to optimize is 50ns -> 10ns so I’d need to be careful about the overhead but hopefully it can be set to a mode that only does it for instrumented functions I’m trying to profile rather than all entry/exit

charleshn · on Jan 16, 2024

I really like magic-trace [0].

https://github.com/janestreet/magic-trace

Not that the exact tracing relies on Intel PT - support for AMD was added recently but uses perf so suffers from the same sampling/skew issues, but is still very useful.

vlovich123 · on Jan 16, 2024

Thanks for the tip. I thankfully went with Intel so it should work and looks like it might get close. My needs are a bit higher frequency than that but it should help with other use cases that aren’t as high frequency + maybe there’s a way to set it to only sample instrumented functions so that the timer cost is further amortized.

maccard · on Jan 16, 2024

A benchmark framework, e.g Google Benchmark, and then WPF/Superluminal/Vtune/uProf.

ChrisMarshallNY · on Jan 16, 2024

This was the lesson that my team learned, as we optimized image processing pipeline code (C++).

The most interesting results, were cache hits. It could have a huge impact (like 1000X), if you could keep execution and data in lower-level caches.

Often, the fix was cut-and-paste repeated code, as opposed to loops or function calls, FP-like static functions, small stack frames, etc.

billyzs · on Jan 16, 2024

> a2, c = a2+c, c+2 > is faster than > a2 += c > c += 2 > My guess is that in the first case the two evaluations and assignments can happen in parallel, and so may happen on different cores

Not sure I follow, isn't Python single threaded by default? Changes to GIL is coming but does it change how the interpreter uses CPU?

kristofferc · on Jan 16, 2024

Yeah, this isn't benchmarking anything related to the CPU etc. It is benchmarking the quirks of the Python interpreter.

danybittel · on Jan 16, 2024

There is instruction level parallelism in modern CPUs. They have multiple "calculation units", that do for example addition. If one doesn't depend on the other they get executed at the same time.

gpderetta · on Jan 16, 2024

But there is no dependency on the expressions on either variant, so there is no reason why the first variant is faster than the second in principle (of course python internals will get in the way and make it hard to reason about performance at all).

reitzensteinm · on Jan 16, 2024

Not sure why you're being downvoted, because you're right.

To back up parent's point, if you compile the code and the resulting assembly is a direct translation, renaming will break the dependency and the CPU will execute the instructions in parallel. Write after read hazard is the applicable section:

https://en.wikipedia.org/wiki/Register_renaming

gpderetta · on Jan 16, 2024

[off topic, but I expect that some amount of downvotes are people misclicking; I know that I found myself correcting many of my own, I wonder how many I don't catch]

epcoa · on Jan 16, 2024

LOL. The amount of machinery going on under the hood in evaluating those expressions in CPython is staggering. A microscopic detail like a single instruction data dependency has nothing to do with it. (How many CPU add instructions are executed just for those statements? Probably hundreds.)

This is much more likely a quirk of the interpreter (or possibly a fucked up test). CPU details are like 10000 feet down.

reitzensteinm · on Jan 16, 2024

You're absolutely right, but laughing at the notion is unnecessarily rude.

epcoa · on Jan 16, 2024

You’re right. Though it is the scenario in my head playing out of someone slaving over an architecture optimization manual while, zoom out, editing Python that was comical, rather than making fun of anyone specifically.

ufo · on Jan 16, 2024

It wouldn't run in separate cores but single-threaded can also get some measure of instruction-level parallelism.

A CPU can do more than one thing at once by computing the next instruction while it's still writing the result of the previous one. However, the CPU can only do that if it's 100% sure that the next instruction does not depend on the previous instruction. This optimization sometimes can't trigger in an interpreter, because of highly mutable variables such as the program counter or the top of the interpreter stack. Fun illustration: https://www.youtube.com/watch?v=cMMAGIefZuM&t=288s

saagarjha · on Jan 16, 2024

Running that across two cores would be a slowdown, not a speedup. You cannot parallelize work like this, because it's too small to be worth it.

kunley · on Jan 16, 2024

The evaluations don't magically/implicitly happen on many cores.

I guess the first thing worth doing when analyzing this would be looking at the differences in the bytecode, then looking at the C code implementing the differing bytecode ops. But there also other factors, like the new adaptive interpreter trying to JIT the code.

zimpenfish · on Jan 16, 2024

For Python 3.12, Godbolt gave almost identical bytecode for both (albeit in different order.) I'm guessing wildly but might this be because `BINARY_OP(+=)` stores the result (because it's `INPLACE`) and then you also do `STORE_FAST(x)` which gives you two stores for the same value compared with one store in the single-line version?

Single-line dual assignment:

    2         2 LOAD_FAST                0 (a2)
              4 LOAD_FAST                1 (c)
              6 BINARY_OP                0 (+)
             10 LOAD_FAST                1 (c)
             12 LOAD_CONST               1 (2)
             14 BINARY_OP                0 (+)
             18 STORE_FAST               1 (c)
             20 STORE_FAST               0 (a2)

vs the two-line version:

    2         2 LOAD_FAST                0 (a2)
              4 LOAD_FAST                1 (c)
              6 BINARY_OP               13 (+=)
             10 STORE_FAST               0 (a2)

    7        12 LOAD_FAST                1 (c)
             14 LOAD_CONST               1 (2)
             16 BINARY_OP               13 (+=)
             20 STORE_FAST               1 (c)

JonChesterfield · on Jan 16, 2024

Competently measuring performance is difficult. Stuff like the OS accumulating cruft over time can dominate trend lines. Broadly irrelevant reshuffles of program layout can have noticeable impacts on cache behaviour.

I don't bother with competent performance measurement. That involves statistics and the like. Instead I run the thing through callgrind and use that as a proxy for reality, or play fewer-branches-better. That and wall clock time to warn me if I've walked off the edge of some quadratic algorithm.

In principle I'd like to measure performance carefully and reliably, but that probably involves running under an emulator hacked to tell you things about performance, or on bare metal as the only code executing on the chip. Both are difficult.

edward28 · on Jan 16, 2024

Clearly, the only option is rewrite it in rust.

asQuirreL · on Jan 16, 2024

You joke but in Rust all of versions 1 to 4 would probably have compiled to the same machine code because LLVM would use Loop Invariant Code Motion and Induction Variable Analysis to perform the same optimisations that are being done by hand here. (Same would be true of most modern C/C++ compilers)

lifthrasiir · on Jan 16, 2024

While modern compilers do such analyses, the loop in question is too complex to be fully analyzed---it has two independent induction variables in a single loop after all.

asQuirreL · on Jan 16, 2024

You're right -- I got nerd-sniped by this, and ended up rewriting the benchmarks in Rust. The compiler successfully inlines the square root operation, simplifies the exponentiation into multiplication and then does common-subexpression elimination over it, but no induction variable analysis, and therefore also no loop-invariant code motion. Here's it is in Godbolt:

https://godbolt.org/z/Pa6cqd1fb

But the funny thing is, I tried it on my machine, and guess what, the "unoptimised" version is consistently between 4 and 5 times faster than the hand-optimised variants:

https://gist.github.com/amnn/4be5c05975c250d0ea88b62c03f1719...

So that's fun.

ColinWright · on Jan 16, 2024

> ... the "unoptimised" version is consistently between 4 and 5 times faster than the hand-optimised variants:

I'm getting this comment a lot, but so far the reason has been that people are using smallish numbers, and implementing the square root in floats.

In the context where this exploration is taking place, you can't do that. We're working with exact huge integers, and while a square root on floats can get you in the ballpark, you then need to refine the answer to get the exact value, and that takes order of magnitude the same time.

asQuirreL · on Jan 16, 2024

How big are we talking? The benchmarks I tried the two numbers in your example, but yes, if you go above 2^53, then you will run into trouble with the naive solution, but that is for correctness reasons, not performance reasons.

ColinWright · on Jan 16, 2024

This is an experiment on aspects of a factoring algorithm ... we're going to be talking about a few hundred digits base 10.

So up to around 10^200 or so, which is around 2^600.

asQuirreL · on Jan 18, 2024

Well yes, that definitely changes the game!!

GrumpySloth · on Jan 16, 2024

Rewriting from Python to anything else would make it faster.

bluGill · on Jan 16, 2024

Maybe. Ruby might be slower. With some creativity I can make c++ slower, I don't know rust well enough to abuse it, but I'm sure rust experts would have no (or only moral) problem making rust slower.

adrianN · on Jan 16, 2024

Ruby?

andrewstuart · on Jan 16, 2024

Easy to say but often quite hard to do.

Measuring, gathering, reporting, understanding what you’re seeing. These are often very complex and time consuming.

maccard · on Jan 16, 2024

If you can't measure it to prove that before and after are different, then you shouldn't optimise it. You run the risk of making things worse.

IshKebab · on Jan 16, 2024

Sometimes things are obviously going to be slow even without measurement. For example not reserving a large vector. Or doing unbuffered IO.

maccard · on Jan 16, 2024

I stand by my point. There's a great stack overflow question [0] that shows a big pitfall of reserve.

IO buffering is also not as clear cut. If you're running on a terminal, sure. But if you're outputting to a log file, or running in a CI environment where your stdout is being captured and redirected to a parent process, moving to buffered IO changes the semantics significantly. Imagine this code:

    log("dangerous precondition hit, logging some state:" + state);
    function_crashes_here()

With buffered IO, you very likely lose your log line.

[0] https://stackoverflow.com/questions/48535727/why-are-c-stl-v...

gpderetta · on Jan 16, 2024

that's why you to take a best-effort attempt at flushing your logs in your crash handler.

You can't realistically measure everything. You start with a sensible design and implement the reasonable optimizations you learned by experience. Then measure and improve when and if you fail to hit your target.

IshKebab · on Jan 16, 2024

That's nothing to do with performance. Profiling won't help you catch that issue.

maccard · on Jan 16, 2024

Performance doesn't exist in a vacuum. The fastest way to optimise a loop is to remove all the code in it, but you're changing the program in order to do so.

Adding/removing IO buffering absolutely _does_ have side effects. If someone is running `./app >> /dev/null 2>&1`, then mucking arouind with IO buffering is going to have little to no effect.

IshKebab · on Jan 16, 2024

Ok but that is irrelevant to the discussion which was about whether profiling is absolutely necessary 100% of the time.

bluGill · on Jan 16, 2024

I just measured a large vector and reserving makes minimal differnce for my case. Ammortized growth is a not slow. Ymmv of course which is why we measure.

IshKebab · on Jan 16, 2024

At worst it makes no difference and at best it makes a big difference - maybe on an input that you didn't even measure!

maccard · on Jan 16, 2024

At worst it will make it slower, and at best it will make it faster[0]. The only way to know is to measure with an input that's representative, and this case is exactly why you measure.

[0] https://stackoverflow.com/questions/48535727/why-are-c-stl-v...

bluGill · on Jan 16, 2024

Which is why I measured with many different sizes. I know the fixed size version (the existing code), so I measured with 1 element, half the fixed sized version, the fixed sized version, and double the fixed size version. All with different amounts of reserved data elements. The fixed size version is faster, but not by enough to care about - at least in my case where the vector is created at startup and never changed after.

lifthrasiir · on Jan 16, 2024

Measure, but know that measurements may (very frequently) go wrong.

In this particular case, there seems no actually measurable difference between two codes. You can verify that the actual bytecode change is `c = a2 + c` vs. `c += a2`, but Python int doesn't have a separate implementation for `+=` (`__iadd__`, or `nb_inplace_add` in C) presumably because all integers have to be immutable anyway, so they can't differ that much. [1] And others have pointed out that Python in general doesn't run anything in parallel unless explicitly requested.

Therefore any difference is very likely to be due to the faulty benchmark method. The post suspiciously doesn't mention any benchmarking code (and even what the heck is `square_root_ceil`, which I assumed to be `lambda x: int(math.ceil(math.sqrt(x)))` below), but I did a quick benchmarking and got the following result in a particular environment:

     #3, small     #3, large     #4, small     #4, large
    -----------   -----------   -----------   -----------
     min   avg     min   avg     min   avg     min   avg
    ----- -----   ----- -----   ----- -----   ----- -----
    0.188 0.228   0.609 0.687   0.203 0.244   0.610 0.695
    0.203 0.215   0.610 0.642   0.203 0.211   0.625 0.653
    0.203 0.214   0.609 0.641   0.188 0.208   0.609 0.627
    0.203 0.213   0.609 0.622   0.187 0.208   0.609 0.630
    0.187 0.209   0.609 0.622   0.187 0.208   0.609 0.630

Here I'm calling `factor_fermat3` with small and large arguments, repeat that 10 times (so 20 calls), then do the same for `factor_fermat4`, and then report a single line. I did this 5 times in a row, and it is evident that the first few runs were substantially slower than later.

Modern CPUs vary their frequency to save power, and it takes some (or even much) time for CPUs to adapt to the new workload that demands higher or lower frequency. If I have put `time.sleep` between each run it might be possible that the first run repeats every time, but that would be inaccurate. For multicore systems it's also possible that the core may change over time, with separate power states, so a proper benchmarking needs some core pinning to avoid such issues.

If you only did the first run you may have concluded that #4 was slower than #3, but subsequent runs show that this is not true and they are more or less same especially when you consider the minimum time per call. The average time in comparison is highly variable and you need lots of considerations to stabilize that, which I never did in this case. Also you need statistical testing to conclude that such difference was not a fluke, which is another vast subject. In any case, this post doesn't show any such attempt.

[1] Note that `c += a2` is equivalent to `c = c + a2` and not `c = a2 + c`, which can make a difference when multiprecision arithmetic routines do not account for such cases. I tested both cases and got the same result, however.

vlovich123 · on Jan 16, 2024

Yeah, modern CPUs vary frequencies and you have to be very very careful about how you measure to account for this. This had already been true for more than 10 years now and should be muscle memory when benchmarking but it can be easy to overlook and some people don’t know. It doesn’t surprise me that someone hypothesizing that Python automatically executes different lines of code on different cores might make this mistake as they don’t have a good mental model of how the code they run executes on their machine. As you note they probably made other methodological mistakes too.

jcranmer · on Jan 16, 2024

This really needs to be upvoted harder.

The post doesn't say what units the terms a measured in, but it's evident from a quick test that the unit is ms. And when you're in the sub-ms time domain, if you don't know what you're doing, benchmarking is apt to reflect noise in its results.

gpderetta · on Jan 16, 2024

Good point, measuring is hard. Make sure you are actually know what you are measuring, and if the measurements differ form your expectations, make sure you understand why.