The SHA-256 Project: learn how hash functions work by implementing one yourself

josnyder · on March 27, 2022

OP appears to be Jack O'Connor, one of the designers of BLAKE3, which is the fastest full-strength cryptographic hash function currently available. It's always nice to see practicing cryptographers also producing digestible cryptography content.

upcoming-sesame · on March 27, 2022

"fastest hash function" - I do not know much about cryptography, but isn't a hash function slow by design, or at least slow enough to prevent brute force ?

oconnor663 · on March 27, 2022

No but this is a common misconception. The key thing to remember is that a regular hash function and a "password hash" are very different. A password hash does need to be slow, because we want to make it expensive for an attacker who's stolen the hash to try to guess the password. But in most other use cases, we want hash functions to be as fast as possible, as long as they can uphold their security guarantees.

A maybe-interesting sidebar: If your job is to design a password hash, you still want to take a fast hash function as a starting point. All the effort that goes into making the hash function both fast and secure gives you confidence that an attacker isn't going to be able to find some shortcut that lets them compute the hash it more quickly than everyone else.

foxyv · on March 28, 2022

Hello, great question!

Hashes must only be slow in certain circumstances where the plaintext of the hash is vulnerable to a "pre-image" attack. This means that the entropy of the plaintext is very low and there are very few possibilities. In these cases you use extremely slow hashes like bcrypt or be forced to use AES encryption before hashing. For example: Passwords, Credit Card Security Digits, Names, etc...

Plaintext that has a LOT of entropy does not need a slow hash function. You can usually get by with SHA-256. For example: digital signatures, AES keys, Public/Private keys, large file hashes, etc...

In addition, if you are only trying to verify integrity with no need for security, a quick hash function is useful. Like a checksum on a file system. In this case a fast hash function is more important. EG: https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksu...

oconnor663 · on March 28, 2022

All of this is mostly right, but I have some minor corrections:

> or be forced to use AES encryption before hashing

Encrypting before hashing doesn't actually help us, if our hash input has low entropy. The problem is, what AES key are we going to use? If the key is public/constant, the attacker can just repeat the AES step themselves while they try to guess the input. On the other hand, if the key is secret, then we need to store it somewhere. Most likely, we're going to store it in the same place/database where we store the hash. (In this case it's basically the same as a random "salt".) But that means than at attacker who steals password hashes is probably going to steal salts at the same time.

> large file hashes

This is often true, but we have to be careful. What we really care about isn't the size of the file per se, but the number of possible inputs the attacker has to guess. So for example, if we hash a video file we downloaded from a bittorrent site, it might be easy for someone else to figure out what file that was, because even though the file is large, the total number of files on these sites isn't so large.

foxyv · on March 28, 2022

You are entirely correct. Unfortunately anything short of a textbook is going to be "Mostly Right" when it comes to security/encryption.

> Encrypting before hashing doesn't actually help us

Technically correct. There is no additional need to hash once you are encrypting already. However, when compliance tells you that you need to use hashing but you still need sub 10ms performance you sometimes need to use encryption. The disadvantage is the necessity for key management, HSMs, etc... Anything with low enough entropy is going to NEED encryption unless you are using ridiculous BCrypt or PBKDF2 parameters and are willing to wait a day or two to verify a hash.

> Large file hashes

If you salt the file data appropriately this is not an issue. The salt can be a part of the encrypted file ensuring that your hash doesn't collide.

f7ebc20c97 · on March 27, 2022

Proof-of-work cryptocurrencies moved toward complex hash functions that are "ASIC resistant" in order to attempt to keep mining power as distributed as possible (for CPUs), because whatever highly-funded manufacturer designs the most-optimized and fastest ASICs gains concentrated mining power. ASIC hash/W efficiency doubles every couple of years, so using them as heaters becomes cost-ineffective as difficulty adjusts.

But what if we turned this all on its head and intentionally designed a hash function that is "ASIC optimized" so that the perfect ASIC miner design could be found quickly (perhaps at its very conception), and miners became a generic commodity item, "grey goo miners", and used them as space heaters in cold climates, or maybe even wherever electric heating was needed in industry? That way mining would be as distributed as the need for electric heating, and its energy would be put to good use. It might also put a ceiling on the total electricity consumed because it would be hard to compete with any miner whose electric bills are "free" (relatively).

deadbeeves · on March 27, 2022

>what if we turned this all on its head and intentionally designed a hash function that is "ASIC optimized" so that the perfect ASIC miner design could be found quickly

Since hash functions are applied to variable-sized parameters, optimizations always exist. You can always compute the hash by processing less of the input, or by processing it in parallel. Just because it's not known how to do this doesn't mean it's impossible. It's impossible to know when an implementation of an algorithm (whether in software or hardware) is optimal.

f7ebc20c97 · on March 27, 2022

Not necessarily. The mining algorithm could hash the block once using SHA256, then use that and a fixed-size nonce as a fixed-size parameter for the ASIC-optimized hash function.

deadbeeves · on March 27, 2022

A conceivable optimization would be to compute the final digest by skipping the SHA-256 step.

f7ebc20c97 · on March 27, 2022

Isn't 99% optimal good enough?

SilasX · on March 27, 2022

I thought that’s the world we’re already in with bitcoin, since it never changed the hash function, and ASICs of increasing efficiency have been built for it and used to harness energy that would otherwise be lost?

yunohn · on March 27, 2022

Are you seriously advocating crypto mining specially for heating? Would be great if you linked some studies on the efficiency of ASICs vs common heaters.

f7ebc20c97 · on March 27, 2022

They're both ~100% efficient at converting electricity into heat.

cedilla · on March 27, 2022

Which is spectacularly inefficient. (If you wonder how that could be, here's a hint: your home is not an isolated system)

f7ebc20c97 · on March 27, 2022

Please, let's not get bogged down in a discussion about heat pumps.

dragontamer · on March 27, 2022

Fine.

New York City's heating system is powered from the "waste heat" of nearby electricity plants. https://en.wikipedia.org/wiki/New_York_City_steam_system

Not that NYC is particularly unique on this matter, plenty of different cities do this. https://en.wikipedia.org/wiki/District_heating

As such, New York City's heating is practically free (repurposed "waste heat" from the nearby power-plants).

Your definition of "100% efficient" is rather... bad. Both heat pumps, and NYC Steam Heating systems are far above 100% efficient by your definition.

------

There's plenty of ways to go "above 100% efficient", when it comes to BTC-as-a-heater idea.

foobiekr · on March 27, 2022

Why not? The proponents of “efficient” heating with Bitcoin ASICs are gearing up to mentally lock in another mythical “benefit.”

nemo1618 · on March 27, 2022

Yep, it didn't take long for Bitcoiners to arrive at this idea. Here's Andrew Poelstra arguing in 2015 that PoW should approach the thermodynamic limit: https://download.wpsoftware.net/bitcoin/asic-faq.pdf

f7ebc20c97 · on March 27, 2022

I wonder when efficiency gains will be slow enough so that a PoW heater will pay for itself before it's obsolete.

y42 · on March 27, 2022

Wrote a similar article, in Barbarian language for the German audience here, and completely in Python.

I found it really difficult to find written and understandable material for this topic. Mostly people posts Youtube clickbait videos about that.

Though, this one goes far beyond the simple process. So I almost don't dare to compare my profane post to this elaborarte description.

"Wie funktioniert der SHA256 Algorithmus…im Detail? (Teil 1/2) – nicky reinert" https://nickyreinert.de/blog/2021/10/31/wie-funktioniert-der...

k__ · on March 27, 2022

"Deprecated: implode(): Passing glue string after array is deprecated. Swap the parameters in /htdocs/wp-content/plugins/crayon-syntax-highlighter/util/crayon_util.class.php on line 73"

Just a heads up.

chimeracoder · on March 27, 2022

> in Barbarian language for the German audience here

I assume this is supposed to say "Bavarian"?

lfkdev · on March 27, 2022

> Barbarian language

nice

nemo1618 · on March 27, 2022

I was hoping this would explain a bit more about why the algorithm does the things it does. Do we have proofs for why certain types of mixing are more secure than others?

Also, it's funny how even world-class cryptographers can think it's perfectly fine to directly output internal state as the digest. Some things are only obvious in retrospect.

dragontamer · on March 27, 2022

> I was hoping this would explain a bit more about why the algorithm does the things it does. Do we have proofs for why certain types of mixing are more secure than others?

Only the most simple of proofs for the most simple of circumstances.

Lets say you have a 2-bit number (x0, and x1 for the two bits) that you want to mix up / encrypt. The output can be 0-bits, 1-bit, or 2-bits. 3-bits or more wouldn't work (8-outputs vs 4-inputs).

1-bit "wastes a bit" of information (you go from either 4-possible inputs into 2-possible outputs).

Therefore, having 2-bit input (4-possible inputs), your maximal amount of "mixing up" you can do is to have 4-possible outputs. That is, a 2-bit input should mix into a 2-bit output.

There are patterns for how to make these "bijections". AES used Galois Fields (which are always a bijection), as well as S-tables (substitution tables, or 8-bit / 256-input to 8-bit / 256-output tables to mix things up 8-bits at a time).

It turns out that ADD / XOR / BITSHIFT are all you need to make a modern, high-speed bijection. So more modern ciphers prefer those functions instead.

-------

From there on out, there's just a lot of testing that goes on. You run tests to see if any bits seem "corelated" to each other (ie: Differential Analysis. When Bit#25 changes, does that cause Bit#66 to change 50% of the time? Or does it change 52% of the time? Because 52% of the time would allow us to run probability attacks and figure out how Bit#25 and #66 are connected with just 25-or-so trials).

So you try to make your functions as "flat" as possible, to defeat probability analysis (also known as differential analysis). There are other probability/statistical tests, but these are largely ad-hoc ideas.

So you learn all the cryptography-probability tricks about how other ciphers have failed (correlation between bits, etc. etc.), and emperically test your functions against such analysis. Make your probability distributions as "flat" as possible (50.00000% correlation between bits), and you'll defeat differential-cryptoanalysis.

--------

The only way for 2-bits of input to mix into 2-bits of output is to have a 1-to-1 and onto function, also known as a bijection (and sometimes called a permutation).

> Also, it's funny how even world-class cryptographers can think it's perfectly fine to directly output internal state as the digest. Some things are only obvious in retrospect.

What's wrong with this? There's the same information at that point. You only have another permutation function. Once the internal state is "sufficiently mixed up" (that is... in the 2-bit case... the 2-bits of input can become 2-bits of output in a sufficiently mixed up way), you can't get any better than just returning the input state.

Once all 256-bits of input have a 50% chance of changing all 256-bits of state, then you can just output the internal state as your "answer". In fact, "mixing" them up any further might accidentally throw off your statistical analysis... so its better to NOT muck with the number anymore.

schoen · on March 27, 2022

> What's wrong with this? There's the same information at that point. You only have another permutation function. Once the internal state is "sufficiently mixed up" (that is... in the 2-bit case... the 2-bits of input can become 2-bits of output in a sufficiently mixed up way), you can't get any better than just returning the input state.

I'm sure the original poster is referring to https://en.wikipedia.org/wiki/Length_extension_attack

It seems like the general solution in the SHA family has been to return a subset of the internal state ("truncation"). This then creates unknown state that an attacker would have to guess in order to perform a length extension.

https://en.wikipedia.org/wiki/SHA-2#Comparison_of_SHA_functi...

Edit: Indeed, the original article turns out to teach you to implement the length extension attack at the end, and states that it "was a design mistake" in SHA-256 not to truncate.

dragontamer · on March 27, 2022

Fair enough. But that argument is a bit subtle.

Going back to the 2 bit input example, if your internal state is 4 bits, you only have 2 bits of entropy.

So outputting the first two bits of your 4 but state is still (probably) valid as far as a hash is concerned, assuming a good enough mixing function.

But at the same time, if you need all 4 bits to adequately continue the hash, then 'hiding' those other two bits will help out.

So it's not so much the truncation, as much as it is the unnecessary expansion and then later truncation steps that's useful for hashing. (Ex: SHA512 truncated to 256 bits is immune to the attack described).

In any case, these attacks are ad-hoc and don't seem to have any pattern. Crypto-experts design against attacks like these. Since there is no pattern, you end up having to study all these attacks and designing against all of them.

userbinator · on March 27, 2022

The most interesting part of this is the section on Length Extension Attack. Implementing SHA256 isn't all that interesting since it's just a mechanistic translation of an algorithm, but the attack shows off one of the implications of how the function works.

7373737373 · on March 27, 2022

I found this video to be an excellent explanation and sufficient to implement the algorithm: https://www.youtube.com/watch?v=f9EbD6iY9zI

Code: https://gist.github.com/void4/6f5ff23a3df81d6115fceb6adefddd...

This site contains a nice visualization: https://sha256algorithm.com/

staticassertion · on March 27, 2022

I really like this. Cryptography is obviously very hard to get right and complex, and for that reason people have said "don't roll your own crypto" - but really, that's stupid. What we need is accessible information to explain cryptography to people, so they know how to make smart crypto decisions, know what pitfalls exist, etc.

I think crypto has done a particularly bad job of providing accessible materials, historically, and has had a terrible attitude towards doing so. Things are changing, and that's nice.

traceroute66 · on March 27, 2022

> people have said "don't roll your own crypto" - but really, that's stupid. What we need is accessible information to explain cryptography to people, so they know how to make smart crypto decisions

I disagree.

The "crypto is hard" saying is there for a reason. It is scarily easy to make a crypto-breaking mistake in a simple looking piece of code.

Yes people need to know "how to make smart crypto decisions", but when you are dealing with sysadmins or other end-users, I put it to you that they could not care less about the mathematical theory, what they really want/need to know is "what settings should I be using with $insert_name_of_crypto_library in my config file".

Let me put it to you another way...

If we extend your theory, then we would have to teach the world basic chemistry about Sodium hypochlorite so that they can make "smart decisions" about whether or not to drink a bottle of bleach.

But maybe its just simpler to use the "crypto library" approach and stick a couple of globally recognised scary looking icons on the bottle along with a short paragraph of generic text in the local language ?

vladvasiliu · on March 27, 2022

The two are not incompatible. Sure, most people are better off using off-the-shelf crypto or chemistry instead of rolling their own.

But as a society, we stand to benefit from educating people. Someone has to actually build the off-the-shelf crypto, right? So if strictly everybody is told "nah, don't roll your own", who's going to advance the field?

My point is that we should facilitate people understanding things. Sure, this may push some people to figure "meh, I know everything, let me just roll my own" and proceed to build something buggy. But that happens anyway.

> Yes people need to know "how to make smart crypto decisions", but when you are dealing with sysadmins or other end-users, I put it to you that they could not care less about the mathematical theory, what they really want/need to know is "what settings should I be using with $insert_name_of_crypto_library in my config file".

I'm a sysadmin and very much like to know how the things I deploy and manage work, so I can make decisions about things. That doesn't mean I'm out here writing my own OS or anything.

Also, I think being educated allows people to be less susceptible to snake-oil salesmen and the like.

A good current-day example is the debate about cryptography backdoors. Maybe if more people understood how things worked, they wouldn't keep parroting the line of "nothing to hide [from the government]", because they'd understand that bad guys are extremely likely to have that very same access.

charcircuit · on March 27, 2022

>because they'd understand that bad guys are extremely likely to have that very same access.

Despite access being available most "bad guys" / people wanting to talk to one use popular platforms just like "good guys." Big platforms are typically easier to use and have a lot of other people already on it. It's not about what people have access to, it is what people use.

adityeah · on March 27, 2022

Hi. Can I just say that you both are right and I find myself agreeing with both of you? Thank you.

traceroute66 · on March 27, 2022

> But as a society, we stand to benefit from educating people. Someone has to actually build the off-the-shelf crypto, right? So if strictly everybody is told "nah, don't roll your own", who's going to advance the field?

I'm sorry but I don't buy that argument.

The main problem with that line of argument is where do you stop.

I mean sure, its great to "educate people" but in the end each person only has so many hours in the day. They also have other constraints on their time (jobs, families etc.).

If we take your argument to its logical conclusion then you will tell me that I cannot consider visiting a Chinese Restaurant unless I have educated myself by learning Chinese and the intricacies of Chinese cuisine and cooking techniques so then I can make "smart decisions" about what I eat for dinner when all I really want is some roast duck and some steamed prawn dumplings ?

Sheesh.

vladvasiliu · on March 27, 2022

That's really not my argument.

I'm not talking about forcing education on anyone, and preventing them from doing things if they're uneducated.

Rather, I'm all for making education easily available. Sure, if you want to roll your own crypto or go to a Chinese restaurant without learning about those things, go ahead. I have absolutely no problem with that.

My point is that education about those (and really, any) topics shouldn't be hard to come by if you're interested.

That doesn't mean that you should have to pass a Chinese culture exam in order to be allowed in a restaurant.

But if you're curious about Chinese cuisine, it's better if you don't have to jump through hoops to learn about it, or have to visit 10000 restaurants for 10 years, so you can learn about it through "trial and error".

---

edit:

The specific line your first comment cited from its parent is, emphasis mine:

>> people have said "don't roll your own crypto" - but really, that's stupid. What we need is accessible information to explain cryptography to people, so they know how to make smart crypto decisions

This is what I'm replying to. Easy access doesn't mean forcing anyone to learn anything.

userbinator · on March 27, 2022

then we would have to teach the world basic chemistry

When I was in high school, they did. Is that not the case anymore?

jimmygrapes · on March 27, 2022

It is still taught from a chemical standpoint, but the second and third order consequences of ingestion are not really addressed. You might get a generic disclaimer somewhere saying "harmful to humans" here and there.

staticassertion · on March 27, 2022

This isn't forcing people to learn crypto, it's providing resources so that they can. Obviously it's simpler to use crypto libraries.

zgs · on March 27, 2022

The skills needed to write good crypto are different to those needed to use it. There is some overlap of course, but less than is implicated here.

Probably the best advice for using crypto is to use algorithms for what they were intended. Which will require some reading and the materials aren't as accessible as they ought to be.

oconnor663 · on March 27, 2022

Yeah I have mixed feelings about "don't roll your own crypto". On the one hand, I don't like discouraging learning. On the other hand, a culture of putting warnings in our README's like "I rolled this myself so it's probably broken and not suitable for production use" is a good thing.

If I could expand our traditional warning a little bit, I might say something like this:

If you roll your own crypto, you're going to make mistakes. That's fine; mistakes are how we learn. But crypto mistakes tend to be invisible, and your code will appear to work even when it has no security. So if you want to learn crypto, you need to find a way to get feedback about your invisible mistakes.

JaggerFoo · on March 27, 2022

I wonder how the newer hashes, like Reinforced Concrete, that are used in Zero-Knowledge applications compare to SHA-256.

I mean can Reinforced Concrete replace SHA-256, as the primary hash, in a blockchain?

deadbeeves · on March 27, 2022

Hash functions are generally interchangeable except in their length and their speed.

zgs · on March 27, 2022

and brokenness :)