Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
The SHA-256 Project: learn how hash functions work by implementing one yourself (github.com/oconnor663)
213 points by oconnor663 on March 27, 2022 | hide | past | favorite | 45 comments


OP appears to be Jack O'Connor, one of the designers of BLAKE3, which is the fastest full-strength cryptographic hash function currently available. It's always nice to see practicing cryptographers also producing digestible cryptography content.


"fastest hash function" - I do not know much about cryptography, but isn't a hash function slow by design, or at least slow enough to prevent brute force ?


No but this is a common misconception. The key thing to remember is that a regular hash function and a "password hash" are very different. A password hash does need to be slow, because we want to make it expensive for an attacker who's stolen the hash to try to guess the password. But in most other use cases, we want hash functions to be as fast as possible, as long as they can uphold their security guarantees.

A maybe-interesting sidebar: If your job is to design a password hash, you still want to take a fast hash function as a starting point. All the effort that goes into making the hash function both fast and secure gives you confidence that an attacker isn't going to be able to find some shortcut that lets them compute the hash it more quickly than everyone else.


Hello, great question!

Hashes must only be slow in certain circumstances where the plaintext of the hash is vulnerable to a "pre-image" attack. This means that the entropy of the plaintext is very low and there are very few possibilities. In these cases you use extremely slow hashes like bcrypt or be forced to use AES encryption before hashing. For example: Passwords, Credit Card Security Digits, Names, etc...

Plaintext that has a LOT of entropy does not need a slow hash function. You can usually get by with SHA-256. For example: digital signatures, AES keys, Public/Private keys, large file hashes, etc...

In addition, if you are only trying to verify integrity with no need for security, a quick hash function is useful. Like a checksum on a file system. In this case a fast hash function is more important. EG: https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksu...


All of this is mostly right, but I have some minor corrections:

> or be forced to use AES encryption before hashing

Encrypting before hashing doesn't actually help us, if our hash input has low entropy. The problem is, what AES key are we going to use? If the key is public/constant, the attacker can just repeat the AES step themselves while they try to guess the input. On the other hand, if the key is secret, then we need to store it somewhere. Most likely, we're going to store it in the same place/database where we store the hash. (In this case it's basically the same as a random "salt".) But that means than at attacker who steals password hashes is probably going to steal salts at the same time.

> large file hashes

This is often true, but we have to be careful. What we really care about isn't the size of the file per se, but the number of possible inputs the attacker has to guess. So for example, if we hash a video file we downloaded from a bittorrent site, it might be easy for someone else to figure out what file that was, because even though the file is large, the total number of files on these sites isn't so large.


You are entirely correct. Unfortunately anything short of a textbook is going to be "Mostly Right" when it comes to security/encryption.

> Encrypting before hashing doesn't actually help us

Technically correct. There is no additional need to hash once you are encrypting already. However, when compliance tells you that you need to use hashing but you still need sub 10ms performance you sometimes need to use encryption. The disadvantage is the necessity for key management, HSMs, etc... Anything with low enough entropy is going to NEED encryption unless you are using ridiculous BCrypt or PBKDF2 parameters and are willing to wait a day or two to verify a hash.

> Large file hashes

If you salt the file data appropriately this is not an issue. The salt can be a part of the encrypted file ensuring that your hash doesn't collide.


Proof-of-work cryptocurrencies moved toward complex hash functions that are "ASIC resistant" in order to attempt to keep mining power as distributed as possible (for CPUs), because whatever highly-funded manufacturer designs the most-optimized and fastest ASICs gains concentrated mining power. ASIC hash/W efficiency doubles every couple of years, so using them as heaters becomes cost-ineffective as difficulty adjusts.

But what if we turned this all on its head and intentionally designed a hash function that is "ASIC optimized" so that the perfect ASIC miner design could be found quickly (perhaps at its very conception), and miners became a generic commodity item, "grey goo miners", and used them as space heaters in cold climates, or maybe even wherever electric heating was needed in industry? That way mining would be as distributed as the need for electric heating, and its energy would be put to good use. It might also put a ceiling on the total electricity consumed because it would be hard to compete with any miner whose electric bills are "free" (relatively).


>what if we turned this all on its head and intentionally designed a hash function that is "ASIC optimized" so that the perfect ASIC miner design could be found quickly

Since hash functions are applied to variable-sized parameters, optimizations always exist. You can always compute the hash by processing less of the input, or by processing it in parallel. Just because it's not known how to do this doesn't mean it's impossible. It's impossible to know when an implementation of an algorithm (whether in software or hardware) is optimal.


Not necessarily. The mining algorithm could hash the block once using SHA256, then use that and a fixed-size nonce as a fixed-size parameter for the ASIC-optimized hash function.


A conceivable optimization would be to compute the final digest by skipping the SHA-256 step.


Isn't 99% optimal good enough?


I thought that’s the world we’re already in with bitcoin, since it never changed the hash function, and ASICs of increasing efficiency have been built for it and used to harness energy that would otherwise be lost?


Are you seriously advocating crypto mining specially for heating? Would be great if you linked some studies on the efficiency of ASICs vs common heaters.


They're both ~100% efficient at converting electricity into heat.


Which is spectacularly inefficient. (If you wonder how that could be, here's a hint: your home is not an isolated system)


Please, let's not get bogged down in a discussion about heat pumps.


Fine.

New York City's heating system is powered from the "waste heat" of nearby electricity plants. https://en.wikipedia.org/wiki/New_York_City_steam_system

Not that NYC is particularly unique on this matter, plenty of different cities do this. https://en.wikipedia.org/wiki/District_heating

As such, New York City's heating is practically free (repurposed "waste heat" from the nearby power-plants).

Your definition of "100% efficient" is rather... bad. Both heat pumps, and NYC Steam Heating systems are far above 100% efficient by your definition.

------

There's plenty of ways to go "above 100% efficient", when it comes to BTC-as-a-heater idea.


Why not? The proponents of “efficient” heating with Bitcoin ASICs are gearing up to mentally lock in another mythical “benefit.”


Yep, it didn't take long for Bitcoiners to arrive at this idea. Here's Andrew Poelstra arguing in 2015 that PoW should approach the thermodynamic limit: https://download.wpsoftware.net/bitcoin/asic-faq.pdf


I wonder when efficiency gains will be slow enough so that a PoW heater will pay for itself before it's obsolete.


Wrote a similar article, in Barbarian language for the German audience here, and completely in Python.

I found it really difficult to find written and understandable material for this topic. Mostly people posts Youtube clickbait videos about that.

Though, this one goes far beyond the simple process. So I almost don't dare to compare my profane post to this elaborarte description.

"Wie funktioniert der SHA256 Algorithmus…im Detail? (Teil 1/2) – nicky reinert" https://nickyreinert.de/blog/2021/10/31/wie-funktioniert-der...


"Deprecated: implode(): Passing glue string after array is deprecated. Swap the parameters in /htdocs/wp-content/plugins/crayon-syntax-highlighter/util/crayon_util.class.php on line 73"

Just a heads up.


> in Barbarian language for the German audience here

I assume this is supposed to say "Bavarian"?


> Barbarian language

nice


I was hoping this would explain a bit more about why the algorithm does the things it does. Do we have proofs for why certain types of mixing are more secure than others?

Also, it's funny how even world-class cryptographers can think it's perfectly fine to directly output internal state as the digest. Some things are only obvious in retrospect.


> I was hoping this would explain a bit more about why the algorithm does the things it does. Do we have proofs for why certain types of mixing are more secure than others?

Only the most simple of proofs for the most simple of circumstances.

Lets say you have a 2-bit number (x0, and x1 for the two bits) that you want to mix up / encrypt. The output can be 0-bits, 1-bit, or 2-bits. 3-bits or more wouldn't work (8-outputs vs 4-inputs).

1-bit "wastes a bit" of information (you go from either 4-possible inputs into 2-possible outputs).

Therefore, having 2-bit input (4-possible inputs), your maximal amount of "mixing up" you can do is to have 4-possible outputs. That is, a 2-bit input should mix into a 2-bit output.

There are patterns for how to make these "bijections". AES used Galois Fields (which are always a bijection), as well as S-tables (substitution tables, or 8-bit / 256-input to 8-bit / 256-output tables to mix things up 8-bits at a time).

It turns out that ADD / XOR / BITSHIFT are all you need to make a modern, high-speed bijection. So more modern ciphers prefer those functions instead.

-------

From there on out, there's just a lot of testing that goes on. You run tests to see if any bits seem "corelated" to each other (ie: Differential Analysis. When Bit#25 changes, does that cause Bit#66 to change 50% of the time? Or does it change 52% of the time? Because 52% of the time would allow us to run probability attacks and figure out how Bit#25 and #66 are connected with just 25-or-so trials).

So you try to make your functions as "flat" as possible, to defeat probability analysis (also known as differential analysis). There are other probability/statistical tests, but these are largely ad-hoc ideas.

So you learn all the cryptography-probability tricks about how other ciphers have failed (correlation between bits, etc. etc.), and emperically test your functions against such analysis. Make your probability distributions as "flat" as possible (50.00000% correlation between bits), and you'll defeat differential-cryptoanalysis.

--------

The only way for 2-bits of input to mix into 2-bits of output is to have a 1-to-1 and onto function, also known as a bijection (and sometimes called a permutation).

> Also, it's funny how even world-class cryptographers can think it's perfectly fine to directly output internal state as the digest. Some things are only obvious in retrospect.

What's wrong with this? There's the same information at that point. You only have another permutation function. Once the internal state is "sufficiently mixed up" (that is... in the 2-bit case... the 2-bits of input can become 2-bits of output in a sufficiently mixed up way), you can't get any better than just returning the input state.

Once all 256-bits of input have a 50% chance of changing all 256-bits of state, then you can just output the internal state as your "answer". In fact, "mixing" them up any further might accidentally throw off your statistical analysis... so its better to NOT muck with the number anymore.


> What's wrong with this? There's the same information at that point. You only have another permutation function. Once the internal state is "sufficiently mixed up" (that is... in the 2-bit case... the 2-bits of input can become 2-bits of output in a sufficiently mixed up way), you can't get any better than just returning the input state.

I'm sure the original poster is referring to https://en.wikipedia.org/wiki/Length_extension_attack

It seems like the general solution in the SHA family has been to return a subset of the internal state ("truncation"). This then creates unknown state that an attacker would have to guess in order to perform a length extension.

https://en.wikipedia.org/wiki/SHA-2#Comparison_of_SHA_functi...

Edit: Indeed, the original article turns out to teach you to implement the length extension attack at the end, and states that it "was a design mistake" in SHA-256 not to truncate.


Fair enough. But that argument is a bit subtle.

Going back to the 2 bit input example, if your internal state is 4 bits, you only have 2 bits of entropy.

So outputting the first two bits of your 4 but state is still (probably) valid as far as a hash is concerned, assuming a good enough mixing function.

But at the same time, if you need all 4 bits to adequately continue the hash, then 'hiding' those other two bits will help out.

So it's not so much the truncation, as much as it is the unnecessary expansion and then later truncation steps that's useful for hashing. (Ex: SHA512 truncated to 256 bits is immune to the attack described).

In any case, these attacks are ad-hoc and don't seem to have any pattern. Crypto-experts design against attacks like these. Since there is no pattern, you end up having to study all these attacks and designing against all of them.


The most interesting part of this is the section on Length Extension Attack. Implementing SHA256 isn't all that interesting since it's just a mechanistic translation of an algorithm, but the attack shows off one of the implications of how the function works.


I found this video to be an excellent explanation and sufficient to implement the algorithm: https://www.youtube.com/watch?v=f9EbD6iY9zI

Code: https://gist.github.com/void4/6f5ff23a3df81d6115fceb6adefddd...

This site contains a nice visualization: https://sha256algorithm.com/


I really like this. Cryptography is obviously very hard to get right and complex, and for that reason people have said "don't roll your own crypto" - but really, that's stupid. What we need is accessible information to explain cryptography to people, so they know how to make smart crypto decisions, know what pitfalls exist, etc.

I think crypto has done a particularly bad job of providing accessible materials, historically, and has had a terrible attitude towards doing so. Things are changing, and that's nice.


> people have said "don't roll your own crypto" - but really, that's stupid. What we need is accessible information to explain cryptography to people, so they know how to make smart crypto decisions

I disagree.

The "crypto is hard" saying is there for a reason. It is scarily easy to make a crypto-breaking mistake in a simple looking piece of code.

Yes people need to know "how to make smart crypto decisions", but when you are dealing with sysadmins or other end-users, I put it to you that they could not care less about the mathematical theory, what they really want/need to know is "what settings should I be using with $insert_name_of_crypto_library in my config file".

Let me put it to you another way...

If we extend your theory, then we would have to teach the world basic chemistry about Sodium hypochlorite so that they can make "smart decisions" about whether or not to drink a bottle of bleach.

But maybe its just simpler to use the "crypto library" approach and stick a couple of globally recognised scary looking icons on the bottle along with a short paragraph of generic text in the local language ?


The two are not incompatible. Sure, most people are better off using off-the-shelf crypto or chemistry instead of rolling their own.

But as a society, we stand to benefit from educating people. Someone has to actually build the off-the-shelf crypto, right? So if strictly everybody is told "nah, don't roll your own", who's going to advance the field?

My point is that we should facilitate people understanding things. Sure, this may push some people to figure "meh, I know everything, let me just roll my own" and proceed to build something buggy. But that happens anyway.

> Yes people need to know "how to make smart crypto decisions", but when you are dealing with sysadmins or other end-users, I put it to you that they could not care less about the mathematical theory, what they really want/need to know is "what settings should I be using with $insert_name_of_crypto_library in my config file".

I'm a sysadmin and very much like to know how the things I deploy and manage work, so I can make decisions about things. That doesn't mean I'm out here writing my own OS or anything.

Also, I think being educated allows people to be less susceptible to snake-oil salesmen and the like.

A good current-day example is the debate about cryptography backdoors. Maybe if more people understood how things worked, they wouldn't keep parroting the line of "nothing to hide [from the government]", because they'd understand that bad guys are extremely likely to have that very same access.


>because they'd understand that bad guys are extremely likely to have that very same access.

Despite access being available most "bad guys" / people wanting to talk to one use popular platforms just like "good guys." Big platforms are typically easier to use and have a lot of other people already on it. It's not about what people have access to, it is what people use.


Hi. Can I just say that you both are right and I find myself agreeing with both of you? Thank you.


> But as a society, we stand to benefit from educating people. Someone has to actually build the off-the-shelf crypto, right? So if strictly everybody is told "nah, don't roll your own", who's going to advance the field?

I'm sorry but I don't buy that argument.

The main problem with that line of argument is where do you stop.

I mean sure, its great to "educate people" but in the end each person only has so many hours in the day. They also have other constraints on their time (jobs, families etc.).

If we take your argument to its logical conclusion then you will tell me that I cannot consider visiting a Chinese Restaurant unless I have educated myself by learning Chinese and the intricacies of Chinese cuisine and cooking techniques so then I can make "smart decisions" about what I eat for dinner when all I really want is some roast duck and some steamed prawn dumplings ?

Sheesh.


That's really not my argument.

I'm not talking about forcing education on anyone, and preventing them from doing things if they're uneducated.

Rather, I'm all for making education easily available. Sure, if you want to roll your own crypto or go to a Chinese restaurant without learning about those things, go ahead. I have absolutely no problem with that.

My point is that education about those (and really, any) topics shouldn't be hard to come by if you're interested.

That doesn't mean that you should have to pass a Chinese culture exam in order to be allowed in a restaurant.

But if you're curious about Chinese cuisine, it's better if you don't have to jump through hoops to learn about it, or have to visit 10000 restaurants for 10 years, so you can learn about it through "trial and error".

---

edit:

The specific line your first comment cited from its parent is, emphasis mine:

>> people have said "don't roll your own crypto" - but really, that's stupid. What we need is accessible information to explain cryptography to people, so they know how to make smart crypto decisions

This is what I'm replying to. Easy access doesn't mean forcing anyone to learn anything.


then we would have to teach the world basic chemistry

When I was in high school, they did. Is that not the case anymore?


It is still taught from a chemical standpoint, but the second and third order consequences of ingestion are not really addressed. You might get a generic disclaimer somewhere saying "harmful to humans" here and there.


This isn't forcing people to learn crypto, it's providing resources so that they can. Obviously it's simpler to use crypto libraries.


The skills needed to write good crypto are different to those needed to use it. There is some overlap of course, but less than is implicated here.

Probably the best advice for using crypto is to use algorithms for what they were intended. Which will require some reading and the materials aren't as accessible as they ought to be.


Yeah I have mixed feelings about "don't roll your own crypto". On the one hand, I don't like discouraging learning. On the other hand, a culture of putting warnings in our README's like "I rolled this myself so it's probably broken and not suitable for production use" is a good thing.

If I could expand our traditional warning a little bit, I might say something like this:

If you roll your own crypto, you're going to make mistakes. That's fine; mistakes are how we learn. But crypto mistakes tend to be invisible, and your code will appear to work even when it has no security. So if you want to learn crypto, you need to find a way to get feedback about your invisible mistakes.


I wonder how the newer hashes, like Reinforced Concrete, that are used in Zero-Knowledge applications compare to SHA-256.

I mean can Reinforced Concrete replace SHA-256, as the primary hash, in a blockchain?


Hash functions are generally interchangeable except in their length and their speed.


and brokenness :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: