It's a slight exaggeration of the information content to report the data size using an ASCII encoding. Since there are 4 bases, each can be encoded using 2 bits rather than 8. So we're really talking about 750 megabytes (roughly 3 billion bases × 2 bits). But still mind-blowing.
And since the data is highly redundant, the 750MB can be compressed down even further using standard approaches (DEFLATE works well; it uses both Huffman coding and dictionary backreferences).
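As a rough illustration (a toy sketch, not a production encoder), packing the four bases into 2 bits each and then running DEFLATE over the result might look like this:

```python
import zlib

# Map each base to 2 bits (assumption: input contains only A/C/G/T,
# no N's or ambiguity codes, which real genomes do have).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_2bit(seq: str) -> bytes:
    """Pack 4 bases per byte, most significant bits first."""
    out = bytearray()
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:
            out.append(byte)
            byte = 0
    rem = len(seq) % 4
    if rem:
        out.append(byte << (2 * (4 - rem)))  # zero-pad the final byte
    return bytes(out)

seq = "ACGT" * 1000                     # toy stand-in for a real sequence
ascii_size = len(seq.encode())          # 1 byte per base
packed = pack_2bit(seq)                 # 2 bits per base: 4x smaller
deflated = zlib.compress(packed, 9)     # DEFLATE on top of the packing
print(ascii_size, len(packed), len(deflated))
```

The 4x reduction is guaranteed; how much DEFLATE gains on top depends entirely on the redundancy of the actual sequence.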
Or you could build an embedding with far fewer parameters that could explain the vast majority of phenotypic differences. The genome is a hierarchical palimpsest of low entropy.
My standard interview question (because I hate leetcode) walks the interviewee through compressing DNA using bit encoding, then using that to implement a rolling hash to do fast frequency counting. Some folks get stuck at "how many bits in a byte?", others at "if you have 4 symbols, how many bits are required to encode a symbol?", and other candidates jump straight to Bloom filters and other probabilistic approaches.
(https://github.com/bcgsc/ntHash and https://github.com/dib-lab/khmer are good places to start if you are interested).
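A minimal version of that exercise (hypothetical interview-style code, not ntHash itself, which uses a fancier rolling recurrence) might look like:

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_counts(seq: str, k: int) -> Counter:
    """Count k-mer frequencies with a rolling 2-bit code: each step
    shifts the previous code left by 2 bits, masks off the base that
    fell out of the window, and ORs in the new base -- O(1) per step
    instead of re-hashing the whole k-mer."""
    mask = (1 << (2 * k)) - 1
    counts = Counter()
    code = 0
    for i, base in enumerate(seq):
        code = ((code << 2) | CODE[base]) & mask
        if i >= k - 1:              # window is full
            counts[code] += 1
    return counts

counts = kmer_counts("ACGTACGT", 3)   # 6 overlapping 3-mers
```

Because the code is an exact 2k-bit integer rather than a lossy hash, this works as-is for small k; for large k you'd switch to a true rolling hash plus a Bloom filter or count-min sketch.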
I'm curious if these 750MB + the DNA of mitochondria + the protein metagenomics contain all the information needed to build a human, or if there's extra info stored in the machinery of the first cell.
That is, if we transferred the DNA to an advanced alien civilization, would they be able to make a human?
This is a complex question. The cocktail soup in a gamete (sperm or egg) and in the resulting zygote contains an awful lot of stuff that would be extremely hard to replace. I could imagine that if the receiving civilization were sufficiently advanced and had a model of what those cells contained (beyond the genomic information), they could build some sort of artificial cell that could bootstrap the genome to the point of being able to start the development process. It would be quite an accomplishment.
If they just received the DNA without some information about the zygote, I don't think it would be practical for even an advanced alien civilization (LR5 or LR6), but probably an LR7 and definitely an LR8 could do it.
I’m just pondering this, and it’s not clear to me that there is anything intrinsic in the genome itself that explicitly ‘says’ “this sequence of DNA bases encodes a protein” or even “these three base-pairs equate to this amino acid”.
I wonder if that information could ever really be untangled by a civilisation starting entirely from scratch, without access to a cell.
If you knew what DNA was and had seen a protein, you could easily figure out start/stop codons. If you had only seen something similar, it would be harder. If you had nothing similar, I don't know.
Coding DNA and non-coding DNA look very different. Proteins are full of short repetitive sequences that form structural elements like alpha helices: https://en.wikipedia.org/wiki/Alpha_helix
Once you've identified roughly where the protein-coding genes are, it would be trivial to identify the 3'/5' ends as being common to all those regions. You could pretty easily imagine a much more complicated system with different transcription mechanisms and codon categories, but earth genomes are super simple in that respect. Once you have those, you just have the (incredibly complex) problem of creating a polymerase and bam, you'll be able to print every single gene in the body.
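To make the "figure out where the coding regions are" step concrete, here's a toy open-reading-frame scan (assuming the standard ATG start and TAA/TAG/TGA stop codons; real gene finding is far messier, with introns, splice sites, and regulatory context):

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 10):
    """Return (start, end) spans of open reading frames on the forward
    strand: an ATG followed, in the same reading frame, by a stop codon
    at least min_codons away. Long ORFs are unlikely by chance, which
    is the statistical signal a from-scratch decoder could exploit."""
    orfs = []
    for frame in range(3):            # three possible reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

In random sequence a stop codon appears about every 21 codons per frame, so runs of hundreds of stop-free codons stand out sharply.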
Without the right balance of promoters/factors/polymerase you probably won't get anything close to a human cell, but you'd be able to at least work closer to what the natural balance should be, and once you get closer to building a correct ribosome etc the cell would start to self-correct.
It’s an interesting question. Naively, I would expect it to be about like reverse engineering a CPU from a binary program. Which sounds daunting but maybe not impossible if you understand the fundamentals of registers, memory, opcodes, etc.
But… doing so from first principles without a mental model of how all (human) CPUs work? I guess it comes down to whether the recipients had enough context to know what they’re looking at.
Yes, it's intrinsic in the genome but implemented through such a complicated mechanism that attempting to understand these things from first principles is impractical, not impossible.
In genomic science we nearly always use more cheaply available information rather than attempt to solve the hard problem directly. For example, for decades, a lot of sequencing focused only on the transcribed parts of the genome (which typically encode proteins), letting biology do the work of determining which parts are protein.
If you look at the process biophysically, you will see there are actual proteins that bind to the regions just before a protein, because the DNA sequences there match some pattern the protein recognizes. If you move that signal in front of a non-coding region, the apparatus will happily transcribe and even attempt to translate the non-coding region, making a garbage protein.
Since the cat is out of the bag: no, it's not a typo. It's related to Kardashev but is oriented around the common path most galactic civilizations follow on the way to either senescence (LR8.0) or singularity (LR8.1-4). Each level in LR is effectively unaware of the levels above it, basically because the level above is an Outside Context Problem.
Humans are currently LR2 (food security) and approaching LR3 (artificial general intelligence, self genetic modification). LR4 is generally associated with multiplanetary homing (i.e., could survive a fatal meteor strike on the home planet) and LR5 with multisolar homing (i.e., could survive a fatal solar incident). LR6 usually has total mastery of physical matter, LR7 can read remote multiverses, and LR8.2 can write remote multiverses. To the best of LR8's knowledge, there is no LR9, so far as their detectors can tell, but it would be hard to say, as LR9 implies existence in multiple multiverses simultaneously. Further, faster-than-light travel and time travel both remain impossible, so far as LR8 can tell.
“An Outside Context Problem was the sort of thing most civilisations encountered just once, and which they tended to encounter rather in the same way a sentence encountered a full stop.”
― Iain M. Banks, Excession
“Unbelievable. I’m in a fucking Outside Context situation, the ship thought, and suddenly felt as stupid and dumb-struck as any muddy savage confronted with explosives or electricity.”
― Iain M. Banks, Excession
“It was like living half your life in a tiny, scruffy, warm grey box, and being moderately happy in there because you knew no better...and then discovering a little hole in the corner of the box, a tiny opening which you could get your finger into, and tease and pull apart at, so that eventually you created a tear, which led to a greater tear, which led to the box falling apart around you... so that you stepped out of the tiny box's confines into startlingly cool, clear fresh air and found yourself on top of a mountain, surrounded by deep valleys, sighing forests, soaring peaks, glittering lakes, sparkling snow fields and a stunning, breathtakingly blue sky. And that, of course, wasn't even the start of the real story, that was more like the breath that is drawn in before the first syllable of the first word of the first paragraph of the first chapter of the first book of the first volume of the story.”
― Iain M. Banks, Excession
If we're at LR2, and each level is effectively unaware of the levels above it, how do we know what LR3/4/5/6/7/8/9 are or might be?
Or do you mean that a civilization at a particular level will always be unaware of civilizations above? That doesn't seem to make sense either; I see no reason why a LR4 civ couldn't have knowledge of a LR5 civ, for example.
Yes, but it currently requires developmentally mature individuals to build the gametes, and the "code" is so complex you couldn't really decipher it from first principles.
It would not necessarily be possible, because the genome is incremental instructions on how to make the hardware, but based on already-existing, unspecified, and very complex hardware. So the first instruction would be something like "take the stuff you have on your left and fuse it with the stuff you have on your right", both being unspecified, very complex proteins assumed to be present.
Imagine a machine shop that has blueprints of components of the machines they use in the shop, and processes to assemble machines from the components. When a machine shop grows large and splits in two, each half inherits half of the shop, with the ongoing processes, and a copy of the blueprints.
https://m.youtube.com/watch?v=B7PMf7bBczQ&pp=QAFIAQ%3D%3D
DNA is the blueprints.
There are infinite possibilities for what to do with them. The advanced civilization would need additional information, like the fact that they are supposed to create a cell from the components to begin with, and a lot of detailed information about how exactly to do that.
"if we transfer the DNA to an advanced alien civilization - would they be able to make a human."
I'm really surprised that in all these responses to your question no one's mentioned the womb or the mother, who (at least with current technology) is still necessary for making a human.
This is a question about theoretical possibilities and what you're saying seems to be a rigid belief in an answer "no". But you provided no evidence or justification, except for "with current technology", which answers nothing about the theoretical question.
We know that is not true, due to the distinct genetic code of mitochondria and the known epigenetic influences of mothers on their children in utero.
You could say “well that's the last 10% of the details, maybe 90% is in the DNA,” but I think I would be suspicious that it's that high, because one of the things we know about humans is that we are born with all of the ova that we will ever have, rather than deferring the process until puberty. I should think that if it could be deferred it would have been, “you will spend the energy to make these 15 years before you need to for no real reason” seems very unlike evolution whereas “my body is going to teach you how to make these eggs, just the same as my mother's body taught me,” sounds quite evolutionarily reasonable.
> That is if we transfer the DNA to an advanced alien civilization - would they be able to make a human.
You'd need a cell to start the process, with the various nucleic acids distributed correctly and proteins/energy with which to create further proteins using the information encoded by the DNA. Thus the civilization would need information about cells and a set of building blocks before being able to use the DNA.
Including code for the proteins that read DNA to produce proteins. You might hit similar problems trying to understand C given the source code for a C compiler - a non-standard environment could reproduce itself given the source code, meaning the code alone doesn't strictly determine the output.
I'll torture this DNA and C source code analogy a bit.
Epigenetics is missing in this discussion about reproducing a human from just the DNA. These are superficial modifications (e.g. methylation, histone modification, repressor factors) to a strand of DNA that can drastically alter how specific regions get expressed into proteins. These mechanisms essentially work by either hiding or unhiding DNA from RNA polymerases and other parts of the transcription complex. These mechanisms can change throughout your lifetime because of environmental factors and can be inherited.
So it's like reading C source code, except there are so many of these inscrutable C preprocessor directives strewn all throughout. You won't get a successful compilation by turning all the directives on or off. Instead, you need to get this similarly inscrutable configuration blob that tells you how to set each directive.
I guess in a way, it's like the weights for an ML model. It just works, you can't explain why it works, and changing this weight here produces a program that crashes prematurely, and changing a weight there produces a program with allergic reactions to everything.
There's also some interesting work on understanding the role of loops in the physical structure of the DNA storage on gene expression. [0] The base sequence of the DNA isn't everything; it may also matter how the DNA gets laid out in space, a feature which can be inherited.
It's a bit like this: if I have the source code of Linux (think DNA), can I build a machine running Linux (think cell)? No, you can't; you need a machine that can run the code.
I.e., "software" without a "machine" to run it on is kind of useless.
Yes, and if you gzip it it's even smaller. But the big takeaway is that the amount of info that fully defines a human is what we consider "not much data," even in its plainest encoding.
We don't know that it fully defines a human until we can create one without the starting condition of being inside another human. It's prototype-based inheritance.
Some of the research about being able to make simple animals grow structures from other animals in their evolutionary “tree” by changing chemical signaling—among other wild things like finding that memories may be stored outside the brain, at least in some animals—makes me think you need more than just the “code” to get the animal that would have been produced if that “code” were in its full context (of a reproductive cell doing all sorts of other stuff). Even if the dna contains the instructions for that reproductive cell, too, in some sense… which instructions do you “run”? There might be multiple possible variants, some of which don’t actually reproduce the animal you took the dna from.
My favorite trivia here is that flamingos aren't actually "genetically" pink but "environmentally" pink because they pick up the color from eating algae.
Except of course "genetics" and "environment" aren't actually separate things; sure, people's skin color isn't usually affected by their food, but only because most people don't eat colloidal silver.
AFAIK most poisonous frogs also aren’t “naturally” poisonous—they get it from diet. Ones raised in captivity aren’t poisonous unless you go out of your way to feed them the things they need to become poisonous.
bzip2 is marginally better, and then genome-specific compressors were developed, and then finally, people started storing individual genomes as diffs from a single reference, https://en.wikipedia.org/wiki/CRAM_(file_format)
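The reference-diff idea can be sketched in a few lines (a loose illustration of the concept only; real CRAM works on aligned reads and handles insertions, deletions, and quality scores, not just substitutions):

```python
def diff_encode(genome: str, reference: str):
    """Store only the (position, base) pairs where this genome differs
    from a shared reference sequence."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]

def diff_decode(diffs, reference: str) -> str:
    """Rebuild the genome by applying the diffs to the reference."""
    seq = list(reference)
    for i, b in diffs:
        seq[i] = b
    return "".join(seq)
```

Since two humans differ at roughly 0.1% of positions, the diff is orders of magnitude smaller than the genome itself, which is the whole point of reference-based storage.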
Since genome files contain more data than just ATGC (a FASTQ file typically has a header line, then a DNA line, then a separator, then a quality score line), and each of those draws from a different distribution, DEFLATE on such a file doesn't reach the full potential of the compressor: the Huffman table ends up having to hold all the distributions at once, and the dictionary backreferences aren't as efficient either.
It turns out you can split the file into multiple streams, one per line type, and then compress those independently, with slightly better compression ratios, but it's still not great.
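A toy version of the stream-splitting idea (hypothetical code over a synthetic FASTQ-style byte string; real tools are far more careful about record parsing):

```python
import zlib

# Synthetic FASTQ-like data: 4-line records (header, sequence, separator, quality).
fastq = b"".join(
    b"@read%d\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" % i for i in range(500)
)

def split_streams(data: bytes):
    """Group lines by their position within the 4-line record, so that
    each output stream draws from a single distribution (all headers
    together, all sequences together, and so on)."""
    lines = data.split(b"\n")
    return [b"\n".join(lines[i::4]) for i in range(4)]

whole = len(zlib.compress(fastq, 9))
split = sum(len(zlib.compress(s, 9)) for s in split_streams(fastq))
print(whole, split)
```

On a tiny input like this, the overhead of four separate zlib headers can eat the gains; the benefit shows up on real multi-gigabyte FASTQ files, where each stream's distribution is stable enough for the compressor to model well.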