(with a wild coding language)
This is my first attempt at understanding the complexity of information within human DNA. I'll attempt to compare it to ways we describe technology (bits). Later I'll show how a direct measure may be like trying to describe irrational numbers with extra decimals (wrong way to think about it). To start, I'll be focusing on an estimate of biological information within the human genome, which may then be readily compared to technological storage.
I have yet to see a single section of code (software) that can replicate the storage (info compression) of DNA structurally.
Technology is catching up (learning/converging?) fast though. Honest disclaimer: There's going to be wild handwaving and back-of-the-envelope calculations. If you'd prefer a scientific analysis on this topic you'll be disappointed. What I sacrifice in rigor I hope to replace with accessibility to laymen like myself (basic bio background only, and an overactive imagination).
This is one of those thought riffs where I wish I had full knowledge of my fiancé Michelle's molecular biology background. If she ever does read this post, I'm sure she'll be shaking her head sadly ;). For a quick and dirty background, I'm happy to refer you to a beginner's guide to genetics.
Kansas is going bye bye, so buckle up and prepare for a wild yet intellectually lazy ride to grasp at the very patterns of life!
As a "far out" frame of reference and inspiration, I look to Kevin Kelly's larger envelope of calculations. His extropy chapter is a lengthy read, and I've included a relevant snippet at the end of the post. Kevin's estimation of Extropy is spectacular in its imaginative benchmarks. Mr. Kelly's writing is like a type of human programming. He gets thousands of folks excited, and we all dig into tiny fragments of his bold ideas. Alas, if only web programming were so. (github?)
For this discussion, I'll be focusing on the information contained within one human DNA strand (nuclear DNA, excluding mitochondrial and other forms). Even in this single information storage mechanism, there is incredible complexity, and I'm certain my approximations will suffer due to a lack of complete knowledge of gene regulatory networks (more on that later).
I liken DNA's compression to a form of metaprogramming where earlier proteins and constructs beget greater complexity in the network (differentiation, feedback, competing construction/replication, fractal growth). Evolution makes all this possible with each generation, with gravity towards reoccurring emergent realizations (senses, nervous systems, etc). These are zones of stability in chaos theory.
How many bits are in a strand of human DNA?
Let's first identify the total number of base pairs and active protein coding genes. The wiki on the human genome was very helpful in providing these figures, which are good enough to generate ballpark estimates.
The human genome is the genome of Homo sapiens, which is stored on 23 chromosome pairs. Twenty-two of these are autosomal chromosome pairs, while the remaining pair is sex-determining. The haploid human genome occupies a total of just over 3 billion DNA base pairs. The Human Genome Project (HGP) produced a reference sequence of the euchromatic human genome, which is used worldwide in biomedical sciences.
The haploid human genome contains ca. 23,000 protein-coding genes, far fewer than had been expected before its sequencing. In fact, only about 1.5% of the genome codes for proteins, while the rest consists of non-coding RNA genes, regulatory sequences, introns, and (controversially named) "junk" DNA.
Consider the language nature uses: at 2 bits per base, 3e9 base pairs yield 6 gigabits, or ~750 megabytes, assuming unrestricted combinations (there are restrictions). The human genome is one subset of this upper bound on possibilities.
To begin, I've found some quick calculations done by Andrew Yates (thanks Andrew) that show the number to be under 735 megabytes. After sharing his calculations, I'll discuss which components I believe push the information content far beyond this data estimate.
How much data IS a human genome?
2 bits per base (4 bases = 2^2)
3,080.4 Mb per human genome 
700 MB per CD-ROM
(1 human genome) *
(3,080,400,000 bases / 1 human genome) *
(2 bits / 1 base) *
(1 byte / 8 bits) *
(1 MB / 1,048,576 bytes) =
734.4 MB per uncompressed human genome. Easily enough to fit on a 700 MB CD-ROM with basic file compression like gzip.
A plausible and defensible estimate of raw genomic data, and it backs up my first wild guess at bio data.
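Andrew's unit chain is easy to check in a few lines of Python. This is only a sketch of the arithmetic above: the base count, the 2 bits per base, and the binary megabyte all come from his figures, while the name `genome_size_mb` is my own invention.

```python
BASES_PER_GENOME = 3_080_400_000  # haploid base count used in the estimate above
BITS_PER_BASE = 2                 # 4 possible bases = 2**2
BYTES_PER_MB = 1_048_576          # binary megabyte, as in the estimate above


def genome_size_mb(bases=BASES_PER_GENOME, bits_per_base=BITS_PER_BASE):
    """Uncompressed genome size in MB (hypothetical helper name)."""
    total_bits = bases * bits_per_base
    total_bytes = total_bits / 8
    return total_bytes / BYTES_PER_MB


print(f"{genome_size_mb():.1f} MB per uncompressed human genome")  # 734.4 MB
```

Swapping in a different base count or bits-per-base value is a quick way to see how sensitive the ballpark is to those two assumptions.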
After some reading I came across another intrepid soul attempting to bound the depth of genomic data. The Alchemist suggests variations which greatly increase the complexity.
This question has a simple answer and a complicated one. The calculation starts like this : 3 billion base pairs - each of which can be one of four (A,C,G,T). So biology uses base 4 not base two - the result is around the 6 billion bit mark and comparable to the space on a hard drive.
However, this is where the analogy with computers breaks down. The genome's 'bits' can be further modified by methylation and acetylation. Considering only methyl groups, this means that each base can take on eight, not four, possible forms. Although this would seem to double the amount of information it doesn't work like that. Such meta-data is used for various other purposes, like chromatin assembly. Simply put, this is a little like data archiving or compression - although the reality is much more complex.
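As a quick sanity check on the "eight, not four" remark, here is a small sketch of the raw symbol arithmetic. It takes the quoted eight forms at face value and assumes every form is equally likely, which is my simplification and not a claim about real methylation patterns. The helper name `bits_per_symbol` is made up for illustration.

```python
import math


def bits_per_symbol(num_forms):
    """Bits needed to label one of num_forms equally likely symbols."""
    return math.log2(num_forms)


plain = bits_per_symbol(4)       # A, C, G, T
methylated = bits_per_symbol(8)  # each base with or without a methyl group

print(plain)               # 2.0
print(methylated)          # 3.0
print(methylated / plain)  # 1.5
```

So even before the biology complicates things, going from four forms to eight adds one bit per base (a 1.5x ceiling), not a doubling.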
A keen commenter Brazil adds additional qualifying information:
The Alchemist used a very good metaphor by comparing the genome with a hard drive, however it goes actually much further: how much of the information actually resides on the disk, and how much of it is in the drive mechanics, electronics and firmware?
Recent research results suggest that a significant part of the information that makes our bodies work the way they do is not contained in the DNA itself, but rather in the metabolism of the cells that contain it, which is not only necessary to interpret the information, but actually modifies it and offers "added value" (like, for example, the cache in a hard drive).
Brazil is getting at the crux of the information. In the simple estimate, there is no mention of active genes, which would at first appear to greatly reduce the complexity of the genome. Only 1.5% of our base pairs code for proteins, so this would imply only about 11 megabytes of information using Andrew's framework.
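The 11 megabyte figure is just the 1.5% coding fraction applied to Andrew's uncompressed total. A tiny sketch, with variable names of my own choosing:

```python
GENOME_MB = 734.4        # uncompressed estimate from Andrew's calculation
CODING_FRACTION = 0.015  # about 1.5% of the genome codes for proteins

coding_mb = GENOME_MB * CODING_FRACTION
print(f"{coding_mb:.1f} MB of protein-coding sequence")  # 11.0 MB
```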
Hold on a moment. There are 23,000 active gene sites. Each gene can interact and influence other genes in the network. Instead of thinking about the human genome as bits of data, it is more fitting to think of DNA as a dynamic library for generating genes. These genes code for proteins which feedback and allow for construction of new genes (similar to metaprogramming).
In order to get a feel for the information in the human genome we should consider the complexity of our gene regulatory network. For that I turned to a paper on modelling gene regulatory networks by Shoudan Liang, Stefanie Fuhrman, and Roland Somogyi.
In it there is a reference to Shannon's information, where maximal entropy (information) occurs when the elements in the system are equiprobable. Each gene is represented by a binary state: it's either on or off. So the total number of possible states is 2^23,000 (a 23k-bit vector of coded genes): if they're all 0, none are actively coded; if they're all 1, they're all on. But the states progress over time, and are a function of the previous state and external conditions. We require a different model to frame information within DNA.
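To get a feel for that on/off picture, here is a rough sketch of the numbers. It assumes each gene is an independent binary switch, which is exactly the simplification the paragraph above says is not enough. The function name `binary_entropy` is mine.

```python
import math

NUM_GENES = 23_000  # protein-coding gene count quoted earlier


def binary_entropy(p):
    """Shannon entropy, in bits, of one gene that is 'on' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Number of decimal digits in 2**NUM_GENES, computed with logarithms
# rather than by printing the (enormous) number itself.
digits = math.floor(NUM_GENES * math.log10(2)) + 1

# Entropy peaks at 1 bit per gene when on and off are equally likely.
max_bits = NUM_GENES * binary_entropy(0.5)

print(digits)               # 6924
print(max_bits)             # 23000.0
print(binary_entropy(0.1))  # well under 1 bit for a lopsided gene
```

The state count is a number with thousands of digits, yet the information in any one snapshot tops out at 23,000 bits, under 3 kilobytes. The interesting part has to live in how states follow one another.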
A Complex Adaptive System (CAS) is a dynamic network of many agents (which may represent cells, species, individuals, firms, nations) acting in parallel, constantly acting and reacting to what the other agents are doing. The control of a CAS tends to be highly dispersed and decentralized. If there is to be any coherent behavior in the system, it has to arise from competition and cooperation among the agents themselves. The overall behavior of the system is the result of a huge number of decisions made every moment by many individual agents.
A CAS behaves/evolves according to three key principles: order is emergent as opposed to predetermined (c.f. Neural Networks), the system's history is irreversible, and the system's future is often unpredictable. The basic building blocks of the CAS are agents. Agents scan their environment and develop schema representing interpretive and action rules. These schema are subject to change and evolution.
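A toy example may help make "each state is a function of the previous state" concrete. The sketch below is a three-gene Boolean network with wiring I made up purely for illustration. It is not the model from the Liang, Fuhrman, and Somogyi paper, and real regulatory networks are vastly larger and messier.

```python
from itertools import product


def step(state):
    """Advance the toy network one tick; the rules are hypothetical."""
    a, b, c = state
    return (
        b and not c,  # gene A switches on when B is on and C is off
        a or c,       # gene B switches on when A or C is on
        a and b,      # gene C switches on when A and B are both on
    )


def attractor(state):
    """Follow a start state until it repeats; return the cycle it lands in."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return frozenset(seen[seen.index(state):])


all_states = list(product((False, True), repeat=3))
attractors = {attractor(s) for s in all_states}

print(len(all_states))  # 8 possible on/off combinations
print(len(attractors))  # 2 long-run behaviours
```

Eight possible states collapse into two long-run behaviours: an "all off" fixed point and a two-state oscillation. Several different starting states also share the same successor, so the network cannot be run backwards, which is a small taste of the irreversibility mentioned above.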
Over time (generations) the human genome can increase in complexity. The upper bound of information within an evolving gene regulatory network is therefore unknown. Said another way, even if we quantize the bits in our DNA today, there is no direct measure of the system's future information potential. Perhaps it's better to characterize the rate of increase of DNA information per generation?
Given the above charlatan's analysis, although I'm agnostic: I now believe nature, evolution or God is most certainly a Hacker.
- Beginner's guide to genetics
- Human genome wiki
- Human chromosome
- genetic code
- Gene regulatory networks
An excerpt from Kevin Kelly's Extropy.
This is more clearly seen at the extreme. The difference between four bottles of amino acids on a laboratory shelf and the four amino acids arrayed in your chromosomes lies in the additional structure, or ordering, those atoms get from participating in the spirals of your replicating DNA. Same atoms, more order. Those atoms of amino acids acquire yet another level of structure and order when their cellular host undergoes evolution. As organisms evolve, the informational code their atoms carry is manipulated, processed, and reordered. In addition to genetic information, the atoms now convey adaptive information. Over time, the same atoms can be promoted to new levels of order. Perhaps their one cell home joins another cell to become multicellular — that demands the informational architecture for a larger organism as well as a cell. Further transitions in evolution — the aggregation into tissues and organs, the acquisition of sex, the creation of social groups — continue to elevate the order and increase the structure of the information flowing through those same atoms.
Later Kevin stretches farther than I expected:
For four billion years evolution has been accumulating knowledge in its library of genes. You can learn a lot in four billion years. Every one of the 30 million or so unique species of life on the planet today is an unbroken informational thread that traces back to the very first cell. That thread (DNA) learns something new each generation, and adds that hard-won knowledge to its code. Geneticist Motoo Kimura estimates that the total genetic information accumulated since the Cambrian explosion 500 million years ago is 10 megabytes per genetic lineage. Now multiply the unique information held by every individual organism by all the organisms alive in the world today and you get an astronomically large treasure. Imagine the Noah's Ark that would be needed to carry the genetic payload of every organism on earth (seeds, eggs, spores, sperms). One study estimated the earth harbored 10^30 single-cell microbes. A typical microbe, like a yeast, produces one one-bit mutation per generation, which means one bit of unique information for every organism alive. Simply counting the microbes alone (about 50% of the biomass), the biosphere contains 10^30 bits, or 10^29 bytes, or 10,000 yottabytes of genetic information. That's a lot.
And that is only the biological information. The technium is awash in its own ocean of information. Measured by the amount of digital storage in use, the technium today contains 487 exabytes (10^20) of information, many orders smaller than nature's total, but growing. Technology expands data by 66% per year, overwhelming the growth rates of any natural source. Compared to other planets in the neighborhood, or to the dumb material drifting in space beyond, a thick blanket of learning and self-organized information surround this orb.