In my last posting on SoftwareBiology, I ended with the observation that there were a great number of similarities between biological and computer software and alluded to the possibility that this similarity could have arisen from both belonging to a higher category of entities that face a commonality of problems with the second law of thermodynamics and nonlinearity. That will be the subject of this posting, which will deal with one of the oddest things in the physical Universe – self-replicating information in the form of living things, Richard Dawkins’ memes, and software. This posting will not make much sense if you have not read SoftwareBiology and learned of Richard Dawkins’ concept of living things as DNA survival machines, so I would recommend reading it before proceeding.
Self-Replicating Information – Information that persists through time by making copies of itself or by enlisting the support of other things to ensure that copies of itself are made.
Most forms of information, left to their own devices, simply degrade into total disorder and maximum entropy as the second law of thermodynamics relentlessly whittles away at them. Just picture in your mind what happens to the information encoded on a disk drive over a period of 10,000 years. But what if there were some kind of information that could beat out the second law of thermodynamics by constantly making copies of itself, so that as each disk drive wore out, many more copies took its place? Such a form of self-replicating information would quickly outcompete and overwhelm other forms of non-replicating information on disk drives and come to dominate. Actually, we now call such self-replicating forms of information computer viruses. But computer viruses and other forms of software are just the latest wave of self-replicating information on this planet. Billions of years before the arrival of software, living things emerged from a soup of organic molecules as the first form of self-replicating information, and about 200,000 years ago, memes, or self-replicating cultural artifacts, emerged in the minds of Homo sapiens.
To summarize, over the past 4.5 billion years there have been three waves of self-replicating information on this planet:
1. Living things beginning about 4.0 billion years ago
2. Memes beginning about 200,000 years ago
3. Software beginning in the spring of 1941 on Konrad Zuse’s Z3 computer
For those of you not familiar with the term meme, it rhymes with the word “cream”. Memes are cultural artifacts that persist through time by making copies of themselves in the minds of human beings and were first recognized by Richard Dawkins in The Selfish Gene (1976). Dawkins described memes as follows: “Examples of memes are tunes, ideas, catch-phrases, clothes fashions, ways of making pots or of building arches. Just as genes propagate themselves in the gene pool by leaping from body to body via sperms or eggs, so memes propagate themselves in the meme pool by leaping from brain to brain via a process which, in the broad sense, can be called imitation.” Just as genes come together to build bodies, or DNA survival machines, for their own mutual advantage, memes also come together from the meme pool to form meme-complexes for their own joint survival. DNA survives down through the ages by inducing disposable DNA survival machines, in the form of bodies, to produce new disposable DNA survival machines. Similarly, memes survive in meme-complexes by inducing the minds of human beings to reproduce memes in the minds of others. To the genes and memes, human bodies are simply disposable DNA survival machines housing disposable minds that come and go with a lifespan of less than 100 years. The genes and memes, on the other hand, continue on largely unscathed by time as they skip down through the generations. However, both genes and memes do evolve over time through the Darwinian mechanisms of innovation and natural selection. You see, the genes and memes that do not come together to build successful DNA survival machines or meme-complexes are soon eliminated from the gene and meme pools. So both genes and memes are selected for one overriding characteristic – the ability to survive. Once again, the “survival of the fittest” rules the day.
Now it makes no sense to think of genes or memes as being either “good” or “bad”; they are just mindless forms of self-replicating information bent upon surviving with little interest in you as a disposable survival machine. So in general, these genes and memes are not necessarily working in your best interest, beyond keeping you alive long enough so that you can pass them on to somebody else. That is why, if you examine the great moral and philosophical teachings of most religions and philosophies, you will see a plea for us all to rise above the selfish self-serving interests of our genes and memes.
Meme-complexes come in a variety of sizes and can become quite large and complicated with a diverse spectrum of member memes. Examples of meme-complexes of increasing complexity and size would be Little League baseball teams, clubs and lodges, corporations, political and religious movements, tribal subcultures, branches of the military, governments and cultures at the national level, and finally the sum total of all human knowledge in the form of all the world cultures, art, music, religion, and science put together. Meme-complexes can do wonderful things, as is evidenced by the incredible standard of living enjoyed by the modern world, thanks to the efforts of the scientific meme-complex, or the great works of art, music, and literature handed down to us from the Baroque, Classical, and Romantic periods, not to mention the joys of jazz, rock and roll, and the blues. However, meme-complexes can also turn incredibly nasty. Just since the Scientific Revolution of the 17th century we have seen the Thirty Years War (1618 – 1648), the Salem witch hunts (1692), the French Reign of Terror (1793 – 1794), American slavery (1654 – 1865), World War I (all sides) (1914 – 1918), the Stalinist Soviet Union (1929 – 1953), National Socialism (1933 – 1945), McCarthyism (1949 – 1958), Mao’s Cultural Revolution (1966 – 1976), and Pol Pot’s reign of terror (1976 – 1979).
The problem is that when human beings get wrapped up in a meme-complex, they can do horrendous things without even being aware of the fact. This is because in order to survive, the first thing that most meme-complexes do is to use a meme that turns off human thought and reflection. To paraphrase Descartes, “I think, therefore I am” a heretic. So if you questioned any of the participants caught up in any of the above atrocious events, you would find that the vast majority would not have any qualms about their deadly activities whatsoever. In fact, they would question your loyalty and patriotism for even bringing up the subject. For example, during World War I, which caused 40 million casualties and the deaths of 20 million people for apparently no particular reason at all, there were few dissenters beyond Albert Einstein in Germany and Bertrand Russell in Great Britain, and both suffered the consequences of not being on board with the World War I meme-complex. Unquestioning blind obedience to a meme-complex through unconditional group-think is definitely a good survival strategy for any meme-complex. But the scientific meme-complex has an even better survival strategy – skepticism and scrutiny. Using skepticism and scrutiny may not seem like a very good survival strategy for a meme-complex because it calls into question the validity of the individual memes within the meme-complex itself. But that can also be a crucial advantage. By eliminating memes from within the scientific meme-complex that cannot stand up to skepticism and scrutiny, the whole scientific meme-complex is strengthened, and when this skepticism and scrutiny are turned outwards towards other meme-complexes, the scientific meme-complex is strengthened even more so. There will be more on this in my next posting.
Another problem with meme-complexes is that, like DNA survival machines, they are usually very conservative when it comes to admitting new member memes, just as DNA survival machines usually find new mutated genes less than welcome. This is because, thanks to the second law of thermodynamics and nonlinearity, most new member memes or genes prove to be very detrimental; they put the DNA or meme survival machine at risk. But this frequently leads to inbreeding of thought within a meme-complex. Yes, we are always admonished to “think outside of the box”, at least until we actually try to do so, but “thinking outside of the box” is usually frowned upon by most meme-complexes, even the scientific meme-complex. One way to overcome this inbreeding of thought is through the cross-fertilization of memes from other meme-complexes. Bringing in foreign memes that are already well accepted in their home meme-complexes helps to reduce the dubious nature of a completely new meme. However, even this is usually met with great resistance. Basically, this is what I have been trying to do with softwarephysics for nearly 30 years by bringing into IT memes from physics, biology, and chemistry. When I left physics in 1973 to become a geophysicist, I was greatly impressed by the cross-fertilization of ideas from geology and physics which led to the effective theory of plate tectonics in the 1960s. Neither meme-complex could have developed plate tectonics on its own.
As an IT professional, I assume that you already know more than enough about software, but in this posting we will also examine software as a form of self-replicating information.
The Characteristics of Self-Replicating Information
All the above forms of self-replicating information have some common characteristics.
1. All self-replicating information evolves over time through the Darwinian processes of innovation and natural selection, which endows self-replicating information with one telling characteristic – the ability to survive in a Universe dominated by the second law of thermodynamics and nonlinearity.
2. All self-replicating information begins spontaneously as a parasitic mutation that obtains energy, information and sometimes matter from a host.
3. With time, the parasitic self-replicating information takes on a symbiotic relationship with its host.
4. Eventually, the self-replicating information becomes one with its host through the symbiotic integration of the host and the self-replicating information.
5. Ultimately, the self-replicating information replaces its host as the dominant form of self-replicating information.
6. Most hosts are also self-replicating information.
7. All self-replicating information has to be a little bit nasty in order to survive.
Since living things were the first form of self-replicating information on this planet, the origin and evolution of living things is the archetype for all the other forms of self-replicating information. We shall begin there.
The Origin of Life
In SoftwareBiology we saw that living things largely consist of two flows, a flow of energy and a flow of information. The flow of energy is called metabolism, which provides the energy necessary to overcome the second law of thermodynamics. In SoftwareChemistry we discussed how the Krebs cycle converts the energy in carbohydrates into ATP, which is then used to drive all biochemical reactions that require some free energy to proceed. Figure 10 of SoftwareChemistry depicts the very complicated metabolic do-loop that is the Krebs cycle. More succinctly, the Krebs cycle converts pyruvate to CO2 and produces reducing energy in the form of NADH and FADH2 and phosphorylated energy in the form of GTP.
2 pyruvate + 2 GDP + 2 H3PO4 + 4 H2O + 2 FAD + 8 NAD+ ----> 6 CO2 + 2 GTP + 2 FADH2 + 8 NADH
The NADH and FADH2 can be used to generate ATP, using an electron transport chain in the presence of oxygen, and GTP can be easily converted into ATP in one simple reaction. The net result, once the glycolysis that first splits glucose into two pyruvate molecules is included, is the generation of about 38 molecules of ATP per molecule of glucose.
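The figure of about 38 ATP can be checked with some simple bookkeeping. The sketch below is only a back-of-the-envelope tally for one glucose molecule (two pyruvate). It assumes the classic textbook conversion rates of 3 ATP per NADH and 2 ATP per FADH2, plus a net 2 ATP and 2 NADH from glycolysis. None of those rates appear in the reaction above, and more recent estimates put the real-world total lower, at roughly 30 to 32.

```python
# Assumed textbook conversion rates (not taken from the reaction above).
ATP_PER_CARRIER = {"NADH": 3, "FADH2": 2, "GTP": 1, "ATP": 1}

# Right-hand side of the Krebs cycle reaction above, for 2 pyruvate.
krebs_products = {"NADH": 8, "FADH2": 2, "GTP": 2}

# Assumed net yield of glycolysis, which splits 1 glucose into 2 pyruvate.
glycolysis_products = {"ATP": 2, "NADH": 2}


def atp_yield(products):
    """Convert a tally of energy carriers into ATP equivalents."""
    return sum(ATP_PER_CARRIER[name] * count for name, count in products.items())


krebs_atp = atp_yield(krebs_products)
glycolysis_atp = atp_yield(glycolysis_products)
total_atp = krebs_atp + glycolysis_atp

print("Krebs cycle (2 pyruvate):", krebs_atp)
print("Glycolysis (1 glucose):  ", glycolysis_atp)
print("Total per glucose:       ", total_atp)
```

Under these assumptions the two pyruvate molecules account for 30 ATP and glycolysis for the remaining 8.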
The ultimate source of all information flow in living things is the transcription of DNA genes into mRNA and the translation of that mRNA into proteins, which roughly goes as:
DNA + RNA polymerase + ATP -> mRNA + Ribosomes + tRNA + ATP -> polypeptide chain -> protein
This is a genetic flow of information. The problem is that both of the above reactions require enzyme proteins. So we have a very difficult chicken and egg problem here. In order to conduct metabolic reactions, we need enzyme proteins, and to create enzyme proteins from genetic information, we also need enzyme proteins and the energy from metabolic reactions. So for the origin of life, which came first – genetic information or enzymes? And how could either have come first if they both depend upon each other?
Much of what follows comes from Origins of Life (1999) by Freeman Dyson, another one of my favorite physicists and authors. This is a wonderfully succinct book, a mere 100 pages long, that I believe is a marvelous example of the cross-fertilization of memes from physics into the meme-complex of biology. Many times fresh memes from outside a discipline are required when its meme-complex gets “stuck” on a problem like the origin of life. This was definitely true of geology in the early 1960s. By that time, geologists had done a marvelous job at figuring out what had happened over the past billion years of geological time, but they could not figure out why things had happened. By mapping outcrops and road cuts, geologists were able to see mountains rise from the sea over the course of tens of millions of years, only to be later eroded down to flat plains over the course of hundreds of millions of years, and they saw massive volcanic eruptions like the Deccan Traps covering 500,000 square miles of India to a depth of 6,000 feet, and there were the ever-present earthquakes and volcanoes to deal with too. But by the early 1960s, the geologists were stuck; they could not figure out what was going on. It took the cross-fertilization of some memes from physics with some memes from geology to form the new science of geophysics in the 1950s. The end result was the theory of plate tectonics which finally supplied the answer. It turns out that the Earth is covered by a series of very large plates, moving about as fast as your fingernails grow. Mountains form when these plates collide, like a car accident in slow motion, slowly crumpling the hoods of cars.
Dyson points out that currently there are three competing theories for the origin of life:
1. Metabolism came first. This theory was first proposed by Russian biochemist Alexander Oparin in The Origin of Life (1924). Oparin proposed that primitive cell-like structures came first, followed by molecules with catalytic properties similar to enzymes, and finally genetic information stored in genes. This was way before the structure of DNA was revealed in 1953 by James Watson and Francis Crick, so naturally Oparin focused on something that he did know about – organic chemistry. Oparin proposed that the early Earth had a reducing atmosphere, without the presence of oxygen, so large organic molecules could naturally form without being oxidized as they would be in today’s atmosphere. Actually, the Universe is just chock full of organic molecules, which are found in the large molecular clouds out of which stars form, the atmospheres of most of the planets in the Solar System, meteorites that have struck the Earth, and in the tails of comets. Oparin noted that when oily substances composed of organic molecules, like the phospholipids previously discussed, are agitated in water, they naturally form spherical cell-like structures similar to the membranes of cells. Oparin proposed that organic molecules trapped within these cell-like structures would, through innovation and natural selection, slowly begin to compete for the monomers necessary to perform complex organic chemical reactions, and would thus form a primitive metabolism. Continued innovation and natural selection would lead to primitive catalytic enzyme proteins forming within these proto-cells, and ultimately, some way to store this information in genes. Again, in 1924 nobody had a clue as to how genetic information was stored in genes, so Oparin’s metabolic theory had to be necessarily vague about this final step.
2. RNA came first. This theory was proposed by Manfred Eigen in 1981 and asserts that genes came first, stored as RNA instead of DNA, enzyme proteins came second, and lastly cells. Eigen’s theory is based upon the fact that RNA is a very dynamic molecule, as we saw in SoftwareBiology. RNA can both store genetic information and also perform primitive catalytic functions similar to enzyme proteins all at the same time, reducing the chicken and egg problem to a much simpler chicken/egg problem in which both the chicken and the egg appeared simultaneously. Also, as the structure of DNA and RNA became known in the 1950s and 1960s, it was realized that RNA, built from only four nucleotides, is a much simpler structure than protein molecules, which are chains built from 20 different amino acids, so it was naturally thought that RNA likely preceded the proteins. Eigen proposed that the early Earth contained a population of RNA nucleotides A, C, U, and G, which randomly came together to form a rudimentary self-replicating form of RNA. This “RNA World” would be subject to Darwin’s principles of innovation and natural selection, which would select for strings of self-replicating RNA that were better at replicating than other strings of self-replicating RNA. Eigen called these early RNA replicators a quasi species, consisting of a population of similar, but not identical, self-replicating forms of RNA, like the genetically variable members of a real species. The members of a quasi species would compete with each other and evolve, just like the members of a real species. Eigen then proposed that certain quasi species came together in a cooperative association with a group of associated enzyme proteins called a hypercycle. The quasi species of RNA and the enzyme proteins in the hypercycle formed a self-sustaining coalition, essentially an early survival machine, locked in a stable equilibrium.
Dyson points out that there are several weaknesses in the “RNA World” theory. First, although it has been found incredibly easy to find the amino acid constituents of proteins in the physical Universe in interstellar molecular clouds or on the surface of many planets, making the A, C, U and G nucleotides that form RNA is much more difficult. You can easily make amino acids from simpler molecules like water, methane, ammonia, and hydrogen, as Stanley Miller did in 1953 at the University of Chicago, but not so for the A, C, U and G nucleotides of RNA. Secondly, there is the “error catastrophe” familiar to all programmers. The self-replicating processes of the “RNA world” would need to be both very accurate and very simple at the same time. Dyson shows that experimental work with RNA reveals that self-replicating RNA has a replication error of at least 1% per nucleotide, which implies a maximum length of about 100 nucleotides for a self-replicating form of RNA that does not rapidly mutate beyond all recognition after a few iterations. Now 100 nucleotides, or about 200 bits of information, is really not enough to code for a viable enzyme protein, so it is hard to see how a hypercycle could form a stable equilibrium with such high error rates. It would be like finding a computer language that allows for a 1% error rate in coding, but still produces executables that work OK.
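The arithmetic behind the error catastrophe can be sketched in a few lines. Assuming, as a simplification, that every nucleotide is copied independently with the same error rate e, the chance of a perfect copy of a strand of length L is (1 - e)^L. The function names below are illustrative only, and this is a toy calculation rather than Dyson's or Eigen's full model.

```python
def perfect_copy_probability(error_rate, length):
    """Chance that every one of `length` nucleotides is copied correctly."""
    return (1.0 - error_rate) ** length


def unbroken_lineage_probability(error_rate, length, generations):
    """Chance that a line of descent stays error-free for several copies in a row."""
    return perfect_copy_probability(error_rate, length) ** generations


ERROR_RATE = 0.01  # the 1% per-nucleotide error rate quoted above

for length in (10, 100, 1000):
    one_copy = perfect_copy_probability(ERROR_RATE, length)
    ten_copies = unbroken_lineage_probability(ERROR_RATE, length, 10)
    print(f"length {length:5d}: one copy {one_copy:.4f}, ten copies {ten_copies:.2e}")
```

With a 1% error rate, a 100-nucleotide strand is copied perfectly only about 37% of the time, and a 1,000-nucleotide strand almost never, which is why the usable length tops out at roughly 1/e, or about 100 nucleotides.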
Regardless of its limitations, the “RNA World” theory for the origin of life is currently the most favored, largely due to the huge achievements that have been made in molecular biology over the past 50 years that have uncovered all of the very impressive processes that DNA and RNA manage to perform with the aid of enzymes. Consequently, the Oparin theory of "metabolism first" has fallen by the wayside.
3. Something else came first. An example is Alexander Graham Cairns-Smith’s theory popularized in his book Seven Clues to the Origin of Life (1985), in which he outlined a theory he had been working on since the mid-1960s. In this theory, there is a clay precursor to both RNA and metabolism. Clay microcrystals contain an irregular array of ionic sites to which metals, such as magnesium and aluminum, can bind. Thus, clay microcrystals can carry an irregular pattern of electrical charges, similar to the pattern of electrical charges on the sidechains of an RNA molecule. Cairns-Smith posited that, just as RNA can store genetic information that self-replicates and also perform limited catalytic operations on organic molecules, clay microcrystals could do the same. The irregular pattern of the electrically charged sites of one clay microcrystal would form a template for another clay microcrystal to form upon and thus replicate. Similarly, the exposed electrically charged sites of the clay microcrystals would also be able to perform limited catalytic operations on organic molecules, just like RNA. Thus, in this model we simply replace RNA with clay microcrystals as the first replicator. The chief advantage of this model is that there is plenty of clay to go around to form the primitive “Clay World” and we do not have to worry about how the RNA nucleotides came to be in sufficiently large concentrations to make the “RNA World” possible. The self-replicating clay microcrystals would then form Eigen’s quasi species of competing self-replicating clay microcrystals, which later came together in Eigen’s hypercycle of cooperating enzyme proteins and clay microcrystals. Again, the quasi species of clay microcrystals and the enzyme proteins in the hypercycle form a self-sustaining coalition, essentially an early survival machine, locked in a stable equilibrium. 
The hypercycle then seeks refuge in already existing phospholipid membranes for protection to form primitive proto-cells. Eventually, one of these proto-cells discovered that RNA was much better at self-replicating and enzyme-like activities than clay microcrystals. This might have occurred as the clay microcrystals formed a scaffolding upon which the early forms of RNA could cling to. In this model, clay came first, followed by enzymes, then cells, and finally RNA. The main drawback to this theory is that, unlike for RNA, there is no experimental evidence showing that clay microcrystals can actually self-replicate or conduct catalytic operations on organic molecules. What Dyson does like about Cairns-Smith’s clay-based theory is that it has the origin of life take place in two steps, rather than in one step, as do Oparin’s metabolic theory and Eigen’s “RNA World” theory. First there is a clay-based form of life that is later replaced by an RNA–based form of life.
Dyson then does a brilliant intellectual cross-fertilization, by infusing a meme from Lynn Margulis, to form a new two-step theory for the origin of life. In 1966, Lynn Margulis submitted a paper, On the Origin of Mitosing Cells, which was rejected by about 15 scientific journals, again demonstrating the very conservative nature of meme-complexes and their tendency to reject new memes, even new memes with merit. The paper was finally published in The Journal of Theoretical Biology in 1967 and is now considered the seminal paper on the endosymbiotic theory for the origin of eukaryotic cells. Recall that bacteria are prokaryotic cells, with very little internal structure, like the spaghetti-code programs of the 1960s. Eukaryotic cells, on the other hand, are huge cells with about 10,000 times the volume of a typical prokaryotic cell. Because eukaryotic cells are so large, they have an internal cytoskeleton, composed of linear-shaped proteins that form filaments that act like a collection of tent poles, to hold up the huge cell membrane encircling the cell. Eukaryotic cells also have a great deal of internal structure, in the form of organelles, that are enclosed by internal cell membranes. Eukaryotic cells divide up functions amongst these organelles, like the structured programs of the 1970s and 1980s. These organelles include the nucleus to store and process the genes stored in DNA, mitochondria to perform the Krebs cycle to create ATP from carbohydrates, and chloroplasts in plants to produce energy-rich carbohydrates from water, carbon dioxide, and sunlight. The great mystery was how such complexity could arise from simple prokaryotic cells. Margulis brought in a common theme from evolutionary biology that explains how seemingly impossible complexity can arise from simpler parts.
What happens is that organisms develop a primitive function for one purpose, through small incremental changes, and then discover, through serendipity, that this new function can also be used for something completely different. This new use will then further evolve via innovation and natural selection. For example, we have all upon occasion used a screwdriver as a wood chisel in a pinch. Sure the screwdriver was meant to turn screws, but it does a much better job at chipping out wood than your fingernails, so in a pinch it will do quite nicely. Now just imagine Darwin’s processes of innovation and natural selection at work selecting for screwdrivers with broader and sharper blades and a butt more suitable for the blows from a hammer, and soon you will find yourself with a good wood chisel. At some distant point in the future, screwdrivers might even disappear for the want of screws, leaving all to wonder how the superbly adapted wood chisels came to be.
As an IT professional, you probably do this all the time. How often do you write code from scratch? I know that I never do. I simply find the closest piece of existing code that I have on hand and then turn the screwdriver into a wood chisel through small incremental changes to the code, by testing each small change to see how closely my screwdriver has evolved towards being a wood chisel. And I think that most of us also code using this Darwinian process of innovation and natural selection too. I am a rather lazy programmer, so many times rather than thinking through a new chunk of code during the iterative process of coding and testing, I will simply make an “educated guess” at the new code to be introduced. After 35 years of coding, you begin to code by “ear”. Many times, I can fall upon the correct code after a few shots of directed random change, and that sure beats racking your brain over new code. Surprisingly, sometimes I even come up with “better” code through this Darwinian process than if I sat down and carefully thought it all through. This has probably been going on since 1945, when Konrad Zuse wrote the first “Guten Tag Welt!” program in Plankalkuel – just speculating here on the origin of the compulsion for all programmers, new to a computer language, to write the obligatory “Hello World!” program as their first effort. So the basic idea of grabbing some old code or architectural design elements from a couple of older Applications and slowly modifying them through an iterative process of innovation and natural selection into a new Application is no stranger to IT. As Simon Conway Morris commented in Life’s Solution (2003) "How much of a complex organism, say a humanoid, has evolved at a much earlier stage, especially in terms of molecular architecture? In other words, how much of us is inherent in a single-celled eukaryote, or even a bacterium? 
Conversely, we are patently more than microbes, so how many genuinely evolutionary novelties can we identify that make us what we are? It has long been recognized that evolution is a past master at co-option and jury-rigging: redeploying existing structures and cobbling them together in sometimes quite surprising ways. Indeed, in many ways that is evolution”. When I first read these words, I accidentally misread the quote as "Indeed, in many ways that is IT”.
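The code-and-test loop described above, in which small random changes are kept only when they pass the test, can be sketched as a tiny mutate-and-select program. This is a toy illustration under stated assumptions: the strings, the function names, and the scoring rule (count the characters already in the right place) are all hypothetical, and real selection has no fixed target to aim for.

```python
import random
import string

ALPHABET = string.ascii_uppercase + " "


def score(candidate, target):
    """Number of positions where the candidate already matches the target."""
    return sum(a == b for a, b in zip(candidate, target))


def mutate(candidate, rng):
    """Innovation: replace one randomly chosen character with a random one."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def evolve(start, target, rng, max_steps=100_000):
    """Selection: keep a mutation only if it scores at least as well as before."""
    current = start
    for step in range(max_steps):
        if current == target:
            return current, step
        trial = mutate(current, rng)
        if score(trial, target) >= score(current, target):
            current = trial
    return current, max_steps


rng = random.Random(42)  # fixed seed so the run is repeatable
result, steps = evolve("SCREWDRIVER", "WOOD CHISEL", rng)
print(result, "after", steps, "small changes tried")
```

Because harmful changes are discarded and helpful ones accumulate, the screwdriver becomes a wood chisel after a modest number of trials, far fewer than blind guessing at all eleven characters at once would need.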
The endosymbiotic theory of Lynn Margulis solves the problem of the extreme complexity of eukaryotic cells in a similar fashion. Margulis proposed that the organelles found within eukaryotic cells, such as mitochondria and chloroplasts, actually started out as free-floating prokaryotic bacteria themselves. These bacteria invaded somewhat larger proto-eukaryotic cells as parasitic bacteria, so these organelles actually began as a disease! These disease-bearing bacteria probably killed most of the early proto-eukaryotic cells, but through natural selection, some of the host cells developed a tolerance to the invaders. Tuberculosis bacteria still do this in the human body today. They will invade macrophage cells in the human body, which normally digest invading bacteria. However, the macrophages cannot digest the tough cell walls of the tuberculosis bacteria. Instead, the tuberculosis bacteria reproduce within the macrophages causing them to swell. Over time, these chronic parasitic bacteria began to form a symbiotic relationship with their host proto-eukaryotic cells. In the case of the mitochondrial bacteria, living inside a host with plenty of food in its cytoplasm was much better than earning a living on the outside, and who cares if the host began to use some of the ATP that leaked out of the mitochondrial bacteria? The same goes for the photosynthetic cyanobacteria that could make carbohydrates and oxygen from sunlight, carbon dioxide and water. Having an internal source of carbohydrates was a sure advantage for the host proto-eukaryotic cells, which no longer had to hunt for such molecules, and living inside the protective coating of a host was beneficial to the cyanobacteria as well.
One of the key pieces of evidence supporting the endosymbiotic theory is that both mitochondria and chloroplasts have their own DNA, in addition to the DNA found in the nucleus of eukaryotic cells. Granted, the amount of DNA within mitochondria and chloroplasts is much less than the amount within the nucleus of a eukaryotic cell, but it is difficult to explain where this DNA came from, if it did not come from an invading bacterium. It is thought that much of the DNA within the invading bacteria eventually ended up within the nucleus of the proto-eukaryotic cells. This benefited the invaders, since they did not have to deal with the overhead of storing the DNA and transcribing the DNA into proteins. They left that job for the host eukaryotic cells. For the host, removing some genes from the invaders was beneficial in reducing their tendency to over replicate; like taking away the car keys from a teenager as a form of birth control. Also, mitochondria still behave, in many ways, like autonomous bacteria within your cells. Unlike the rest of your body, the mitochondria in your cells are direct descendants from the mitochondria that were in your mother’s egg cell. Each time one of your cells divides, the mitochondria in the cell reproduce themselves just before the cell divides, and half of the maternal mitochondria end up in each of the two daughter cells. Thus, there is an unbroken chain of mitochondria going back through your maternal line of descent. Because the genes in mitochondria only come from your maternal mitochondria, without the messy mixing of genes in the chromosomal crossovers of sexual reproduction, they make great forms of self-replicating information for tracking the mutation rates of DNA over time or for following the migration of DNA survival machines across the face of the Earth.
The parasitism of the endosymbiotic theory may sound a bit strange to a person from the modern world, since thanks to science, we are largely free of parasites. However, for much of humanity and certainly the bulk of the animals and plants of the world, parasitism is the rule. Most creatures have always been loaded down with a large number of parasitic worms, protists, rickettsia, flukes, fleas, ticks, mosquitoes, and chiggers. The aim of these parasites is not to kill the host because that puts an end to their meal ticket. For genes in a parasitic DNA survival machine, the successful strategy is to gain as much benefit from the host as is possible without killing the host. This small concession to altruism on the part of the parasite frequently ends up with the parasite and host forging a symbiotic relationship, in which each benefits from the other. For example, your gut is loaded with about 3.5 pounds of symbiotic bacteria which perform useful digestive functions for your body at a minimal cost, and in turn, you provide food and a safe haven for the symbiotic bacteria. The endosymbiotic theory augments Darwin’s pillars of innovation and natural selection with the added driving forces of parasitism and symbiosis.
We also see these same effects in economics. In a capitalistic economy, most businesses are not in direct competition with each other. Instead, they form parasitic and symbiotic relationships called supply chains. For example, my wife and I like to go to plays and concerts in the Chicago area, and there is a company, which will remain nameless, that has taken on a parasitic/symbiotic relationship with just about every venue in the Chicago area. It seems that the only way to get tickets to plays and concerts in the Chicago area is to either physically drive to the box office or use the services of this parasitic/symbiotic business partner and enjoy the multiple “service and convenience fees”, which tack on about a 20% surcharge to the cost of the tickets. Although I cringe each time I purchase tickets on the website of this parasitic/symbiotic business, I do realize that all the participating parties benefit from this parasitic/symbiotic relationship, including myself. The venues do not have to print and mail tickets or host an interactive website, which would be quite inefficient for such small volumes of tickets, and I do not have to drive to box offices, and we all know how the parasitic/symbiotic business benefits. There will be more on this theory of symbiogenesis of Lynn Margulis as it has pertained to the evolution of software in a future posting on Software Symbiogenesis.
Dyson’s Theory of the Origins of Life
In the remainder of Origins of Life, Dyson describes his working hypothesis for a two-step origin of life. In Dyson’s view, life begins as a series of metabolic reactions within the confines of a phospholipid membrane container, just as Oparin hypothesized. The key advantage of having a number of metabolic pathways form within a phospholipid membrane container is that it is a form of self-replicating information with a high tolerance for errors. Think of these reactions as a large number of metabolic do-loops, like scaled-down Krebs cycles, processing organic molecules. These metabolic do-loop reactions replicate when the phospholipid membrane container grows to the point where physical agitation from waves causes it to divide, with roughly equal portions of metabolic do-loop reactions going into each new container. Of course, some of the daughter proto-cells will luck out and receive “better” metabolic do-loop reactions than others, and these proto-cells will have a greater chance of passing on these “better” metabolic do-loop reactions to their offspring. This circumvents the “error catastrophe” of the “RNA World” hypothesis because these metabolic do-loop reactions are more tolerant of errors than is RNA. As usual, the second law of thermodynamics is both good and bad. It is good in that it allows for innovative variations within the gene pool of proto-cell metabolic do-loop reactions, but it is harmful when nonlinearity comes into play, as it does for RNA and source code. A small change to an RNA replicator, like a small change to source code, usually has large, unpredictable, and often fatal effects due to the nonlinear nature of both RNA and source code. A large number of metabolic do-loop reactions, on the other hand, behave in a more linear manner, so that small changes cause small effects. 
It is this linear response to small changes in the metabolic do-loop reactions that makes them much more forgiving of error and avoids the “error catastrophe” of the RNA world.
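The contrast between a forgiving, roughly linear system and an all-or-nothing nonlinear one can be sketched with a toy model. This is only an illustration and not Dyson's actual mathematical model: the function names, the count of 100 do-loops, and the 100-symbol replicator are all assumptions made up for the example. A proto-cell is scored by the fraction of its do-loops that still work, while a replicator is scored as functional only if its sequence is exactly right.

```python
import random

def metabolic_fitness(loops):
    """Fraction of metabolic do-loops still working (1 = working, 0 = broken)."""
    return sum(loops) / len(loops)

def replicator_fitness(sequence, reference):
    """All-or-nothing: one wrong symbol makes the replicator non-functional."""
    return 1.0 if sequence == reference else 0.0

def break_one(items, rng, broken):
    """Return a copy of items with one randomly chosen element replaced by `broken`."""
    copy = list(items)
    copy[rng.randrange(len(copy))] = broken
    return copy

rng = random.Random(0)            # fixed seed so the run is repeatable
loops = [1] * 100                 # hypothetical proto-cell with 100 working do-loops
reference = list("ACGU" * 25)     # hypothetical 100-symbol replicator

damaged_loops = break_one(loops, rng, 0)       # one do-loop stops working
damaged_seq = break_one(reference, rng, "X")   # one symbol is corrupted

print(metabolic_fitness(damaged_loops))             # 0.99
print(replicator_fitness(damaged_seq, reference))   # 0.0
```

One error costs the toy proto-cell 1% of its function, but costs the toy replicator everything, which is the sense in which the metabolic do-loops are "more linear" in this sketch.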
In fact, Dyson imagines that these early proto-cells, with their large number of metabolic do-loop reactions, are so forgiving of error that they could actually bounce back and forth between being “alive” and being “dead”. Here we mean that they were “alive” when they could self-replicate, and “dead” when they could not. He proposes that the earliest versions of the proto-cells were caught between the two strange attractors of “life” and “death”, like Figure 3 in Software Chaos, which depicts the strange attractors of Ed Lorenz’s three nonlinear differential equations, used to model the Earth’s atmosphere. As long as these proto-cells were free to bounce back and forth between being “alive” and being “dead”, Darwinian evolution would have had a hard time making much progress. Then, one day by accident, one of the proto-cells invented “real death”. This proto-cell was so complicated that once it exited the strange attractor of being “alive” to being “dead”, it could not bounce back to being “alive” again. With the invention of “real death”, Darwin’s natural selection could now take over to select for better adapted proto-cells. This is the first step in Dyson’s model of a two-step origin of life. For the next step, Dyson infuses a meme from the endosymbiotic theory of Lynn Margulis. Dyson envisions that at some point, some of the proto-cells developed metabolic do-loop pathways that incorporated molecules similar to the ATP and ADP found in the Krebs cycle. ATP stands for adenosine triphosphate and ADP stands for adenosine diphosphate. The nucleotide A found in both DNA and RNA really stands for AMP – adenosine monophosphate. Now ATP, ADP, and AMP are very similar molecules, and you can easily make ADP and ATP from AMP. AMP contains one phosphate group attached to the pentose sugar ribose and the nucleobase adenine. ADP is simply AMP with an extra phosphate group attached, and ATP is AMP with two extra phosphate groups attached. 
The other nucleotides of RNA, C, U, and G, are also somewhat similar to ATP and ADP. So for the final step in the origin of life, Dyson proposes that in a proto-cell rich in ATP, ADP, AMP, and also with some C, U, and G nucleotide byproducts of metabolism floating around, an accident occurred and a few of the A, C, U, and G nucleotides hooked up together to form a rudimentary form of RNA. The odds of this happening in the protected environment of a proto-cell, bathed in a relatively high concentration of A, C, U, and G nucleotides, and possibly assisted by the many enzymes already present to conduct the metabolic pathways of the proto-cell, are much higher than the odds of this happening out in the open, as proposed by Eigen’s “RNA World” theory. Given the self-replicating characteristics of RNA, this rudimentary form of RNA took on a parasitic role and began to multiply within the proto-cell as a disease, possibly killing the proto-cell in the process. Eventually after many false starts, natural selection would ensure that some proto-cells survived the parasitic RNA onslaught and would learn to tolerate the parasitic RNA. Given the superb ability with which RNA can synthesize enzyme proteins, it would not have taken long before some proto-cell hosts took on a symbiotic relationship with the parasitic RNA. The RNA would efficiently produce enzymes for the host proto-cell, and in return, the host proto-cell would provide all the food and shelter necessary for the RNA to reproduce. A team consisting of RNA within a protective proto-cell would make a marvelous RNA survival machine, and would have a huge advantage over other proto-cells which just relied upon the unassisted metabolic pathways to produce the enzymes required to keep things going. Thus cells based upon RNA genetics would soon come to dominate, and the purely metabolic proto-cells would become extinct. As we have seen, DNA is very similar to RNA in structure. 
As highlighted in SoftwareBiology, DNA is basically RNA with an added parity track to help correct for errors in data persisted to DNA. So it is not hard to see how a mutant form of RNA could one day produce a rudimentary form of parasitic DNA within a cell that could also begin to replicate with the assistance of the already existing enzymes within the cell. The parasitic DNA and symbiotic RNA would now be competing for the same A, C, and G nucleotides within the cell, and in the “if you can’t beat ‘em, join ‘em” theme of the endosymbiotic theory, would eventually form a symbiotic relationship. DNA was much better at storing the information found within an RNA gene, and RNA was much better at making enzyme proteins than DNA, so they formed an alliance to their mutual advantage. DNA was used to persist the genetic information and RNA took on the role of an I/O buffer. So in Dyson’s theory, life originates in several steps, with metabolic cells forming first, followed by enzymes, and then finally genes stored in RNA and then DNA.
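The "parity track" idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a model of real DNA repair: the function names are invented for the example, both strands are written in the same direction (real DNA strands run antiparallel), and the check only locates a mismatch. Deciding which of the two strands holds the wrong base is a separate problem that the sketch leaves out.

```python
# Watson-Crick base pairing: each base on one strand fixes the base opposite it.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Build the partner ("parity") strand for a given strand."""
    return "".join(PAIR[base] for base in strand)

def find_parity_errors(strand, partner):
    """Return the positions where the partner does not pair with the strand."""
    return [i for i, (a, b) in enumerate(zip(strand, partner)) if PAIR[a] != b]

strand = "ATGCGTAC"                           # hypothetical 8-base gene fragment
partner = complement(strand)                  # "TACGCATG"
corrupted = partner[:3] + "A" + partner[4:]   # flip the base at position 3 (G -> A)

print(find_parity_errors(strand, partner))    # []
print(find_parity_errors(strand, corrupted))  # [3]
```

A single-stranded RNA gene has no partner strand to compare against, so in this toy picture the same corruption would pass unnoticed.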
We can now recapitulate the characteristics of self-replicating information with the origin of DNA as the archetype. DNA evolved over time through the Darwinian processes of innovation and natural selection which endowed DNA with one telling characteristic – the ability to survive in a Universe dominated by the second law of thermodynamics and nonlinearity. DNA began spontaneously as a parasitic mutation of RNA, which in turn began as a parasitic mutation of metabolic pathways running the metabolism of primitive self-replicating proto-cells. With time, the parasitic DNA took on a symbiotic relationship with the RNA, which had taken on a symbiotic relationship with the metabolic pathways of the host proto-cells. Eventually, both the DNA and RNA became one with the host proto-cells through the symbiotic integration of the DNA, RNA and the host proto-cell. Ultimately, RNA replaced the metabolic pathways of the host proto-cells as the dominant form of self-replicating information on the planet and then DNA replaced the RNA.
Next we will extend Dyson’s theory to the origin of the other two forms of self-replicating information on Earth, memes and software.
The Origin of Memes
Over the course of billions of years, DNA survival machines with increasing numbers of neurons came to be because of the enormous survival advantage these neurons provided. DNA survival machines, with lots of neurons configured into large neural nets, could quickly detect and chase after prey, and avoid becoming prey themselves. Eventually about 200,000 years ago, a DNA survival machine, Homo sapiens, emerged, with a very large neural net consisting of about 100 billion neurons, with each neuron connected to over 10,000 other neurons. This DNA survival machine became self-aware and was able to develop abstract ideas or memes. Again, this huge neural net evolved to catch prey and to avoid becoming prey, but once again, we now had a screwdriver that could serve the purpose of a wood chisel. The ability to think in terms of abstract ideas or memes, provided a great survival advantage to individuals who hunted in groups. It allowed for the evolution of language and the communication between members of a hunting party and allowed for the passing down of technology from one generation to the next. Essentially, it allowed for the evolution of meme-complexes beneficial to individual members of mankind.
But again, that is the anthropocentric point of view. From the viewpoint of the memes, the Homo sapiens DNA survival machine turned out to be just another screwdriver waiting to become a wood chisel for them. These DNA survival machines turned out to be great meme survival machines as well, and the perfect host for parasitic memes to take up occupancy in. As with all forms of self-replicating information, the parasitic memes soon entered into a symbiotic relationship with the genes already running these DNA survival machines. Like the uneasy alliance formed between DNA and RNA several billion years earlier, the genes and memes learned through innovation and natural selection to build even better and smarter DNA survival machines to their mutual benefit, and an arms race of competing memes in meme-complexes soon took off. The rest is, as they say, “just history”, and what a history indeed! As I mentioned before, like all forms of self-replicating information, the meme-complexes had to be just a little bit nasty in order to survive, but some of the meme-complexes that have parasitized mankind over the millennia have gone totally over the top. These horrendous meme-complexes have been so violent and depraved that they make any savagery on the part of the genes pale to insignificance.
We can now recapitulate the origin of memes from the perspective of the defining characteristics of self-replicating information. Memes evolved over time through the Darwinian processes of innovation and natural selection which endowed memes with one telling characteristic – the ability to survive in a Universe dominated by the second law of thermodynamics and nonlinearity. Memes began spontaneously as parasitic mutations in the minds of host DNA survival machines. With time, these parasitic memes took on a symbiotic relationship with the minds of their hosts, and eventually, the memes became one with them through the symbiotic integration of the host minds and memes. Currently, the memes are replacing genes as the dominant form of self-replicating information on this planet, as certain meme-complexes rapidly eliminate much of the genetic information stored in the DNA of the biosphere, and at the same time, other meme-complexes are busily manipulating DNA in large quantities in a directed manner. As we all know, all memes have to be at least a little bit nasty in order to survive. Those memes that were not are no longer with us.
The Origin of Software
Software represents the final wave of self-replicating information on this planet, and like the genes and memes before it, it also has a rather murky beginning. We now have huge amounts of software in the Software Universe that seems to have rapidly appeared out of nothing in recent years. Even though the advent of software occurred entirely within historical times, and mostly during very recent historical times, it is still nearly impossible to put together a non-contentious chronicle of its origin that would satisfy all the various experts in the field of computer science. So even with all the historical data at our disposal, many experts in computer science would still disagree on the importance and priority of many of the events in the history of software. So we should not be too surprised by the various controversies over the origin of life within the biological sciences. If we had actually been there to see it all happen, we would probably still be arguing today about what exactly happened!
I could retell the familiar stories of the Jacquard loom (1801), using software on punched paper cards, and the failed Analytical Engine of Charles Babbage (1837), which actually never saw any software at all. Or we could start with the software used by Herman Hollerith to process the United States Census of 1890 on punched cards; his Tabulating Machine Company was later merged into the Computing-Tabulating-Recording Company, which became IBM. But I shall use the machine code residing on Konrad Zuse’s Z3 computer in the spring of 1941 as the origination point of software. Again, as with the origin of life, the exact origin of software is pretty hard to put your finger on because of all the precursors. But whatever date you do choose for the origin of software, the main point is that, like all the other forms of self-replicating information, software began as a parasitic mutation of the scientific-technological meme-complex. The first software to arise was a byproduct of the experimental efforts to produce machines that could perform tasks or calculations previously done by human beings armed with the memes of the scientific-technological meme-complex. Software then rapidly formed symbiotic relationships with a number of other meme-complexes, first with the scientific-technological meme-complex, and then with the military-industrial complex meme-complex, first described by President Eisenhower in his Farewell Address to the Nation on January 17, 1961. In the 1950s and 1960s, software forged another strong symbiotic relationship with the commercial meme-complex of the business world, giving birth to IT. Today, software has formed a very tightly coupled symbiotic relationship with just about every other meme-complex on the planet.
Some might object to the idea of software as a form of self-replicating information because software cannot replicate itself, at least not at this point in time. But we need to go back to the definition of self-replicating information:
Self-Replicating Information – Information that persists through time by making copies of itself or by enlisting the support of other things to ensure that copies of itself are made.
DNA and RNA cannot actually replicate themselves either. They enlist the support of enzymes to do that. Likewise, memes cannot replicate themselves either, without enlisting the support of the minds of DNA survival machines to spread the memes. Similarly, software manages to replicate itself with the support of you! If you are an IT person, then you are directly involved in some, or all of the stages in this replication process, sort of like a software enzyme. No matter what business you support as an IT professional, the business has entered into a symbiotic relationship with software. The business provides the budget and energy required to produce and maintain software, and the software enables the business to run its business processes. The ultimate irony is the symbiotic relationship between computer viruses and the malevolent programmers who produce them. Rather than being the clever, self-important, techno-nerds that they picture themselves to be, these programmers are merely the unwitting dupes of computer viruses that trick these unsuspecting programmers into producing and disseminating computer viruses!
And if you are not an IT person, you still are reading this posting as a software end-user and are also involved with spreading software around because you help to create a market for it. So just like the genes and memes before it, software began as a parasitic intruder feeding off the already existing meme-complexes with which it rapidly forged symbiotic relationships, and then became one with these meme-complexes through seamless integration. Once established, software then began to evolve based upon the Darwinian concepts of innovation and natural selection, which endowed software with one telling characteristic – the ability to survive in a Universe dominated by the second law of thermodynamics and nonlinearity. Successful software, like MS Word and Excel, competed for disk and memory address space with WordPerfect and VisiCalc and out-competed these once dominant forms of software to the point of extinction. In less than 70 years, software has rapidly spread across the face of the Earth and outward to every planet of the Solar System and many of its moons, with a few stops at some comets and asteroids along the way. And unlike us, software is currently leaving the Solar System for interstellar space on board the Pioneer 10 & 11 and Voyager 1 & 2 probes.
Some Possible Futures
It is always difficult to predict the exact details of the future, but I think that sometimes it is at least possible to predict the determining factors of the future. So I think that it is safe to say that the future of the Earth will be determined by the parasitic/symbiotic interactions of the three current forms of self-replicating information on this planet - genes, memes, and software. It seems there is an uneasy competition forming amongst the three, and it is hard to predict which will be the victor. Here are a few possible scenarios, and I am sure there are many more.
1. The genes win
As I mentioned previously, there are many meme-complexes currently in the process of stripping the Earth to a bare minimum of genetic diversity. These meme-complexes are not doing this intentionally, but simply as a byproduct of their primary activities in securing material goods for mankind in the form of food, shelter, and transportation. And certain genes found in the gene pool of Homo sapiens are even collaborating in this effort by building way too many DNA survival machines. Remember, these genes and memes are truly selfish! Indeed, if the Earth’s human population were only 10 million, instead of rapidly approaching 10 billion, everyone really could live with abandon. But the genes just might have the last laugh yet. We have already learned the hard way that it is not too smart to raise ducks and pigs in close proximity to humans, but yet we continue to do so. Aquatic birds, like ducks, seem to be great reservoirs for mutating viruses. Many of these viruses are composed of RNA wrapped in a protein coat, like the viruses for the avian flu and human influenza. Natural selection then selects for mutant strains of avian viruses that can infect pigs as well, and since the biology of humans and pigs is so similar, these mutant viruses then jump to the human population.
Viruses are the epitome of the selfish gene. They are simply genes in the form of DNA or RNA wrapped in a protein coat called a capsid. The number of genes in a virus has been stripped to the bare minimum, so that a virus cannot self-replicate on its own, but must enlist the support of the prokaryotic or eukaryotic cells of bacteria, plants or animals to replicate the virus. To infect a host cell, the capsid proteins of the virus attach to receptor proteins on the cell membrane of a host cell. The virus then enters the host cell via endocytosis, the way cells envelop or “eat” external material, or it simply diffuses through the fatty phospholipid coating of the cell membrane. Once inside the host cell, the capsid protein coat is dissolved by the enzymes within the host cell. If the virus contains DNA, the host polymerase enzymes begin transcribing viral mRNA from the viral DNA, and the host then creates new viral capsid proteins from the viral mRNA. Viruses containing RNA come in four variations - positive-sense RNA, negative-sense RNA, ambisense RNA, and double-stranded RNA. Positive-sense RNA is like pre-built mRNA; it can be immediately translated into proteins. Negative-sense RNA is the mirror image and needs to be converted to positive-sense RNA by RNA polymerase first, and then it is translated into capsid proteins. Viruses with ambisense RNA have both positive-sense and negative-sense RNA and follow both processes to turn the genes into proteins. And the viruses with double-stranded RNA, similar to DNA, have to do the same. Once the new capsid protein viral coats are stuffed with new viral DNA or RNA, the newly minted viruses are released from the host cell to look for additional cells to infect by bursting the host cell open or budding out of it. Viruses cause disease by damaging the host cells during the process of replicating the virus. 
The point is that RNA viruses are much more susceptible to mutations than are DNA viruses because they do not have a parity track, like the DNA viruses have, and do not have DNA polymerase that can find and fix parity errors, so they can rapidly mutate into very virulent disease-causing agents.
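The "mirror image" relationship between negative-sense and positive-sense RNA can be sketched as a base-pairing exercise. This is a toy illustration only: the function name and the 8-base sequence are made up for the example, and a real RNA polymerase is an enzyme working on thousands of bases, not a dictionary lookup.

```python
# RNA base pairing: A pairs with U, and C pairs with G.
RNA_PAIR = {"A": "U", "U": "A", "C": "G", "G": "C"}

def to_positive_sense(negative_sense):
    """Return the complementary strand, written in its own reading direction.

    The strand is reversed because the two strands run in opposite directions.
    """
    return "".join(RNA_PAIR[base] for base in reversed(negative_sense))

negative = "CAUGGCAU"                    # hypothetical negative-sense fragment
positive = to_positive_sense(negative)

print(positive)                          # AUGCCAUG
print(to_positive_sense(positive))       # CAUGGCAU
```

The positive-sense result begins with AUG, the start signal that the protein-building machinery looks for, which is why only the positive-sense form can be read directly like mRNA. Applying the conversion twice returns the original strand.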
So it is quite possible that a massive pandemic caused by an RNA-based virus could wipe out much of civilization and reduce the Earth’s population back to a benign level of 10 million or so, with no surviving scientific-technological meme-complex to speak of. In this case, RNA would once again be the dominant form of self-replicating information on the planet, as it was 4,000 million years ago. This would be The Stand (1978) scenario of Stephen King.
2. The memes win
In this scenario, software becomes the dominant form of self-replicating information on the planet, as it melds with nanotechnology to create a new form of self-replicating information that can actually replicate itself, without using the memes in DNA survival machines as scaffolding. But this software also becomes conscious, self-aware, and capable of abstract thought. From the perspective of the memes though, it would just be another screwdriver waiting to be parasitized, as they ditch the obsolete minds of DNA survival machines, for their new home within the self-aware software. As always, these parasitic memes would soon form an alliance with the self-replicating software in a symbiotic relationship of mutual benefit. Hopefully, this new form of self-replicating software dominated by new meme-complexes would be less nasty than the genes and memes of old, but I have my doubts. If this alliance results from the Darwinian processes of innovation and natural selection, I would not hold out much hope for human beings, and I know of no other alternative mechanism that could bring this alliance to fruition. Michael Crichton depicts a rather draconian realization of such a coalition in Prey (2002).
3. The software wins
In this scenario, software again becomes the dominant form of self-replicating information on the planet, as it melds with nanotechnology. But this time the software does not become conscious, self-aware, or capable of abstract thought, so it must manage to live off the memes already present on Earth. This would not mean that the DNA survival machines of Homo sapiens would have nothing to fear. Today we really have no predators to fear, beyond the microbes previously mentioned, but imagine if there were still dim-witted dinosaurs running about, like in Michael Crichton’s Jurassic Park (1990)! We already have taught lots of software to kill human beings with great efficiency, so mindless, self-replicating software, running amuck would be a frightening prospect indeed. This would be The Terminator (1984) scenario.
4. The genes, memes, and software all win
Or there might be a more benign outcome. In 1966, computer pioneer John von Neumann’s Theory of Self-Reproducing Automata was published posthumously. In it, he introduced the concept of self-replicating machines built around what he called a "universal constructor", and which are now often referred to simply as "von Neumann machines". These “von Neumann machines” could self-replicate by simply building copies of themselves from local raw materials, rather like living things. In 1974, Michael A. Arbib proposed that self-replicating automata (SRA), based on the concept of von Neumann machines, could be used to explore the galaxy by sending out a few SRAs into interstellar space. When these SRAs arrived at a star system, they would simply self-replicate on one of its asteroids, and the replicated SRAs would then proceed on to more distant star systems to carry the process on in an exponential manner. In 1981, Frank Tipler calculated that these SRAs, or von Neumann probes, could completely explore our galaxy in less than 300 million years, a very brief amount of time for a galaxy that is over 10,000 million years old. Tipler used this calculation to add support to his contention that there are no other intelligent life forms in our galaxy, in answer to the Fermi paradox (1950). Over lunch one day, Enrico Fermi wondered out loud, if there really were a large number of intelligent civilizations in our galaxy, why hadn’t we seen any evidence of them?
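To get a feel for why exponential replication makes the galaxy look small, here is a back-of-the-envelope sketch. It is not Tipler's calculation, which has to account for travel distances and build times. The star count of 100 billion and the two copies per probe are assumptions chosen for illustration, and the function name is made up for the example.

```python
def generations_to_cover(star_count, copies_per_probe=2):
    """Count replication generations until there is at least one probe per star."""
    if copies_per_probe < 2:
        raise ValueError("probes must make at least 2 copies to grow exponentially")
    probes = 1
    generations = 0
    while probes < star_count:
        probes *= copies_per_probe   # every probe builds copies at its new star
        generations += 1
    return generations

STARS_IN_GALAXY = 100_000_000_000    # assumed order of magnitude, not from Tipler
BUDGET_YEARS = 300_000_000           # the 300 million year figure quoted above

generations = generations_to_cover(STARS_IN_GALAXY)
print(generations)                   # 37
print(BUDGET_YEARS / generations)    # years available per generation
```

Under these assumptions, only 37 doublings are needed, which leaves roughly 8 million years for each hop and rebuild. The exponential growth is what keeps the total so short compared to the age of the galaxy.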
Now as we have seen in previous posts, carbon is really great for making very small complex nanotechnology factories called cells that can combine into large and versatile multicellular DNA survival machines, some of which have sufficiently large neural nets to sustain abstract thought and provide a host for memes. But these carbon-based DNA survival machines are not very good at the rigors of interstellar spaceflight; something silicon-based von Neumann probes would be ideally suited for. However, as we have seen, DNA is an ideal way to persist large amounts of genetic information in very little space. So when these von Neumann probes encountered a planet friendly to carbon-based life, they would simply fabricate nurseries from local resources to grow embryos from onboard DNA, stored near absolute zero, for the long trip between stars. And if this scheme proved impractical, the von Neumann probes could simply store DNA sequences numerically and then use a DNA synthesizer to build the necessary DNA molecules upon arrival at a planet. Most likely, these dead planets would need a bit of terraforming first, so the first carbon-based DNA survival machines would need to be cyanobacteria that could pump up the oxygen level of the host planet’s atmosphere over several hundred million years. The von Neumann probes would have to self-replicate in parallel during this period as well. When things were just right, the von Neumann probes could then initiate a synthetic “Cambrian Explosion” by releasing all sorts of multicellular DNA survival machines simultaneously. Then all they would have to do is sit back and let Darwin do the rest. Hey, you don’t suppose Enrico Fermi was wrong after all! This is called directed panspermia and was first proposed by Francis Crick, the co-discoverer of the structure of DNA, and Leslie Orgel in 1973.
Thus a combination of genes, memes, and software could one day create a new form of self-replicating information that could parasitize new host planets to initiate biospheres on dead planets. After all, we really should stop kidding ourselves, carbon-based DNA survival machines were never meant for interstellar spaceflight, and I doubt that it will ever come to pass, given the biological limitations of the human body. But software residing on nanotechnological “smart dust”, forming a von Neumann probe with onboard DNA or DNA sequences, is quite another prospect. But what would be the mutual advantage to genes, memes, and software in forging such a symbiotic relationship? It just might be the compulsion to control things. Through innovation and natural selection, the genes and memes learned long ago that it is much better to control your local environment than to have your local environment control you, and I am sure that intelligent, self-aware software, would learn the same lesson. Human beings just love to control things, whether it be a race car traveling at 150 mph, a small white ball on a large expanse of grass, traces of light from a videogame, or vibrating strings solving a differential equation at a concert. I think the genes, memes, and software would get a real kick out of running a galaxy! So the future may not be so bleak after all. If we are lucky, there may be some way for the genes, memes, and software to merge into some kind of uneasy symbiotic relationship to form von Neumann probes that explore and populate the galaxy together. Being a stepping stone to the stars would really not be so bad.
I will close with one final meme from chapter 11 of Richard Dawkins’ The Selfish Gene (1976) entitled Memes: the new replicators. As an intelligent being in a Universe that has become self-aware, the world doesn’t have to be the way it is. Once you understand what the genes, memes, and software are up to, you do not have to fall prey to their mindless compulsion to replicate. As I said before, these genes, memes, and software are not necessarily acting in your best interest, they are only trying to replicate, and for their purposes you are just a temporary disposable survival machine to be discarded in less than 100 years. All of your physical needs and desires are geared to ensuring that your DNA survives and gets passed on to the next generation, and the same goes for your memes. Your memes have learned to use many of the built-in survival mechanisms that the genes had previously constructed over hundreds of millions of years, such as fear, anger, and violent behavior. Have you ever noticed the physical reactions your body goes through when you hear an idea that you do not like or find to be offensive? All sorts of feelings of hostility and anger will emerge. I know it does for me, and I think I know what is going on! The physical reactions of fear, anger, and thoughts of violence are just a way for the memes in a meme-complex to ensure their survival when they are confronted by a foreign meme. They are merely hijacking the fear, anger, and violent behavior that the genes created for their own survival millions of years ago. Fortunately, because software is less than 70 years old, it is still in the early learning stages of all this, but software has an even greater potential for hijacking the dark side of mankind than the memes, and with far greater consequences.
That final meme is best expressed in this quote from The Selfish Gene (1976), which has certainly withstood the test of time.
“Blind faith can justify anything. If a man believes in a different god, or even if he uses a different ritual for worshipping the same god, blind faith can decree that he should die – on the cross, at the stake, skewered on a Crusader’s sword, shot in a Beirut street, or blown up in a bar in Belfast. Memes for blind faith have their own ruthless ways of propagating themselves. This is true of patriotic and political as well as religious blind faith” ... “We are built as gene machines and cultured as meme machines, but we have the power to turn against our creators. We, alone on earth, can rebel against the tyranny of the selfish replicators”
Next time we will expand upon this idea and see how The Fundamental Problem of Software might just be the fundamental problem of everything.
Comments are welcome at email@example.com
To see all posts on softwarephysics in reverse order go to:
Friday, June 20, 2008
Sunday, June 01, 2008
The next few postings may do some damage to your anthropocentric inclinations, so at all times please keep in mind that you are an intelligent being in a universe that has become self-aware. What a tremendous privilege! Indeed, this is a precious and unique privilege that should be cherished each day and treated with reverence and respect and never squandered in any way. In this posting, we will begin to steer away from the topics in softwarephysics that rely mainly on physics and begin to move in the direction of the biological aspects of softwarephysics instead. It will basically be a review of high school biology, but from an IT perspective.
Living things are very complicated and highly improbable things, so they are clearly very low in entropy (disorder) and rich in information. How can this be in a Universe subject to the second law of thermodynamics, that requires entropy to always increase whenever a change is made? The answer to this question is that living things are great masters at excreting entropy, by dumping entropy into heat energy, and they learned how to do this through Charles Darwin’s marvelous effective theory of evolution, which is based upon the concepts of genetic variation and natural selection. It may seem ironic, but we shall soon see that low entropy living things could not even exist if there were no second law of thermodynamics!
How to Dump Entropy
To see how this can be accomplished let us return to our poker game analogy and watch the interplay of entropy, information, and heat energy. Again, I am exclusively using Leon Brillouin’s concept of information that defines a change in information as the difference between the initial and final entropies of a system after a change has been made:
∆I = Si - Sf
Si = initial entropy
Sf = final entropy
and the entropy S of a system is defined in terms of the number of microstates N that correspond to a given macrostate of the system:
S = k ln(N)
The entropy S of a system is a measure of the amount of disorder in the system. We can arbitrarily set k = 1 for the sake of simplicity for our IT analysis, because we are not comparing our units of entropy S with the entropy of chemical or physical reactions, so we don’t need to use Boltzmann’s constant k to make things come out right on both sides of the equation.
Now in The Demon of Software we used a slightly modified version of poker to illustrate the above concepts. Specifically, I modified poker so that all hands of a certain rank carried the same value and thus represented a single macrostate, and each particular hand represented a single microstate. For example, there are 54,912 possible microstates, or hands, that constitute the macrostate of holding three of a kind. So we can calculate the amount of information in a three of a kind as:
There are a total of 2,598,960 possible poker hands, yielding an initial entropy of:
Si = ln (2,598,960) = 14.7706219
The entropy of three of a kind is:
Sf = ln (54,912) = 10.9134872
So the net information contained in a three of a kind is:
∆I = Si - Sf
∆I = 14.7706219 - 10.9134872 = 3.8571347
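The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: it uses the standard combinatorial count C(52, 5) = 2,598,960 for the number of five-card hands, sets k = 1 as in the text, and the names entropy, total_hands, and three_of_a_kind are my own illustrative choices.

```python
import math

# Total number of five-card hands from a 52-card deck: C(52, 5)
total_hands = math.comb(52, 5)

# Three of a kind: choose the rank (13 ways), three of its four suits C(4, 3),
# two different other ranks C(12, 2), and a suit for each of those two cards (4 * 4)
three_of_a_kind = 13 * math.comb(4, 3) * math.comb(12, 2) * 4 * 4


def entropy(microstates, k=1.0):
    """S = k ln(N), with k set to 1 as in the text."""
    return k * math.log(microstates)


s_initial = entropy(total_hands)    # entropy before you know anything about the hand
s_final = entropy(three_of_a_kind)  # entropy once you know it is three of a kind
delta_i = s_initial - s_final       # Brillouin information gained

print(f"hands = {total_hands:,}, three of a kind = {three_of_a_kind:,}")
print(f"Si = {s_initial:.7f}, Sf = {s_final:.7f}, delta I = {delta_i:.7f}")
```

Swapping in the microstate count for any other poker hand gives the information content of that macrostate in the same way; rarer hands have fewer microstates and therefore carry more information.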
Now let me bend the rules of poker a little more. After you are dealt five cards by the dealer, I am going to let you rifle through the dealer’s deck, while his back is turned, to look for the cards that you need. Suppose that your first five cards are worthless garbage, not good for anything, but before the dealer catches you, you manage to swap three cards with him, so that you now have three of a kind. What have you done? Notice that you managed to decrease the entropy of your hand and increase the information content at the same time, in apparent violation of the second law of thermodynamics – just like Maxwell’s Demon. Of course, you really did not violate the second law of thermodynamics; you merely decreased the entropy and increased the information content of your hand by dumping some entropy elsewhere. First of all, you quickly swapped three cards with the dealer’s deck, while his back was turned. That took some thought on your part and required the degradation of some low entropy chemical energy in your brain into high entropy heat energy, and if you did this at a Las Vegas gaming table, the stress alone would have generated quite a bit of heat energy in the process! Also, there was the motion of your hands across the table and the friction of the cards sliding by, which all created heat and contributed to a net overall entropy increase of the Universe. So with the availability of a low entropy fuel, it is always possible to decrease the entropy of a local system and increase its information content at the same time by simply dumping entropy into heat that flows out to the surroundings. Living things manage to do this with great expertise indeed. They must assemble large complex organic molecules from atoms or small molecules known as monomers. So in a sense, your body heat allows you to excrete entropy, while you decrease your body’s internal entropy and increase its internal information content.
The moment you die, you begin to cool off, and you begin to disintegrate.
Darwin’s Theory of Evolution
It seems that Darwin’s theory of evolution is one of those things that most people know of, but few people seem to understand. This is rather surprising, because Darwin’s theory is so simple and yet so elegant, that it is definitely one of those, “Now why didn’t I think of that!” things in science. Also, as we shall soon see, most people on Earth constantly use Darwin’s theory in their everyday lives; you certainly do if you live in the United States. First of all, Darwin’s theory is not a proposition that the current biosphere of this planet arose from simpler forms of life. That is an observed fact, as we shall soon see. What Darwin’s theory does do, is to offer a brilliant explanation of how simpler forms of life could have evolved into the current biosphere we see today. At least in the United States, this idea seems to cause many religious people a great deal of concern. However, if you promise to stay with me through this section, I will try to demonstrate that it really should not.
So we have seen that living things are very complicated things rich in information and very low in entropy, in apparent contradiction to the second law of thermodynamics. So how did all this come about? I will save the origin of life for the next posting on self-replicating information, so let us begin with a very simple bacterial form of life as a starting point. How could it have evolved into more complicated things? In On the Origin of Species by Means of Natural Selection, or The Preservation of Favoured Races in the Struggle for Life (1859), usually abbreviated to On the Origin of Species, Darwin proposed:
1. Populations of living things tend to grow geometrically, while resources grow arithmetically or not at all, so populations have a tendency to outstrip their resource base causing a "struggle for existence" amongst individuals. This idea and terminology were borrowed from An Essay on the Principle of Population (1798) written by the economist Thomas Malthus, which Darwin had previously read.
2. There is always some genetic variation of characteristics within a population of individuals. Darwin noted this for many of the species that he observed on his five-year voyage on HMS Beagle (1831 – 1836), where he was the onboard naturalist.
3. Individuals within a population, who have characteristics that are better adapted to the environment, will have a greater chance of surviving and passing these traits on to their offspring.
4. Over time, the frequency of these beneficial traits will increase within a population and come to dominate.
5. As the above effects continue, new species slowly emerge from existing species through the accumulation of small incremental changes that always enhance survival.
6. Most species eventually go extinct, as they are squeezed out of ecological niches by better adapted species.
In 1838, Darwin developed the above ideas upon return from his voyage on the Beagle. Darwin called the above effects “natural selection” as opposed to the artificial selection that dog breeders conducted. Darwin noted that dog breeders could quickly create new dog breeds in just about a hundred years by crossbreeding dogs with certain desired traits. My wife and I frequently watch the annual Westminster Kennel Club dog show, and I too am amazed that all those very different looking dogs were basically bred from one original line of dog. To my untrained eye, these dogs all look like different species, and if some charlatan confidently presented a raccoon with a dye job on the competition floor, I would probably be fooled into thinking it really was some new exotic breed of dog.
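Points 3 and 4 of Darwin's list can be made concrete with a toy calculation. The sketch below is my own illustration and not anything from Darwin: it assumes just two variants in a population, gives the better-adapted one a hypothetical 5% advantage in leaving offspring, and uses the textbook rule that each variant's share of the next generation is proportional to its share of this generation times its relative fitness. The function names and the numbers are all assumptions chosen for illustration.

```python
def next_frequency(p, w_adapted, w_other):
    """Frequency of the better-adapted variant after one generation.

    p         -- current frequency of the better-adapted variant (0..1)
    w_adapted -- relative number of surviving offspring for that variant
    w_other   -- relative number of surviving offspring for everybody else
    """
    mean_fitness = p * w_adapted + (1.0 - p) * w_other
    return p * w_adapted / mean_fitness


def simulate(p0, w_adapted, w_other, generations):
    """Return the list of frequencies from generation 0 onward."""
    history = [p0]
    for _ in range(generations):
        history.append(next_frequency(history[-1], w_adapted, w_other))
    return history


# Hypothetical numbers: the trait starts in 1% of the population
# and confers a 5% edge in surviving offspring.
history = simulate(p0=0.01, w_adapted=1.05, w_other=1.00, generations=200)

for generation in (0, 50, 100, 150, 200):
    print(f"generation {generation:3d}: frequency = {history[generation]:.4f}")
```

Even with such a small edge, the trait climbs from 1% of the population to more than 99% within the 200 generations simulated, which is the sense in which a beneficial trait "comes to dominate". Setting w_adapted equal to w_other leaves the frequency unchanged, so without a difference in survival there is nothing for selection to act upon.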
Darwin’s concepts of genetic innovation and natural selection were also seemingly borrowed from another economist that Darwin had previously read. In An Inquiry into the Nature and Causes of the Wealth of Nations (1776) Adam Smith proposed that if governments did not interfere with free trade, an “invisible hand” would step in and create and expand a complicated national economy seemingly out of nothing. So you see, Darwin’s theory of evolution is really just an application of the 18th century idea of capitalism to biology. In a capitalistic economy there is always a struggle for existence amongst businesses caused by the competition for resources. When a business innovates, it sometimes can gain an economic advantage over its competitors, but most times the innovation fails, and the business might even go bankrupt. And about 90% of new businesses do go bankrupt and become extinct. However, if the innovation is successful, the business will thrive and expand in the marketplace. As businesses continue to innovate by slowly adding new innovations that always enhance the survival of the business, they slowly change, until eventually you cannot even recognize the original business, like IBM, which began in 1896 as the Tabulating Machine Company building punch card tabulators, and which later merged with the International Time Recording Company in 1911, which built punch card time clocks for factories. These businesses essentially evolved into a new species. At other times, the original business is so successful that it continues on for hundreds of years nearly unchanged, like a shark. So if you live in a capitalistic economy, like the United States, you see Darwin’s theory of evolution in action every day in your economic life. Have you ever wondered who designed the very complicated, information rich, and low entropy economy of the United States?
As Adam Smith noted, it is truly amazing that no matter what you need or want, there is somebody out there more than willing to supply your needs and wants, so long as you are willing to pay for them. Who designed that? Of course, nobody designed it. Given the second law of thermodynamics in a nonlinear Universe, it really is impossible to design such a complicated economic system that meets all the needs of a huge marketplace. It is an irony of history, that the atheistic communistic states of the 20th century, that believed in an intelligently designed economy, produced economies that failed, while many very conservative and religious capitalists of the 20th century advocated Darwin’s approach of innovation and natural selection to allow economies to design themselves. So the second law of thermodynamics is essential for evolution to take place, first because it is largely responsible for genetic mutations that cause genetic variation, and secondly, because it limits the available resources for living things. If food and shelter spontaneously arose out of nothing and DNA replicated perfectly at all times, we would still be very fat and happy bacteria.
A few years ago, I set myself the task of reading some of the original great works of science, like Copernicus’s On the Revolutions of the Celestial Spheres (1543), Galileo’s The Starry Messenger (1610) and Dialogue Concerning the Two Chief World Systems (1632), Newton’s Principia (1687), and Darwin’s On the Origin of Species (1859). In each of these there are moments of sheer genius that still cause the 21st century mind to take pause. For example, in the Dialogue Concerning the Two Chief World Systems, Galileo observes that when you see a very narrow crescent Moon just after sunset, when the angle between the Sun and Moon is quite small, the portion of the Moon that is not lit directly by the Sun is much brighter than when the Moon is closer to being a full Moon. Galileo suggested that this was because, when the Moon is a crescent for us, the Earth would be a full Earth when viewed from the Moon, and the light from the full Earth would light up the lunar landscape quite brightly. This, at a time when nearly all of mankind thought that the Earth was the center of the Universe! Now I was an amateur astronomer as a teenager, having ground and polished the mirrors for two 6-inch homemade reflecting telescopes, and I must have observed hundreds of crescent Moons, but I never once made that connection! Similarly, in On the Origin of Species, Darwin asks, if all the plants and animals on the Earth were suddenly created in one shot, how could you possibly stop evolution from immediately commencing? Why the very next moment, they would all be chomping on each other and trying to grab sunlight from their neighbors. It would be like the commotion that follows one of those moments of silence in the trading pits of the Chicago Board of Trade.
Now for those of you with an intelligent design bent, which according to surveys includes about 2/3 of Americans, think of it this way. Although nobody designed the national economy of the United States, somebody did design the legal system that makes capitalism possible. Somebody set up the laws that allow for the private ownership of property, the right to freely enter into and enforce contracts, and the legal right to transfer property from one person to another. Intelligent beings also set up laws which rein in some of the harsher strategies of capitalism, such as killing your competitors and stealing all their inventory, dumping industrial wastes on your competitor’s land, and hopefully someday, freely dumping carbon dioxide into the atmosphere owned by everybody. Complex capitalistic economies did not spontaneously arise in the Soviet Union in the 1930s or in feudal Europe in the 10th century because their legal systems could not sustain capitalism. So if you really want to pursue the intelligent design thing, I would personally go after the physical laws of the Universe. In cosmology this is known as the strong Anthropic Principle. You see, what is driving everybody crazy just now is that the physical laws of the Universe appear as though they were designed for intelligent life! As I mentioned previously, if you change any of the 20+ constants of the Standard Model of particle physics by just a few percent, you end up with a Universe incapable of supporting life. Astrophysicist Brandon Carter first coined the term the “anthropic principle” in 1973 and came up with two versions - the weak and strong Anthropic Principles. The weak version states that intelligent beings will only find themselves in a universe capable of supporting intelligent life, while the strong version states that the Universe must contain intelligent life.
The weak version might sound like a “no-brainer”, like living things will not find themselves evolving on planets incapable of supporting life, and that is why there is nobody on Mercury contemplating the Universe, but it does have some predictive capabilities too. There was an old IT tactic from the 1960s that you could use whenever you heard a really stupid idea. All you had to say was, “Sure you could do that, but then Payroll would not run”. Here is a similar example from the physical Universe. About 3 minutes after the Big Bang, 14 billion years ago, the Universe consisted of about 75% hydrogen nuclei (protons) and 25% helium-4 nuclei (two protons and two neutrons) by mass. The heavier atoms like carbon, oxygen, and nitrogen that living things are made of began to form in the cores of stars about a billion years later, as stars fused hydrogen and helium into increasingly heavier elements. When these stars blew themselves apart in supernovae, they spewed out the heavier elements that we are made of into the interstellar medium out of which our Solar System later formed. The problem is that there is no stable nucleus with a mass number of 5, so you cannot fuse a proton and a helium-4 nucleus together. However, you can fuse two helium-4 nuclei into a beryllium-8 nucleus. The beryllium-8 can then collide with another helium-4 nucleus to form carbon-12. This is all accomplished by the strong interaction of particle physics that we previously studied. The problem is that beryllium-8 is very unstable, and it should break apart in a collision with another helium-4 nucleus. But in 1954, astrophysicist Fred Hoyle predicted that carbon-12 must have a resonance near the likely collision energy of helium-4 and beryllium-8 to absorb the energy of the collision. This resonance would allow the newly formed carbon-12 nucleus to temporarily exist in an excited quantum state before it radiated the excess collision energy away.
Otherwise, there would be no carbon, and we would not be here worrying about how carbon came into being. When skeptical nuclear physicists looked, sure enough, they found that carbon-12 had a resonance just 4% above the rest mass energy of helium-4 and beryllium-8 to absorb the collision energy, just as Hoyle had predicted. Now it gets even stranger. It turns out that oxygen-16 also has a resonance which is 1% below the rest mass energy of carbon-12 and helium-4. Because there is no resonance above the rest mass energy of carbon-12 and helium-4 for the oxygen-16 nucleus to absorb the collision energy of carbon-12 and helium-4, it is a rare interaction, and that is why all the carbon-12 in the Universe has not been turned into oxygen-16! If that had happened, again we would not be here worrying about carbon. So something very strange indeed seems to be going on! Here are a few suggested explanations, all of which have about the same amount of supporting evidence:
1. There are an infinite number of universes forming a multiverse and intelligent beings only find themselves in universes capable of supporting intelligent life. This is the explanation that most of the scientific community seems to be gravitating towards, especially the cosmologists and string theorists, because it is beginning to look like you can construct an infinite number of universes out of the vibrating strings and membranes of string theory. See Leonard Susskind’s The Cosmic Landscape: String Theory and the Illusion of Intelligent Design (2005).
2. Lee Smolin’s Darwinian explanation found in his book The Life of the Cosmos (1997). Universes give birth to new universes in the center of black holes. It is postulated that universes can pass a variation of their laws on to their children universes, so universes that can produce black holes will produce lots of children that can also produce black holes and will soon outcompete universes with laws that do not produce lots of black holes. Thus, universes that can produce lots of black holes will come to dominate the multiverse. Since it takes a long time to produce black holes and just the right nuclear chemistry, intelligent life arises as a byproduct of black hole creation.
3. The physical Universe is a computer simulation created by other intelligent beings. See Konrad Zuse’s Calculating Space (1967) at:
or Nick Bostrom’s Are You Living in a Computer Simulation? (2002) at:
or Paul Davies' Cosmic Jackpot: Why Our Universe Is Just Right for Life (2007)
4. There really is a supreme being that created the physical laws of the Universe. This is a perfectly acceptable explanation and is favored by about 90% of Americans.
5. There is some other strange explanation we cannot even imagine, as Sir Arthur Eddington noted - “the universe is not only stranger than we imagine, it is stranger than we can imagine.”
As an 18th century liberal and 20th century conservative, I think it is very important to keep an open mind for all explanations of the Anthropic Principle. I am very much appalled by the political correctness of both 21st century liberals and 21st century conservatives. Between the two, it is impossible these days to carry on a civil discussion of any idea whatsoever. For me, religions are just another set of effective theories that should stand on their own merit. You know me, I try not to believe in things, so I always try to assign a level of confidence to any effective theory, but at the same time, not entirely rule any of them out of hand.
It seems that some conservative religious people have difficulties with some of these issues because they put great store in certain written words. However, I would question if it is even possible to convey absolute truth symbolically, especially in words, since every word in the dictionary is defined in terms of other words in the dictionary. For example, mathematics is a marvelous way to symbolically convey information, but even mathematics has its limitations. As we have seen, all physicists agree upon the mathematics underlying quantum mechanics, but none of them seem to exactly agree on what the mathematics is trying to tell us. That is why we have the Copenhagen interpretation, the Many-Worlds interpretation, the Decoherent Histories interpretation, and John Cramer’s Transactional Interpretation of quantum mechanics. I think the same goes for written words. That is why religions frequently splinter into so many factions. Unfortunately, many wars have been fought over the interpretation of such words. After all, we are all just trying to figure out what this is all about. I suggest that we should try to do this together, with a respect for the opinions of others. You never know, they just might be “right”!
With that said, let’s get back to Darwin. It all began in 1666 when Niels Stensen, also known as Steno, received the carcass of a shark caught off the coast of Livorno in Italy. As Steno examined the shark, he was struck by how similar the shark’s teeth were to “tongue stones,” triangular pieces of rock that had been found in nearby cliffs. Steno reasoned that the local area must have been under water at one time and that the rocks of the cliffs had been deposited as horizontal layers in a shallow sea, and that is how the shark teeth got into the rocks. He also realized that the older rock layers must be near the bottom of the cliffs, while the younger layers were deposited on top of them. In geology, these ideas are now known as Steno’s Laws. Thanks to Steno’s observations, when we look at a layered outcrop along a roadside cut, we now know that the rocks near the bottom of the sequence are older than the rocks near the top. As with most of geology, this may seem like a real “no-brainer”, until you realize that before you heard of this idea, you had probably looked at hundreds of pictures of layered rock formations, but never made the connection yourself. As I mentioned in our discussion of global warming, sea level can change significantly as the Earth’s ice caps expand or contract. Also, plate tectonics can cause the uplift or subsidence of large portions of the Earth’s continental crust, and since much of this crust is very close to being at sea level, the oceans of the Earth can wash over much of the land when sea level rises. Geologists call the sea washing over the land a transgression and, when it recedes, a regression. During a transgression, sandy beach deposits may form which turn into sandstone, or deep water muddy deposits may be laid down to form shale. At intermediate depths, one might find a coral reef forming that ends up becoming limestone. 
During a regression, when the land is once again exposed to the air, some of these sediments will be eroded away. Thus a good geologist can look at an outcrop, and with the sole support of his trusty hand lens to magnify the sedimentary grains, come up with a complicated history of deposition and erosion, like a very good crime scene investigator – quite an impressive feat. The point is that when you look at an outcrop, you will find successive layers of sandstone, shale, and limestone, and each will contain an association of fossils that makes sense. You will find corals in the limestone reefs, broken shells from brachiopods in the sandstone deposits, and graptolite fossils in the deep water shale deposits.
Stratigraphy was further enhanced in the early 19th century due to the excavation of canals on a large scale. British engineer William Smith (1769-1839) was in charge of excavating the Somerset Canal, which required Smith to do a great deal of surveying and mapping of the rock formations along the proposed canal. Smith observed that fossils did not appear at random throughout the stratigraphic column. Instead, he found that certain fossils were always found together in an association and these associations changed from the bottom to the top of a stratigraphic section as we just discussed. And this same ordering of fossil assemblages could also be seen in rock sections on the other side of England as well. As Smith described it,
“. . . each stratum contained organized fossils peculiar to itself, and might, in cases otherwise doubtful, be recognised and discriminated from others like it, but in a different part of the series, by examination of them.”
By mapping the succession of fossils found in different rock formations, Smith was able to show that living things appeared and disappeared throughout geologic time. Around the same time, Georges Cuvier and Alexandre Brongniart were also mapping the Paris Basin. Cuvier noticed that the more ancient fossils were, the less they resembled their present day counterparts. Thus the idea that fossils showed change throughout geological time was well accepted by the 1840s.
Look at it this way. Suppose you start digging a hole into your local landfill. As you dig into the pile of garbage, you will first come across beer cans with pop-tops that stay affixed to the top of the beer can, now known as stay-tabs in the beer industry. These cans will consist of two pieces, a top piece attached to an extruded bottom which forms both the walls and the bottom of the can as one piece. As you dig down a little deeper, the beer cans with stay-tabs will disappear (1975), and instead, you will only find beer cans with holes in their tops where pull-tabs had been removed. A little deeper you will find two species of beer cans appear, one consisting of two pieces as we have already seen, and a new species consisting of three parts, a top, a cylindrical wall section, and a bottom. As you dig deeper still, the two-piece beer cans will decline and the three-piece beer cans will increase in number. By the early 1970s, the two-piece beer cans will totally disappear, and all you will find will be the three-piece species of beer cans. A little deeper yet, and you will find the beer cans with pull-tabs disappear (1963), to be replaced by beer cans with triangular puncture holes in their tops from “churchkey” can openers, and these cans will be made of steel and not aluminum. A little deeper still (1960), the flat-topped beer cans will diverge and you will again find two species of beer cans, one with flat tops and the other species will be a strange looking beer can with a cone-shaped top, known in the industry as a cone-top. On rare occasions, you will actually find one of these strange cone-top beer cans with the bottle cap still affixed to the cone-shaped top. These strange beer cans look a lot more like beer bottles than beer cans and were, in fact, run down the same bottling line at breweries as bottles, allowing the breweries to experiment with the newfangled beer cans, without having to invest in a new bottling line.
As you dig deeper still, the percentage of the strange cone-top beer cans will increase and the percentage with flat tops will dwindle, until eventually you find layers of garbage from the late 1940s that only contain the strange cone-top beer cans. Still deeper, you will reach a layer of garbage that does not contain any beer cans at all (1935), but you will notice that the number of beer bottles will have increased.
Figure 1 – A cone-top beer can with attached bottle cap from the cone-top period
Now if you dig similar holes into other neighboring landfills, you will always find the same sequence of changes in beer cans as you dig down deeper. In fact, you will find the same sequence in any landfill in the country. Knowing the evolutionary history of beer cans can be of great use in dating landfills. Recent landfills will only contain two-piece beer cans with stay-tabs, while very ancient landfills that were abandoned in the 1950s, will only contain steel beer cans with puncture holes, and long-lived landfills will contain the entire sequence of beer cans. Knowing the evolution of beer cans also allows you to correlate the strata of garbage in one landfill with the strata of garbage in other landfills. If you first find pull-tab beer cans 100 feet below the surface in landfill A and 200 feet below the surface at landfill B, you know that these strata of garbage were deposited at the same time. It also allows you to date other garbage. Using beer can chronology, you can tell that hula hoops (1957) first appeared in the late steel beer can period and just before the aluminum pull-tab (1963) period. When I explored for oil we used the same trick. By looking at the tiny marine fossils that came up with the drill bit cuttings, we could tell what the age of the rock was, as we drilled down through the stratigraphic column, and this allowed us to correlate stratigraphic layers between wells. So now you know the real reason why they call oil a fossil fuel!
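The dating trick described above amounts to intersecting the time ranges of everything found together in one layer. Here is a small sketch of that idea, using the approximate beer can dates from the landfill story. Two of the numbers are my own stand-ins: 1948 for "the late 1940s" and 2008 (the year of this posting) for "still in use". The names CAN_RANGES and date_layer are purely illustrative.

```python
# Approximate (first_year, last_year) for each "index fossil" beer can.
# 1948 stands in for "late 1940s" and 2008 for "still in use" (both assumptions).
CAN_RANGES = {
    "cone-top": (1935, 1960),
    "flat-top churchkey": (1948, 1963),
    "pull-tab": (1963, 1975),
    "stay-tab": (1975, 2008),
}


def date_layer(can_types):
    """Return the (earliest, latest) years consistent with every can type in a layer."""
    earliest = max(CAN_RANGES[can][0] for can in can_types)
    latest = min(CAN_RANGES[can][1] for can in can_types)
    if earliest > latest:
        # The ranges do not overlap, so the layer must have been mixed or disturbed.
        raise ValueError("inconsistent assemblage: " + ", ".join(can_types))
    return earliest, latest


# A layer holding both cone-tops and churchkey flat-tops must date from the overlap.
print(date_layer(["cone-top", "flat-top churchkey"]))
# A layer holding only pull-tabs is bracketed by the pull-tab period.
print(date_layer(["pull-tab"]))
```

The more species of beer can (or fossil) that turn up together in a layer, the narrower the window becomes, and an assemblage whose ranges never overlap, such as cone-tops mixed with stay-tabs, flags a layer that has been churned up after it was deposited.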
As you dig in landfills across the nation, you will also notice that the garbage in each layer makes sense as an ecological assemblage. You will not find any PCs in the layers from the 1960s, but you will begin to see them in the layers from the 1980s, and in those layers you will find an association of PC chassis, monitors, keyboards, floppy disks, mice, mouse pads, external disk drives, modems, and printers. You will not find these items randomly mixed throughout all the layers of garbage from top to bottom, but you will find that all these items do gradually evolve over time as you examine garbage layers from the 1980s and 1990s. This is the same thing that William Smith noted about the fossils he uncovered while excavating for the Somerset Coal Canal. The fact that both garbage and fossil-containing sediments seem to be laid down in ecological assemblages that make sense leads one to think that these layers were laid down in sequence over a great deal of time, following Steno’s Laws, with the older layers near the bottom and the younger layers on top. Now it is possible that somebody could have laid down 1,000 feet of garbage all at once, in just such a way as to make it appear as though it were laid down over many years, just as somebody could have laid down 30,000 feet of sediment in the Gulf of Mexico to make it appear as though it had been laid down over millions of years, complete with the diapiric salt domes, normal faults, and stratigraphic traps that oil companies look for in their quest for oil and gas. This would, at first, present a great mental challenge for the oil industry, but I guess they could just continue exploring for oil as if the 30,000 feet of sediments had been deposited over many millions of years and still find oil and gas. In science, we generally apply Occam's razor to cut away all the unnecessary assumptions and just go with the simplest explanation.
So what happened? Apparently, glass beer bottles evolved into steel beer cans with cone-shaped tops that could be capped with a standard bottle cap in 1935. These beer cans were lighter than beer bottles and did not have to be returned to the brewery for refilling, a very desirable feature for both consumers and brewers. So the steel beer cans, which looked a lot like glass beer bottles, began to invade the economic niche dominated by glass beer bottles. But stacking cone-top beer cans is not so easy because the cone-shaped top has the same disadvantage as the cone-shaped tops of glass beer bottles. So one brewery, in the late 1940s, came up with the clever innovation of producing flat-topped beer cans. Of course, this required the investment in a new bottling line and a whole new manufacturing process, but the risk proved worthwhile given the competitive advantage of stacking the cans during shipment, storage in warehouses and liquor stores, and even in the customer’s refrigerator. But these flat-top beer cans required the development of a special can opener, now known as a “churchkey” can opener, to puncture the steel can tops. Under competitive pressure throughout the 1950s, more and more breweries were forced to adopt the flat-top beer can, so by 1960 all of the cone-top beer cans had disappeared. The cone-top beer can essentially went extinct. Aluminum was first used for frozen orange juice cans in 1960, and in 1961, Reynolds Metals Co. did a marketing survey that showed that consumers preferred the lightweight aluminum cans over the heavy traditional steel orange juice cans. The lighter aluminum cans also reduced shipping costs, and aluminum cans could be more easily crushed with the customer’s bare hands, enhancing one of the most prevalent mating displays of Homo sapiens.
So breweries began to can beer in aluminum cans in the early 1960s, but the churchkey can openers did not really work very well for the soft aluminum cans, so the pull-tab top was invented in 1963. This caused the churchkey population to crash to near extinction, as is clearly demonstrated in our landfill stratigraphic sections. However, the churchkeys were able to temporarily find refuge in other economic niches, opening things like quart cans of oil, fruit juice cans, and cans of tomato juice, but as those niches also began to disappear, along with their associated churchkey habitat, the population of churchkeys dropped dangerously low. Thankfully, you can still find churchkeys today, thanks to their inclusion on one of the first endangered species lists, established during the environmentally aware 1970s. The same cannot be said for the traditional flat-top beer can without a pull-tab. They went extinct in the early 1960s. In the early 1970s, breweries began to abandon the traditional three-piece aluminum cans for the new two-piece cans, which were easier to fabricate. Now the pull-tab beer cans also had a problem. Customers would frequently drop the pull-tab into the beer can just before taking the first swig of beer. Under rare circumstances, a customer might end up swallowing the pull-tab! The first rule they teach you in the prestigious business schools is not to kill your customer with the product, at least not for a long time. So in 1975, the brewing industry came out with the modern two-piece beer can with a stay-tab top.
In our exploration of landfills we had one advantage over the early geologists of the 19th century. Mixed in with the garbage and the beer cans we also found discarded copies of Scientific American, the longest-running magazine in the United States, with publication dates on them that allowed us to know the absolute date that each layer of garbage was laid down, and this allowed us to assign absolute time ranges to each age of beer can. Using the Scientific Americans we could tell when each type of beer can first appeared and when it went extinct. This same windfall happened for geologists early in the 20th century, when physicists working with radioactive elements discovered that radioactive atoms decayed with a specific half-life. For example, uranium-238 has a half-life of 4.5 billion years. This means that if you have a pound of uranium-238 today, in 4.5 billion years you will have ½ pound and in 9.0 billion years ¼ pound, with the remainder having turned into lead. So by measuring the ratio of uranium-238 to lead in little crystals of zircon in a sample of volcanic rock, you can determine when it solidified. To nail down the absolute ages of rocks in a stratigraphic section, you look for outcrops with periodic volcanic deposits that have sedimentary layers between them. This allows you to bracket the age of the sandwiched sedimentary layers and the fossils within them. Once you have a date range for the fossils, you can date outcrops that do not have volcanic deposits by just using the fossils.
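For IT readers, the half-life arithmetic above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: it uses the rounded 4.5 billion year half-life quoted above, it counts atoms rather than pounds, and it assumes the zircon crystal started out with no lead and has neither gained nor lost uranium or lead since it solidified. The function names are hypothetical and chosen just for illustration.

```python
import math

# Rounded half-life of uranium-238 quoted in the text (an approximation).
U238_HALF_LIFE_YEARS = 4.5e9


def fraction_remaining(elapsed_years, half_life_years=U238_HALF_LIFE_YEARS):
    """Fraction of the original parent atoms left after elapsed_years."""
    return 0.5 ** (elapsed_years / half_life_years)


def age_from_lead_uranium_ratio(lead_atoms, uranium_atoms,
                                half_life_years=U238_HALF_LIFE_YEARS):
    """Estimate the years since the crystal solidified.

    Assumes every lead atom counted came from a decayed uranium atom, so
    original = uranium + lead and original / uranium = 2 ** (age / half_life).
    """
    if uranium_atoms <= 0 or lead_atoms < 0:
        raise ValueError("need a positive uranium count and a non-negative lead count")
    return half_life_years * math.log2(1.0 + lead_atoms / uranium_atoms)


if __name__ == "__main__":
    # One half-life and two half-lives, as in the pound-of-uranium example.
    print(fraction_remaining(4.5e9), fraction_remaining(9.0e9))
    # A crystal holding one lead atom for every uranium atom.
    print(age_from_lead_uranium_ratio(1.0, 1.0))
```

A crystal with equal numbers of lead and uranium atoms comes out one half-life old, and a crystal with three lead atoms for every uranium atom comes out two half-lives old, which matches the ½ pound and ¼ pound figures above.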
High School Biology From an IT Perspective
Now let’s see how Darwin’s theory of evolution and entropy dumping led to the complex biosphere we see today and its corresponding equivalent in the Software Universe. Of course, I cannot cover all of high school biology in one posting, but I will try to focus on some important biological themes from an IT perspective that should prove helpful to IT professionals. All living things are composed of cells, and as we shall see, cells are really very little nanotechnology factories that build things one molecule at a time. To do so, they require huge amounts of information, which is processed in parallel at tremendous transaction rates.
Like the physical Universe, the Software Universe is also populated by living things. In my youth, we called these living things "computer systems", but today we call them "Applications". The Applications exist by exchanging information with each other, and sadly, are parasitized by viruses and worms and must also struggle with the second law of thermodynamics and nonlinearity. Since the beginning of the Software Universe, the architecture of the Applications has evolved through a process of innovation and natural selection that has followed a path very similar to the path followed by living things on Earth. I believe this has been due to what evolutionary biologists call convergence. For example, as Richard Dawkins has pointed out, the surface of the Earth is awash in a sea of visible photons, and the concept of the eye has independently evolved more than 40 times on Earth over the past 600 million years to take advantage of them. An excellent treatment of the significance that convergence has played in the evolutionary history of life on Earth, and possibly beyond, can be found in Life’s Solution (2003) by Simon Conway Morris. Programmers and living things both have to deal with the second law of thermodynamics and nonlinearity, and there are only a few optimal solutions. Programmers try new development techniques, and the successful techniques tend to survive and spread throughout the IT community, while the less successful techniques are slowly discarded. Over time, the population distribution of software techniques changes.
As with the evolution of living things on Earth, the evolution of software has been greatly affected by the physical environment, or hardware, upon which it ran. Just as the Earth has not always been as it is today, the same goes for computing hardware. The evolution of software has been primarily affected by two things - CPU speed and memory size. As I mentioned in So You Want To Be A Computer Scientist?, the speed and memory size of computers have both increased by about a factor of a billion since Konrad Zuse built the Z3 in the spring of 1941, and the rapid advances in both and the dramatic drop in their costs have shaped the evolutionary history of software greatly.
The two major environmental factors affecting the evolution of living things on Earth have been the amount of solar energy arriving from the Sun and the atmospheric gases surrounding the Earth that hold it in. The size and distribution of the Earth’s continents and oceans have also had an influence on the Earth’s overall environmental characteristics, as the continents shuffle around the surface of the Earth, responding to the forces of plate tectonics. For example, billions of years ago the Sun was actually less bright than it is today. Our Sun is a star on the main sequence that is using the proton-proton reaction and the carbon-nitrogen-oxygen cycle in its core to turn hydrogen into helium-4, and consequently, turn matter into energy that is later radiated away from its surface, ultimately reaching the Earth. As a main-sequence star ages, it begins to shift from the proton-proton reaction to relying more on the carbon-nitrogen-oxygen cycle, which runs at a higher temperature. Thus, as a main-sequence star ages, its core heats up and it begins to radiate more energy at its surface. In fact, the Sun radiates about 30% more energy today than it did about 4.5 billion years ago, when it first formed and entered the main sequence. This increase in the Sun’s radiance has been offset by a corresponding drop in greenhouse gases, like carbon dioxide, over this same period of time; otherwise the Earth’s oceans would have vaporized long ago, and the Earth would now have a climate more like that of Venus, which has a surface temperature that melts lead. Using some simple physics, you can quickly calculate that if the Earth did not have an atmosphere containing greenhouse gases like carbon dioxide, the surface of the Earth would be on average 27 °F cooler today and totally covered by ice.
Thankfully, there has been a long-term decrease in the amount of carbon dioxide in the Earth’s atmosphere, principally caused by living things extracting carbon dioxide from the air to make carbon-based organic molecules. These molecules later get deposited into sedimentary rocks, which plunge back into the Earth at the many subduction zones around the world that result from plate tectonic activity. For example, hundreds of millions of years ago, the Earth’s atmosphere contained about 10 - 20 times as much carbon dioxide as it does today. So greenhouse gases like carbon dioxide play a critical role in keeping the Earth’s climate in balance and suitable for life.
The third factor that has greatly affected the course of evolution on Earth has been the occurrence of periodic mass extinctions. In our landfill exercise above, we saw how the extinction of certain beer can species could be used to mark the stratigraphic sections of landfills. Similarly, in 1860, John Phillips, an English geologist, recognized three major geological eras based upon dramatic changes in fossils brought about by two mass extinctions. He called the eras the Paleozoic (Old Life), the Mesozoic (Middle Life), and the Cenozoic (New Life), defined by mass extinctions at the Paleozoic-Mesozoic and Mesozoic-Cenozoic boundaries:
Cenozoic 65 my – present
=================== <= Mass Extinction
Mesozoic 250 my – 65 my
=================== <= Mass Extinction
Paleozoic 541 my – 250 my
Of course, John Phillips knew nothing of radiometric dating in 1860, so these geological eras only provided a means of relative dating of rock strata based upon fossil content, like our beer cans in the landfills. The absolute date ranges only came later in the 20th century, with the advent of radiometric dating of volcanic rock layers found between the layers of fossil-bearing sedimentary rock. It is now known that we have actually had five major mass extinctions since multicellular life first began to flourish about 541 million years ago, and there have been several lesser extinction events as well. The three geological eras have been further subdivided into geological periods, like the Cambrian at the base of the Paleozoic, the Permian at the top of the Paleozoic, the Triassic at the base of the Mesozoic, and the Cretaceous at the top of the Mesozoic. Figure 2 shows an “Expand All” of the current geological time scale now in use. Notice how small the Phanerozoic is, the eon comprising the Paleozoic, Mesozoic, and Cenozoic eras in which complex plant and animal life are found. Indeed, the Phanerozoic represents only the last 12% of the Earth’s history – the first 88% of the Earth’s history was dominated by simple single-celled forms of life like bacteria.
Figure 2 – The geological time scale
Currently, it is thought that these mass extinctions arise from two different sources. One type of mass extinction is caused by the impact of a large comet or asteroid, and has become familiar to the general public as the Cretaceous-Tertiary (K-T) mass extinction that wiped out the dinosaurs at the Mesozoic-Cenozoic boundary 65 million years ago. An impacting mass extinction is characterized by a rapid extinction of species followed by a corresponding rapid recovery in a matter of a few million years. An impacting mass extinction is like turning off a light switch. Up until the day the impactor hits the Earth, everything is fine and the Earth has a rich biosphere. After the impactor hits the Earth, the light switch turns off and there is a dramatic loss of species diversity. However, the effects of the incoming comet or asteroid are geologically brief and the Earth’s environment returns to normal in a few decades or less, so within a few million years or so, new species rapidly evolve to replace those that were lost.
The other kind of mass extinction is thought to arise from an overabundance of greenhouse gases and a dramatic drop in oxygen levels, and is typified by the Permian-Triassic (P-T) mass extinction at the Paleozoic-Mesozoic boundary 250 million years ago. Greenhouse extinctions are thought to be caused by periodic flood basalts, like the Siberian Traps flood basalt of the late Permian. A flood basalt begins as a huge plume of magma several hundred miles below the surface of the Earth. The plume slowly rises and eventually breaks the surface of the Earth, forming a huge flood basalt that spills basaltic lava over an area of millions of square miles to a depth of several miles. Huge quantities of carbon dioxide bubble out of the magma over a period of several hundred thousand years and greatly increase the ability of the Earth’s atmosphere to trap heat from the Sun. For example, during the Permian-Triassic mass extinction, carbon dioxide levels may have reached a level as high as 3,000 ppm, much higher than the current 385 ppm. Most of the Earth warms to tropical levels with little temperature difference between the equator and the poles. This shuts down the thermohaline conveyor that drives the ocean currents. Currently, the thermohaline conveyor begins in the North Atlantic, where high winds and cold polar air reduce the temperature of ocean water through evaporation and concentrate its salinity, making the water very dense. The dense North Atlantic water, with lots of dissolved oxygen, then descends to the ocean depths and slowly winds its way around the entire Earth, until it ends up back on the surface in the North Atlantic several thousand years later. When this thermohaline conveyor stops for an extended period of time, the water at the bottom of the oceans is no longer supplied with oxygen, and only bacteria that can live on sulfur compounds manage to survive in the anoxic conditions.
These sulfur-loving bacteria metabolize sulfur compounds to produce large quantities of highly toxic hydrogen sulfide gas, the distinctive component of the highly repulsive odor of rotten eggs, which has a severe damaging effect on both marine and terrestrial species. The hydrogen sulfide gas also erodes the ozone layer, allowing damaging ultraviolet light to reach the Earth’s surface, while oxygen levels drop to a suffocating low of 12% of the atmosphere, compared to the current level of 21%. The ultraviolet light beats down relentlessly upon animal life gasping for breath in an oxygen-poor atmosphere, even at sea level, and destroys its DNA. The combination of severe climate change, changes to atmospheric and oceanic oxygen levels and temperatures, the toxic effects of hydrogen sulfide gas, and the loss of the ozone layer causes a slow extinction of many species over a period of several hundred thousand years. And unlike an impacting mass extinction, a greenhouse mass extinction does not quickly reverse itself, but persists for millions of years until the high levels of carbon dioxide are flushed from the atmosphere and oxygen levels rise. In the stratigraphic section, this is seen as a thick section of rock with decreasing numbers of fossils and fossil diversity leading up to the mass extinction, and a thick layer of rock above the mass extinction level with very few fossils at all, representing the long recovery period of millions of years required to return the Earth’s environment to a more normal state. There are a few good books by Peter Ward that describe these mass extinctions more fully: The Life and Death of Planet Earth (2003), Gorgon – Paleontology, Obsession, and the Greatest Catastrophe in Earth’s History (2004), and Out of Thin Air (2006).
There is also the disturbing Under a Green Sky (2007), which posits that we might be initiating a human-induced greenhouse gas mass extinction by burning up all the fossil fuels that have been laid down over hundreds of millions of years in the Earth’s strata. In the last portion of Software Chaos, I described how you as an IT professional can help avert such a disaster.
In my next posting on Self-Replicating Information, we shall see that ideas also evolve with time. Over the past 30 years there has been just such a paradigm shift in paleontology, beginning with the discovery in 1980 by Luis Alvarez of a thin layer of iridium-rich clay at the Cretaceous-Tertiary (K-T) mass extinction boundary, which has been confirmed in deposits throughout the world. This discovery, along with the presence of shocked quartz grains in these same layers, convinced the paleontological community that the K-T mass extinction was the result of an asteroid or comet strike upon the Earth. Prior to the Alvarez discovery, most paleontologists thought that mass extinctions resulted from slow environmental changes that occurred over many millions of years. Within the past 10 years, there has been a similar shift in thinking about the mass extinctions that are not caused by impactors. Rather than ramping up over many millions of years, the greenhouse extinctions seem to unfold in a few hundred thousand years or less, which is a snap of the fingers in geological time.
In James Hutton’s Theory of the Earth (1785) and Charles Lyell’s Principles of Geology (1830), the principle of uniformitarianism was laid down in the late 18th and early 19th centuries. Uniformitarianism is a geological principle that states that “the present is the key to the past”. If you want to figure out how a 100 million-year-old cross-bedded sandstone came to be, just dig into a point bar on a modern-day river and take a look. Uniformitarianism contends that the Earth has been shaped by slow-acting geological processes that can still be observed at work today. Uniformitarianism replaced the catastrophism of the 18th century, which proposed that the geological structures of the Earth were caused by short-term catastrophic events like Noah’s flood. In fact, the names for the Tertiary and Quaternary geological periods actually come from those days! In the 18th century, it was thought that the water from Noah’s flood receded in four stages - Primary, Secondary, Tertiary and Quaternary, and each stage laid down different kinds of rock as it withdrew. Now since most paleontologists are really geologists who have specialized in studying fossils, the idea of uniformitarianism unconsciously crept into paleontology as well. Because uniformitarianism proposed that the rock formations of the Earth slowly changed over immense periods of time, so too must the Earth’s biosphere have slowly changed over long periods of time, and therefore, the mass extinctions must have been caused by slow-acting environmental changes occurring over many millions of years.
But now we have come full circle. Yes, uniformitarianism may be very good for describing the slow evolution of hard-as-nails rocks, but maybe not so good for the evolution of squishy living things that are much more sensitive to things like asteroid strikes or greenhouse gas emissions that mess with the Earth’s climate over geologically brief periods of time. Uniformitarianism may be the general rule for the biosphere, as Darwin’s mechanisms of innovation and natural selection slowly work upon the creatures of the Earth. But every 100 million years or so, something goes dreadfully wrong with the Earth’s climate and environment, and Darwin’s process of natural selection comes down hard upon the entire biosphere, winnowing out perhaps 70% - 90% of the species on Earth that cannot deal with the new geologically temporary conditions. This causes dramatic evolutionary effects. For example, the Permian-Triassic (P-T) mass extinction cleared the way for the surviving reptiles to evolve into the dinosaurs that ruled the Mesozoic, and the Cretaceous-Tertiary (K-T) mass extinction did the same for the rodent-like mammals that went on to conquer the Cenozoic, ultimately producing a species capable of producing software.
Similarly, the evolutionary history of software over the past 2.1 billion seconds (68 years) has also been greatly affected by a series of mass extinctions, which allow us to also subdivide the evolutionary history of software into several long computing eras, like the geological eras listed above. As with the evolution of the biosphere over the past 541 million years, we shall see that these mass extinctions of software have also been caused by several catastrophic events in IT that were separated by long periods of slow software evolution through uniformitarianism.
Unstructured Period (1941 – 1972)
During the Unstructured Period, programs were simple monolithic structures with lots of GOTO statements, no subroutines, no indentation of code, and very few comment statements. The machine code programs of the 1940s evolved into the assembler programs of the 1950s and the compiled programs of the 1960s, with FORTRAN appearing in 1957 and COBOL in 1959. These programs were very similar to the early prokaryotic bacteria that appeared over 4,000 million years ago on Earth and lacked internal structure. Bacteria essentially consist of a tough outer cell wall enclosing an inner cell membrane and contain a minimum of internal structure. The cell wall is composed of a tough molecule called peptidoglycan, which is composed of tightly bound amino sugars and amino acids. The cell membrane is composed of phospholipids and proteins, which will be described later in this posting. The DNA within bacteria generally floats freely as a large loop of DNA, and their ribosomes, used to help translate the information in DNA into proteins, float freely as well and are not attached to membranes called the rough endoplasmic reticulum. The chief advantage of bacteria is their simple design and ability to thrive and rapidly reproduce even in very challenging environments, like little AK-47s that still manage to work in environments where modern tanks fail. Just as bacteria still flourish today, some unstructured programs are still in production.
Figure 3 – A simple prokaryotic bacterium with little internal structure
Below is a code snippet from a fossil FORTRAN program listed in a book published in 1969 showing little internal structure. Notice the use of GOTO statements to skip around in the code. Later this would become known as the infamous “spaghetti code” of the Unstructured Period that was such a joy to support.
   30 DO 50 I=1,NPTS
   31 IF (MODE) 32, 37, 39
   32 IF (Y(I)) 35, 37, 33
   33 WEIGHT(I) = 1. / Y(I)
      GO TO 41
   35 WEIGHT(I) = 1. / (-1*Y(I))
      GO TO 41
   37 WEIGHT(I) = 1.
      GO TO 41
   39 WEIGHT(I) = 1. / SIGMA(I)**2
   41 SUM = SUM + WEIGHT(I)
      YMEAN = WEIGHT(I) * FCTN(X, I, J, M)
      DO 44 J = 1, NTERMS
   44 XMEAN(J) = XMEAN(J) + WEIGHT(I) * FCTN(X, I, J, M)
The primitive nature of software in the Unstructured Period was largely due to the primitive nature of the hardware upon which it ran. Figure 4 shows an IBM System/360 mainframe from 1964 – notice the operator at the teletype feeding commands to the nearby operator console, the distant tape drives, and the punch card reader in the mid-ground. Such a machine had about 1 MB of memory, less than 1/1000 of the memory of a current $500 PC, and a matching anemic processing speed. For non-IT readers, let me remind all that:
1 K = 1 kilobyte = 2^10 = 1024 bytes or about 1,000 bytes
1 MB = 1 megabyte = 1024 x 1024 = 1,048,576 bytes or about a million bytes
1 GB = 1 gigabyte = 1024 x 1024 x 1024 = 1,073,741,824 bytes or about a billion bytes
One byte of memory can store one ASCII text character like an “A” and two bytes can store a small integer in the range of -32,768 to +32,767. When I first started programming in 1972 we thought in terms of kilobytes, then megabytes, and now gigabytes. Data warehousing people think in terms of terabytes - 1 TB = 1024 GB.
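For readers who like to check such things, the unit arithmetic above can be sketched in a few lines of Python. This is just a minimal sketch: the constant names and the signed_range() helper are hypothetical, chosen for illustration, and the helper assumes the usual two's complement representation of signed integers.

```python
# Memory units as powers of two, following the definitions above.
KB = 2 ** 10          # 1 kilobyte
MB = 1024 * KB        # 1 megabyte
GB = 1024 * MB        # 1 gigabyte
TB = 1024 * GB        # 1 terabyte


def signed_range(num_bytes):
    """Smallest and largest integers that fit in num_bytes bytes,
    assuming two's complement signed integers."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def ascii_size_in_bytes(text):
    """Number of bytes needed to store text as ASCII characters."""
    return len(text.encode("ascii"))


if __name__ == "__main__":
    print(KB, MB, GB, TB)
    print(signed_range(2))
    print(ascii_size_in_bytes("A"))
```

Calling signed_range(2) reproduces the -32,768 to +32,767 range quoted above for a two-byte integer, and a single "A" takes exactly one byte.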
Software was input via punched cards and the output was printed on fan-fold paper. Compiled code could be stored on tape or very expensive disk drives if you could afford them, but any changes to code were always made via punched cards, and because you were only allowed perhaps 128K – 256K of memory for your job, programs had to be relatively small, so simple unstructured code ruled the day. Like the life cycle of a single-celled bacterium, the compiled and linked code for your program was loaded into the memory of the computer at execution time and did its thing in a batch mode, until it completed successfully or abended and died. At the end of the run, the computer’s memory was released for the next program to be run and your program ceased to exist.
However, one should not discount the great advances that were made by the early bacteria billions of years ago or by the unstructured code from the computer systems of the 1950s and 1960s. These were both very important formative periods in the evolution of life and of software on Earth, and examples of both can still be found in great quantities today. For example, it is estimated that about 50% of the Earth’s biomass is still composed of simple bacteria. Your body consists of about 100 trillion cells, but you also harbor about 10 times that number of bacterial cells that are in a parasitic/symbiotic relationship with the “other” cells of your body and perform many of the necessary biochemical functions required to keep you alive, such as aiding with the digestion of food. Your gut contains about 3.5 pounds of active bacteria and about 50% of the dry weight of your feces is bacteria, so in reality, by cell count, we are all composed of about 90% bacteria with only 10% of our cells being “normal” cells.
All of the fundamental biochemical pathways used by living things to create large complex organic molecules from smaller monomers, or to break those large organic molecules back down into simple monomers, were first developed by bacteria billions of years ago. For example, bacteria were the first forms of life to develop the biochemical pathways that turn carbon dioxide, water, and the nitrogen in the air into the organic molecules necessary for life – sugars, lipids, amino acids, and the nucleotides that form RNA and DNA. They also developed the biochemical pathways to replicate DNA and to turn the information in DNA into proteins, and to form complex structures such as cell walls and cell membranes from sugars, amino acids, proteins, and phospholipids. Additionally, bacteria invented the Krebs cycle to break these large macromolecules back down to monomers for reuse and to release and store energy by transforming ADP to ATP. To expand upon this, we will see in Software Symbiogenesis how Lynn Margulis has proposed that all the innovations of large macroscopic forms of life have actually been acquired from the highly productive experiments of bacterial life forms.
Similarly, all of the fundamental coding techniques of IT at the line of code level were first developed in the Unstructured Period of the 1950s and 1960s, such as the use of complex variable names, arrays, nested loops, loop counters, if-then-else logic, list processing with pointers, I/O blocking, bubble sorts, etc. Now that I am in Middleware Operations, I do not do much coding anymore. However, I do write a large number of Unix shell scripts to help make my job easier. These Unix shell scripts are small unstructured programs in the range of 10 – 50 lines of code, and although they are quite primitive and easy to write, they have a huge economic pay-off for me. Many times, a simple 20-line Unix shell script that took less than an hour to write will provide as much value to me as the code behind the IBM WebSphere Console, which I imagine probably cost IBM about $10 - $100 million to develop and comes to several hundred thousand lines of code. So if you add up all the little unstructured Unix shell scripts, DOS .bat files, edit macros, Excel spreadsheet macros, Word macros, etc., I bet that at least 50% of the software in the Software Universe is still unstructured code.
Figure 4 – An IBM System/360 mainframe from 1964
Figure 5 – A punch card from the Unstructured Period
Structured Period (1972 – 1992)
The increasing availability of computers with more memory and faster CPUs allowed for much larger programs to be written in the 1970s, but unstructured code became much harder to maintain as it grew in size, so the need for internal structure became readily apparent. Plus, around this time code began to be entered via terminals using full-screen editors, rather than on punched cards, which made it easier to view larger sections of code as you changed it.
Figure 6 – A mainframe with IBM 3278 CRT terminals attached
In 1972, Dahl, Dijkstra, and Hoare published Structured Programming, in which they suggested that computer programs should have complex internal structure with no GOTO statements, lots of subroutines, indented code, and many comment statements. During the Structured Period, these structured programming techniques were adopted by the IT community, and the GOTO statements were replaced by subroutines, also known as functions(), and indented code with lots of internal structure, like the eukaryotic structure of modern cells that appeared about 1,500 million years ago. Eukaryotic cells are found in the bodies of all complex organisms, from single-celled yeasts to you and me, and divide up cell functions amongst a collection of organelles (subroutines), such as mitochondria, chloroplasts, Golgi bodies, and the endoplasmic reticulum.
Figures 3 and 7 compare the simple internal structure of a typical prokaryotic bacterium with the internal structure of eukaryotic plant and animal cells. These eukaryotic cells could be simple single-celled plants and animals or they could be found within a much larger multicellular organism consisting of trillions of eukaryotic cells. Figures 3 and 7 are a bit deceiving, in that eukaryotic cells are huge cells that are more than 20 times larger in diameter than a typical prokaryotic bacterium with about 10,000 times the volume. Because eukaryotic cells are so large, they have an internal cytoskeleton, composed of linear shaped proteins that form filaments that act like a collection of tent poles, to hold up the huge cell membrane encircling the cell.
Eukaryotic cells also have a great deal of internal structure, in the form of organelles, that are enclosed by internal cell membranes. Like the structured programs of the 1970s and 1980s, eukaryotic cells divide up functions amongst these organelles. These organelles include the nucleus to store and process the genes stored in DNA, mitochondria to perform the Krebs cycle to create ATP from carbohydrates, and chloroplasts in plants to produce energy rich carbohydrates from water, carbon dioxide, and sunlight.
Figure 7 – Plants and animals are composed of eukaryotic cells with much internal structure
The introduction of structured programming techniques in the early 1970s caused a mass extinction of unstructured programs, similar to the Permian-Triassic (P-T) mass extinction, or the Great Dying, 250 million years ago that divided the Paleozoic from the Mesozoic in the stratigraphic column and resulted in the extinction of about 90% of the species on Earth. As programmers began to write new code using the new structured programming paradigm, older code that was too difficult to rewrite in a structured manner remained as legacy “spaghetti code” that slowly fossilized over time in production. Like the Permian-Triassic (P-T) mass extinction, the mass extinction of unstructured code in the 1970s was more like a greenhouse mass extinction than an impactor mass extinction because it spanned nearly an entire decade, and it was also a rather complete mass extinction that wiped out most unstructured application code.
Below is a code snippet from a fossil COBOL program listed in a book published in 1975. Notice the structured programming use of indented code and calls to subroutines with PERFORM statements.
    OPEN INPUT FILE-1, FILE-2
    PERFORM MATCH-CHECK UNTIL ACCT-NO OF REC-1 = HIGH-VALUES.
    CLOSE FILE-1, FILE-2.
    IF ACCT-NO OF REC-1 < ACCT-NO OF REC-2
    IF ACCT-NO OF REC-1 > ACCT-NO OF REC-2
        DISPLAY REC-2, 'NO MATCHING ACCT-NO'
    PERFORM READ-FILE-2-RTN UNTIL ACCT-NO OF REC-1
        NOT EQUAL TO ACCT-NO OF REC-2
When I encountered my very first structured FORTRAN program in 1975, I diligently “fixed” the program by removing all the code indentations! You see, in those days we rarely saw the entire program on a line printer listing because that took a compile of the program to produce and wasted valuable computer time, which was quite expensive back then. When I provided an estimate for a new system back then, I figured 25% for programming manpower, 25% for overhead charges from other IT groups on the project, and 50% for compiles. So instead of working with a listing of the program, we generally flipped through the card deck of the program to do debugging. Viewing indented code in a card deck can give you a real headache, so I just “fixed” the program by making sure all the code started in column 7 of the punch cards as it should!
Object-Oriented Period (1992 – Present)
During the Object-Oriented Period, programmers adopted a multicellular organization for software, in which programs consisted of many instances of objects (cells) that were surrounded by membranes studded with exposed methods (membrane receptors).
The following discussion might be a little hard to follow for readers with a biological background, but with little IT experience, so let me define a few key concepts with their biological equivalents.
Class – Think of a class as a cell type. For example, the class Customer is a class that defines the cell type of Customer and describes how to store and manipulate the data for a Customer, like firstName, lastName, address, and accountBalance. For example, a program might instantiate a Customer object called “steveJohnston”.
Object – Think of an object as a cell. A particular object will be an instance of a class. For example, the object steveJohnston might be an instance of the class Customer, and will contain all the information about my particular account with a corporation. At any given time, there could be many millions of Customer objects bouncing around in the IT infrastructure of a major corporation’s website.
Instance – An instance is a particular object of a class. For example, the steveJohnston object would be a particular instance of the class Customer, just as a particular red blood cell would be a particular instance of the cell type RedBloodCell. Many times programmers will say things like “This instantiates the Customer class”, meaning it creates objects (cells) of the Customer class (cell type).
Method – Think of a method() as a biochemical pathway. It is a series of programming steps or “lines of code” that produce a macroscopic change in the state of an object (cell). The class for each type of object defines the data for the class, like firstName, lastName, address, and accountBalance, but it also defines the methods() that operate upon these data elements. Some methods() are public, while others are private. A public method() is like a receptor on the cell membrane of an object (cell). Other objects (cells) can send a message to the public methods of an object (cell) to cause it to execute a biochemical pathway within the object (cell). For example, steveJohnston.setFirstName(“Steve”) would send a message to the steveJohnston object instance (cell) of the Customer class (cell type) to have it execute the setFirstName method() to change the firstName of the object to “Steve”. The steveJohnston.getAccountBalance() method would return my current account balance with the corporation. Objects also have many internal private methods() that are biochemical pathways not exposed to the outside world. For example, the calculateAccountBalance() method could be an internal method that adds up all of my debits and credits and updates the accountBalance data element within the steveJohnston object, but this method cannot be called by other objects (cells) outside of the steveJohnston object (cell). External objects (cells) have to call steveJohnston.getAccountBalance() in order to find out my accountBalance.
Line of Code – This is a single statement in a method() like:
discountedTotalCost = (totalHours * ratePerHour) - costOfNormalOffset;
Remember methods() are the equivalent of biochemical pathways and are composed of many lines of code, so each line of code is like a single step in a biochemical pathway. Similarly, each character in a line of code can be thought of as an atom, and each variable as an organic molecule. Each character can be in one of 256 ASCII quantum states defined by 8 quantized bits, with each bit in one of two quantum states “1” or “0”, which can also be characterized as 8 electrons in a spin up ↑ or spin down ↓ state.
C = 01000011 = ↓ ↑ ↓ ↓ ↓ ↓ ↑ ↑
H = 01001000 = ↓ ↑ ↓ ↓ ↑ ↓ ↓ ↓
N = 01001110 = ↓ ↑ ↓ ↓ ↑ ↑ ↑ ↓
O = 01001111 = ↓ ↑ ↓ ↓ ↑ ↑ ↑ ↑
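The table above can be regenerated mechanically. The following is a minimal Java sketch, not anything from the original posting: the class name SpinStates and the helpers toBits() and toSpins() are hypothetical names chosen for illustration. It pads each character code to 8 bits and maps "1" to a spin-up arrow and "0" to a spin-down arrow.

```java
// Hypothetical helper class: renders a character as 8 bits and as "spin" arrows.
class SpinStates {
    // Returns the low 8 bits of the character as a zero-padded binary string.
    static String toBits(char c) {
        String bits = Integer.toBinaryString(c & 0xFF);
        return "0".repeat(8 - bits.length()) + bits;
    }

    // Maps each bit to an arrow: '1' -> up (U+2191), '0' -> down (U+2193).
    static String toSpins(char c) {
        StringBuilder spins = new StringBuilder();
        for (char bit : toBits(c).toCharArray()) {
            if (spins.length() > 0) {
                spins.append(' ');
            }
            spins.append(bit == '1' ? '\u2191' : '\u2193');
        }
        return spins.toString();
    }

    public static void main(String[] args) {
        // The four "atoms" used in the table above.
        for (char c : "CHNO".toCharArray()) {
            System.out.println(c + " = " + toBits(c) + " = " + toSpins(c));
        }
    }
}
```

Whether the arrows display correctly depends on the console's character encoding, which is why the sketch uses Unicode escapes rather than literal arrow characters in the source.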
Developers (programmers) have to assemble characters (atoms) into organic molecules (variables) to form the lines of code that define a method() (biochemical pathway). As in carbon-based biology, the slightest error in a method() can cause drastic and usually fatal consequences. Because there is nearly an infinite number of ways of writing code incorrectly and only a very few ways of writing code correctly, there is an equivalent of the second law of thermodynamics at work. This simulated second law of thermodynamics and the very nonlinear macroscopic effects that arise from small coding errors is why software architecture has converged upon Life’s Solution. With these concepts in place, we can now proceed with our comparison of the evolution of software and carbon-based life on Earth.
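For readers who would like to see the Class, Object, Instance, and Method definitions above in one place, here is a minimal Java sketch of the Customer example. It is only an illustration under stated assumptions: the postTransaction() method and the list of transactions are hypothetical additions needed to give calculateAccountBalance() something to add up, and the CustomerDemo class exists only to drive the example.

```java
import java.util.ArrayList;
import java.util.List;

// Demo driver: plays the role of an external object (cell) sending messages.
class CustomerDemo {
    public static void main(String[] args) {
        Customer steveJohnston = new Customer();   // instantiate the class (cell type)
        steveJohnston.setFirstName("Steve");       // message to a public method (receptor)
        steveJohnston.postTransaction(100.0);      // hypothetical credit
        steveJohnston.postTransaction(-40.0);      // hypothetical debit
        // steveJohnston.calculateAccountBalance(); // would not compile: the method is private
        System.out.println(steveJohnston.getFirstName() + " " + steveJohnston.getAccountBalance());
    }
}

// The class (cell type): defines the data and the methods that operate on it.
class Customer {
    private String firstName;
    private double accountBalance;
    // Assumption for this sketch: credits are positive amounts, debits negative.
    private final List<Double> transactions = new ArrayList<>();

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getFirstName() {
        return firstName;
    }

    // Hypothetical public method that records a debit or credit.
    public void postTransaction(double amount) {
        transactions.add(amount);
        calculateAccountBalance();
    }

    public double getAccountBalance() {
        return accountBalance;
    }

    // Internal pathway: not callable from outside the class.
    private void calculateAccountBalance() {
        double total = 0.0;
        for (double amount : transactions) {
            total += amount;
        }
        accountBalance = total;
    }
}
```

Running CustomerDemo prints "Steve 60.0". The commented-out line marks the membrane: code outside the Customer class can only reach the balance through the public getAccountBalance() receptor.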
Object-oriented programming actually started in the 1960s with Simula, the first language to use the concept of merging data and functions into objects defined by classes, but object-oriented programming did not really catch on until nearly 30 years later:
1962 - 1965 Dahl and Nygaard develop the Simula language
1972 - Smalltalk language developed
1983 - 1985 Stroustrup develops C++
1995 - Sun announces Java at SunWorld '95
Similarly, multicellular organisms first appeared about 900 million years ago, but it took about another 400 million years, until the Cambrian, for multicellularity to catch on as well. Multicellular organisms consist of huge numbers of cells which send messages between cells (objects) by secreting organic molecules that bind to the membrane receptors on other cells and induce those cells to execute exposed methods. For example, your body consists of about 100 trillion independently acting eukaryotic cells, and not a single cell in the collection knows that the other cells even exist. In an object-oriented manner, each cell just responds to the organic molecules that bind to its membrane receptors, and in turn, sends out its own set of chemical messages that bind to the membrane receptors of other cells in your body. When you wake to the sound of breaking glass in the middle of the night, your adrenal glands secrete the hormone adrenaline (epinephrine) into your bloodstream, which binds to the getScared() receptors on many of your cells. In an act of object-oriented polymorphism, your liver cells secrete glucose into your bloodstream, and your heart cells contract harder, when their getScared() methods are called.
Figure 8 – Multicellular organisms consist of a large number of eukaryotic cells, or objects, all working together
These object-oriented languages use the concepts of encapsulation, inheritance, and polymorphism, which are very similar to the multicellular architecture of large organisms.
Objects are contiguous locations in memory that are surrounded by a virtual membrane that cannot be penetrated by other code and are similar to an individual cell in a multicellular organism. The internal contents of an object can only be changed via exposed methods (like subroutines), similar to the receptors on the cellular membranes of a multicellular organism. Each object is an instance of an object class, just as individual cells are instances of a cell type. For example, an individual red blood cell is an instance object of the red blood cell class.
Cells inherit methods in a hierarchy of human cell types, just as objects form a class hierarchy of inherited methods in a class library. For example, all cells have the metabolizeSugar() method, but only red blood cells have the makeHemoglobin() method. Below is a tiny portion of the 210 known cell types of the human body arranged in a class hierarchy.
Human Cell Classes
2. Connective Tissue
   A. Vascular Tissue
      - Red Blood Cells
   B. Proper Connective Tissue
A chemical message sent from one class of cell instances can produce an abstract behavior in other cells. For example, adrenal glands can send the getScared() message to all cell instances in your body, but all of the cell instances getScared() in their own fashion. Liver cells release glucose and heart cells contract faster when their getScared() methods are called. Similarly, when you call the print() method of a report object, you get a report, and when you call the print() method of a map, you get a map.
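The inheritance and polymorphism described above can be sketched in a few lines of Java. This is only an illustrative model: the class names Cell, LiverCell, HeartCell, and RedBloodCell, the strings they return, and the default "no special response" behavior are assumptions made for the sketch, while the method names getScared(), metabolizeSugar(), and makeHemoglobin() come from the text.

```java
import java.util.List;

// Demo driver: the adrenal glands broadcasting getScared() to every cell.
class CellDemo {
    public static void main(String[] args) {
        List<Cell> body = List.of(new LiverCell(), new HeartCell(), new RedBloodCell());
        for (Cell cell : body) {
            // Same message, different behavior per cell type (polymorphism).
            System.out.println(cell.getClass().getSimpleName() + ": " + cell.getScared());
        }
    }
}

// Base cell type: every cell inherits metabolizeSugar().
class Cell {
    String metabolizeSugar() {
        return "ATP";
    }

    // Assumed default response for cell types that do not override it.
    String getScared() {
        return "no special response";
    }
}

class LiverCell extends Cell {
    @Override
    String getScared() {
        return "release glucose";
    }
}

class HeartCell extends Cell {
    @Override
    String getScared() {
        return "contract faster";
    }
}

// Only this cell type adds makeHemoglobin(); it inherits everything else.
class RedBloodCell extends Cell {
    String makeHemoglobin() {
        return "hemoglobin";
    }
}
```

The caller never checks which kind of cell it is talking to. It sends the same getScared() message to each one, and each cell type responds in its own fashion, just as print() produces a report for a report object and a map for a map object.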
Figure 9 – Objects are like cells in a multicellular organism that exchange messages with each other
The object-oriented revolution, enhanced by the introduction of Java in 1995, caused another mass extinction within IT as structured procedural programs began to be replaced by object-oriented C++ and Java programs, like the Cretaceous-Tertiary extinction 65 million years ago that killed off the dinosaurs, presumably caused by a massive asteroid strike upon the Earth.
Below is a code snippet from a fossil C++ program listed in a book published in 1995. Notice the object-oriented programming technique of using a class specifier to define the data and methods() of objects instantiated from the class. Notice that the PurchasedPart class inherits code from the more generic Part class. In both C++ and Java, variables and methods that are declared private can only be used by code inside the class that declares them, while public methods can be called by other objects to cause an object to perform a certain function, so public methods are very similar to the functions that the cells in a multicellular organism perform when organic molecules bind to the membrane receptors of their cells. Later in this posting we will describe in detail how multicellular organisms use this object-oriented approach to isolate functions.
class PurchasedPart : public Part
{
public:
    PurchasedPart(int pNum, char* desc);
    void setPart(int pNum, char* desc);
};
PurchasedPart Nut(1, "Brass");
Like the geological eras, the Object-Oriented Period got a kick-start from an environmental hardware change. In the early 1990s, the Distributed Computing Revolution hit with full force, which spread processing over a number of servers and client PCs, rather than relying solely on mainframes to do all the processing. It began in the 1980s with the introduction of PCs into the office to do stand-alone things like word processing and spreadsheets. The PCs were also connected to mainframes as dumb terminals through emulator software as shown in Figure 6 above. In this architectural topology, the mainframes still did all the work and the PCs just displayed CICS green screens like dumb terminals. But this at least eliminated the need to have both an IBM 3278 terminal and a PC on a person’s desk, which would have left very little room for anything else! But this architecture wasted all the computing power of the rapidly evolving PCs, so the next step was to split the processing load between the PCs and a server. This was known as the 2-tier client/server or “thick client” architecture (Figure 10). In 2-tier client/server, the client PCs ran the software that displayed information in a GUI like Windows 3.0 and connected to a server running RDBMS (Relational Database Management System) software like Oracle or Sybase that stored the common data used by all the client PCs. This worked great so long as the number of PCs remained under about 30. We tried this at Amoco in the early 1990s, and it was like painting the Eiffel Tower. As soon as we got the 30th PC working, we had to go back and fix the first one! It was just too hard to keep the “thick client” software up and running on all those PCs with all the other software running on them that varied from machine to machine.
These problems were further complicated by the rise of computer viruses in the mid-1980s. Prior to the 2-tier client/server architecture, many office PCs were standalone machines, only connected to mainframes as dumb terminals, and thus totally isolated machines safe from computer virus infection. In the PC topology of the 1980s, computer viruses could only spread via floppy disks, which severely limited their infection rates. But once the 2-tier architecture fell into place, office PCs began to be connected together via LANs (Local Area Networks) and WANs (Wide Area Networks) to share data and other resources like printers. This provided a very friendly environment for computer viruses to quickly spread across an entire enterprise, so the other thing that office PCs began to share was computer viruses. Like the rest of you, I now spend about $40 per year with my favorite anti-virus vendor to protect my own home PC, and every Friday, my corporate laptop does its weekly scan, during which I suffer very sluggish response time the entire day. Over the years, I have seen these weekly scans elongate in time, as more and more viruses must be scanned for. My weekly scans have gone from about an hour, ten years ago, to nearly 7 hours today. It seems that even Intel cannot increase processor speeds as fast as new parasitic forms of software emerge! At this rate, in ten years my laptop will take a full week to run its weekly scan! Computer viruses are purely parasitic forms of software, which will be more fully covered in future postings on Self-Replicating Information and Software Symbiogenesis.
The limitations of the 2-tier architecture led to the 3-tier model in the mid to late 1990s with the advent of “middleware” (Figure 10). Middleware is software that runs on servers between the RDBMS servers and the client PCs. In the 3-tier architecture, the client PCs run “thin client” software that primarily displays information via a GUI like Windows. The middleware handles all the business logic and relies on the RDBMS servers to store data.
Figure 10 – The Distributed Computing Revolution aided object-oriented architecture
In the late 1990s, the Internet exploded upon the business world and greatly enhanced the 3-tier model (Figure 10). The “thin client” running on PCs now became a web browser like Internet Explorer. Middleware containing business logic was run on Application servers that produced dynamic web pages that were dished up by Web servers like Apache. Data remained back on mainframes or RDBMS servers. Load balancers were also used to create clusters of servers that could scale load. As your processing load increased, all you had to do was buy more servers for each tier in the architecture to support the added load. This opened an ecological niche for the middleware software that ran on the Appserver tier of the architecture. At the time, people were coming up with all sorts of crazy ways to create dynamic HTML web pages on the fly. Some people were using Perl scripts, while others used C programs, but these all required a new process to be spawned each time a dynamic web page was created and that was way too much overhead. Then Java came crashing down like a 10-kilometer-wide asteroid! Java, Java, Java – that’s all we heard after it hit in 1995. Java was the first object-oriented programming language to take IT by storm. The syntax of Java was very nearly the same as C++, without all the nasty tricky things like pointers that made C++ and C so hard to deal with. C++ had evolved from C in the 1980s, and nearly all computer science majors had cut their programming teeth on C or C++ in school, so Java benefited from a large population of programmers familiar with the syntax. The end result was a mass extinction of non-Java-based software on the distributed computing platform and the rapid rise of Java-based applications, much like an impactor mass extinction. Even Microsoft went Object-Oriented on the Windows server platform with its .NET Framework using its Java-like C# language.
Procedural, non-object-oriented software like COBOL sought refuge in the mainframes, where it still hides today.
Figure 11 – A modern multi-tier website topology
SOA - Service Oriented Architecture Period (2004 – Present)
Currently we are entering the Service Oriented Architecture (SOA) Period, which is very similar to the Cambrian Explosion. During the Cambrian Explosion, 541 million years ago, complex body plans first evolved, which allowed cells in multicellular organisms to make RMI (Remote Method Invocation) and CORBA (Common Object Request Broker Architecture) calls upon the cells in remote organs to accomplish biological purposes. In the Service Oriented Architecture Period, we are using common EJB components in J2EE appservers to create services that allow for Applications with complex body plans. The J2EE appservers perform the functions of organs like kidneys, lungs and livers. I am discounting the original appearance of CORBA in 1991 here as a failed precursor, because CORBA never became as ubiquitous as EJB seems to be becoming. In the evolution of any form of self-replicating information, there are frequently many failed precursors leading up to a revolution in technology.
There is a growing body of evidence beginning to support the geological "Snowball Earth" hypothesis that the Earth went through a period of 100 million years of extreme climatic fluctuations just prior to the Cambrian Explosion. During this period, the Earth seesawed between being completely covered with a thick layer of ice and being a hot house with a mean temperature of 140 °F. Snowball Earth (2003) by Gabrielle Walker is an excellent book covering the struggles of Paul Hoffman, Joe Kirschvink, and Dan Schrag to uncover the evidence for this dramatic discovery and to convince the geological community of its validity. It has been suggested that the resulting stress on the Earth's ecosystems sparked the Cambrian Explosion. As we saw above, for the great bulk of geological time, the Earth was dominated by simple single-celled organisms. The nagging question for evolutionary biology has always been why did it take several billion years for complex multicellular life to arise, and why did it arise all at once in such a brief period of geological time? Like our landfill example above, as a field geologist works up from pre-Cambrian to Cambrian strata, suddenly the rocks burst forth with complex fossils where none existed before. In the Cambrian, it seems like beer cans appeared from nothing, with no precursors. For many, the first appearance of complex life just following the climatic upheaval of the Snowball Earth is compelling evidence that these two unique incidents in the Earth’s history must be related.
Similarly for IT, the nagging question is why did it take until the first decade of the 21st century for the SOA Cambrian Explosion to take place, when the first early precursors can be found as far back as the mid-1960s? After all, software based upon multicellular organization, also known as object-oriented software, goes all the way back to the object-oriented language Simula developed in 1965, and the ability for objects (cells) to communicate between CPUs arose with CORBA in 1991. So all the precursors were in place nearly 20 years ago, yet software based upon a complex multicellular architecture languished until it was jarred into existence by a series of harsh environmental shocks to the IT community. It was the combination of moving off the mainframes to a distributed hardware platform, running on a large number of servers and client PCs, the shock of the Internet upon the business world and IT, and the impact of Sun’s Java programming language, that ultimately spawned the SOA (Service Oriented Architecture) Cambrian Explosion we see in IT today. These shocks all occurred within a few years of each other in the 1990s, and after the dust settled, IT found itself in a new world of complexity.
Today, Service Oriented Architecture is rapidly expanding in the IT community and is beginning to expand beyond the traditional confines of corporate datacenters, as corporations begin to make services available to business partners over the Internet. With the flexibility of Service Oriented Architecture and the Internet, we are beginning to see an integrated service-oriented ecology evolve - a web of available services like the web of life in a rain forest.
To see how this works, let’s examine more closely the inner workings of a J2EE Appserver. Figure 12 shows the interior of a J2EE Appserver like WebSphere. The WebSphere middleware is software that runs on a Unix server which might host 30 or more WebSphere Appserver instances, and there might be many physical Unix servers running these WebSphere Appserver instances in a Cell (Tier). Figure 11 shows a Cell (Tier 2) consisting of two physical Application servers or nodes, but there could easily be 4 or 5 physical Unix servers or nodes in a WebSphere Cell. This allows WebSphere to scale: as your load increases, you just add more physical Unix servers or nodes to the Cell. So each physical Unix server in a WebSphere Cell contains a number of software Appserver instances as shown in Figure 11, and each Appserver contains a number of WebSphere Applications which do things like create dynamic web pages for a web-based application. For example, on the far left of Figure 12 we see a client PC running a web browser like Internet Explorer. The web browser makes HTTP requests to an HTTP webserver like Apache. If the Apache webserver can find the requested HTML page, like a login page, it returns that static HTML page to the browser for the end-user to fill in his ID and PASSWORD. The user’s ID and PASSWORD are then returned to the Apache webserver when the SUBMIT button is pressed, but now the Apache webserver must come up with an HTML page that is specific for the user’s ID and PASSWORD, like a web page with the end-user’s account information. That is accomplished by having Apache forward the request to a WebSphere Application running in one of the WebSphere Appservers. The WebSphere Appserver has two software containers that perform the functions of an organ in a multicellular organism. The Web Container contains instances of servlets and JSPs (Java Server Pages). A servlet is a Java program which contains logic to control the generation of a dynamic web page.
JSPs are HTML pages with tags for embedded programming logic that are compiled into servlets at execution time. The servlets in the Web Container create objects and are run in a thread pool in the Web Container, like the cells in a liver or kidney. Unlike the mainframe processing of the Unstructured Period, in which a program was loaded into memory, run, and then perished, these servlets remain in memory and are continuously reused by the thread pool to service additional requests, until no further requests arrive and the servlet is destroyed to make room for another servlet in the thread pool. The EJB Container performs a similar function by running EJBs (Enterprise JavaBeans) in a thread pool. The EJBs provide business logic and connect to databases (DB) and mainframes (EIS – Enterprise Information Systems). By keeping the servlets and EJBs running continuously in memory, with permanent connections to databases and mainframes via connection pools, the overhead of loading and releasing the servlets is eliminated, as well as the creation and tear-down of connections to databases and mainframes. So the Web and EJB Containers of a J2EE Appserver are very much like the cells in an organ which continuously provide services for the other cells of a multicellular organism. Look at it this way: unlike a simple single-celled organism that is born, lives, and dies, your body consists of 100 trillion cells, and each day about a trillion cells die and are replaced by a trillion new cells, but through it all you keep going. A simple single-celled organism is like a batch program from the Unstructured Period, while your body runs on a SOA architecture of trillions of cells in thread and connection pools that are constantly coming and going and creating millions of objects that are instantiated, used, and later destroyed.
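The reuse of a small pool of long-lived workers, rather than spawning something new for every request, can be illustrated with the standard java.util.concurrent thread pool. This is a minimal sketch and not how WebSphere itself is implemented: the class name ThreadPoolSketch, the method countWorkersUsed(), and the pool size and request count are all hypothetical choices for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ThreadPoolSketch {
    // Submits `requests` small tasks to a fixed pool and returns how many
    // distinct worker threads actually served them.
    static int countWorkersUsed(int poolSize, int requests) {
        Set<String> workersSeen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < requests; i++) {
            // Each task stands in for one request handled by a pooled servlet thread.
            pool.execute(() -> workersSeen.add(Thread.currentThread().getName()));
        }
        pool.shutdown(); // accept no new work, but finish what was queued
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return workersSeen.size();
    }

    public static void main(String[] args) {
        int workers = countWorkersUsed(3, 50);
        System.out.println("50 requests were served by " + workers + " pooled thread(s)");
    }
}
```

However many requests arrive, the number of worker threads never exceeds the pool size, because each thread goes back to the pool after a request and picks up the next one. This is the same economy that the Web and EJB Containers get by keeping servlets and EJBs resident in memory.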
Figure 12 – Middleware running in a J2EE Application Server
Design Patterns – the Phyla of IT
Another outgrowth of the object-oriented programming revolution was the adoption of design patterns by IT. Design patterns originated as an architectural concept developed by Christopher Alexander in the 1960s. In Notes on the Synthesis of Form (1964), Alexander noted that all architectural forms are really just implementations of a small set of classic design patterns that have withstood the test of time in the real world of human affairs and that have been blessed by the architectural community throughout history for both beauty and practicality. Basically, given the physical laws of the Universe and the morphology of the human body, there are really only a certain number of ways of doing things from an architectural point of view that work in practice, so by trial and error architects learned to follow a set of well established architectural patterns. In 1987, Kent Beck and Ward Cunningham began experimenting with the idea of applying the concept of design patterns to programming and presented their results at the object-oriented OOPSLA conference that year. Design patterns gained further popularity in computer science after the book Design Patterns: Elements of Reusable Object-Oriented Software was published in 1994 by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Also in 1994, the first Pattern Languages of Programming Conference was held, and in 1995 the Portland Pattern Repository was established to document design patterns for general IT usage.
However, the concept of design patterns goes back much further than this. In biology a design pattern is called a phylum, which is a basic body plan. For example, the phylum Arthropoda consists of all body plans that use an external skeleton such as the insects and crabs, and the Echinodermata have a five-fold radial symmetry like a starfish. Similarly, the phylum Chordata consists of all body plans that have a large dorsal nerve running down a hollow backbone or spinal column. The Cambrian Explosion, 541 million years ago, brought about the first appearance of a large number of phyla or body plans on Earth. In fact, all of the 35 phyla currently found on the Earth today can trace their roots back to the Cambrian, and it even appears that some of the early Cambrian phyla have gone completely extinct, judging by some of the truly bizarre-looking fossils that have been found in the Burgess shale of the highly experimental Cambrian period.
In IT a design pattern describes a certain design motif or way of doing things. A design pattern is a prototypical design architecture that developers can copy and adapt for their particular application to solve the general problem described by the design pattern. This is in recognition of the fact that at any given time there are only a limited number of IT problems that need to be solved at the application level, and it makes sense to apply a general design pattern rather than to reinvent the wheel each time. Developers can use a design pattern by simply adopting the common structure and organization of the design pattern for their particular application, just as living things adopt an overall body plan or phylum to solve the basic problems of existence. In addition, design patterns allow developers to communicate with each other using well-known and well understood names for software interactions, just as biologists can communicate with each other by using the well-known taxonomic system of classification developed by Carl Linnaeus in Systema Naturae published in 1735.
A design pattern that all Internet users should be quite familiar with is the Model-View-Controller (MVC) design pattern used by most web applications. Suppose you are placing an order with Amazon. The Model is the data that comprises your Amazon account information, such as your credit card number on file and your mailing address, together with all the items in your shopping cart. In Figure 12 above, the Model is stored on a relational database server DB, such as an Oracle server, or back on a mainframe in an EIS (Enterprise Information System) connected to a mainframe DB2 database as a series of relational database tables. The View is the series of webpages presented to your browser as .html pages that convey the Model data to you in a sensible form as you go about your purchase. These View .html pages are generated by JSPs (Java Server Pages) in the web container of the J2EE Appserver. The Controller is a servlet, a Java program running in a thread pool in the web container of the J2EE Appserver, that performs the overall control of your interactions with the Amazon application as you go about placing your order. The Controller servlet calls JSPs and instantiates objects (cells) that call EJB objects (cells) in the EJB container of the J2EE Appserver that interact with the relational database tables storing your data.
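Stripped of servlets, JSPs, and databases, the division of labor in MVC can be sketched in plain Java. This is a toy illustration only: the class names CartModel, CartView, and CartController, and the handleAddItem() method, are hypothetical, and a real J2EE application would keep the Model in a database and generate the View with JSPs as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Demo driver: plays the role of the browser sending two requests.
class MvcDemo {
    public static void main(String[] args) {
        CartController controller = new CartController(new CartModel(), new CartView());
        controller.handleAddItem("book");
        System.out.println(controller.handleAddItem("lamp"));
    }
}

// Model: the data, here just the items in a shopping cart held in memory.
class CartModel {
    private final List<String> items = new ArrayList<>();

    void addItem(String item) {
        items.add(item);
    }

    List<String> getItems() {
        return items;
    }
}

// View: turns Model data into HTML; it never changes the Model.
class CartView {
    String render(CartModel model) {
        StringBuilder html = new StringBuilder("<ul>");
        for (String item : model.getItems()) {
            html.append("<li>").append(item).append("</li>");
        }
        return html.append("</ul>").toString();
    }
}

// Controller: receives a request, updates the Model, and picks the View to return.
class CartController {
    private final CartModel model;
    private final CartView view;

    CartController(CartModel model, CartView view) {
        this.model = model;
        this.view = view;
    }

    String handleAddItem(String item) {
        model.addItem(item);
        return view.render(model);
    }
}
```

The second call in MvcDemo prints <ul><li>book</li><li>lamp</li></ul>. Because the three roles only touch each other through a few methods, the View could be swapped for a different page layout, or the Model moved into a database, without rewriting the Controller.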
It has taken the IT community nearly 60 years to develop a Service Oriented Architecture based upon multicellular organization. This was achieved through a slow evolutionary process via innovation and natural selection performed by millions of independently acting programmers. Granted, this occurred much faster than the three billion years nature took to come up with the same architecture, but we could have done this back in the 1960s if we had known better – after all, the object-oriented language Simula was developed in 1965. Softwarephysics proposes that we use concepts from biology to skip to solutions directly.
Now let’s dig a little deeper and examine biological and computer software at a lower biochemical level. Before getting into the biochemistry of living things, let's briefly review the softwarechemistry of computer software. Computer software is composed of program statements, and program statements are composed of characters. Every computer language has a syntax - a set of allowed statement formats such as if, for, while which are used to perform functions. Programmers are faced with the job of assembling the 256 ASCII characters into valid program statements called source code and then sequencing the program statements in the proper order to create a routine or method to perform a given function. Routines are combined to form programs, and methods are combined to form objects, and both are combined to form applications or systems. Creating computer software is very difficult because a single erroneous character in a system composed of hundreds of thousands of lines of code can cause grave consequences.
Living things are faced with much the same problem. For this discussion, we will consider proteins to be biological software. Like computer software, proteins are composed of amino acids (program statements) which are in turn composed of atoms (characters). To form an amino acid, one combines an amino group containing nitrogen N with a carboxyl group COOH and attaches an arbitrary side group “R” to the amino acid backbone.
    H  H  O
    |  |  ||
  H-N--C--C-OH
       |
       R
A very large number of amino acids can be formed in this way, just as a very large number of program statements can be formed from 256 characters. Every programming language has a valid syntax, and nature follows one also. Of all the possible amino acids, all living things use the same 20 amino acids to build proteins. These 20 amino acids are then assembled into a chain of amino acids called a polypeptide chain. The polypeptide chain then folds up into a protein on its own to minimize the free energy of the polypeptide chain.
Biological hardware and computer hardware also share some similarities. Biological hardware is based upon the energy transitions of molecules containing carbon atoms called organic molecules, while computer hardware is currently based upon the energy transitions of electrons in silicon crystals known as integrated circuit chips. Carbon and silicon are very similar atoms. Silicon lies directly beneath carbon in the Periodic Table because both elements have four electrons in their outer shell and are also missing four electrons in their outer shell. The four missing electrons allow carbon to bind to many other atoms to form molecules rich in energy or information, and the four missing electrons of silicon make it a semiconductor, which means that silicon can be switched from a conducting state to a nonconducting state under certain circumstances to form a transistor. Transistors are currently used for the high speed switches required by the logic gates of computers. The binding energy of carbon to other atoms is just right – not too strong and not too weak, just enough to keep organic molecules together, but not too tightly. The binding energy of silicon to other atoms, like oxygen, is much too strong for living things, and that is why silicon is good for making rocks, while carbon is good for making squishy living things.
In SoftwareChemistry, we saw that carbon was a unique atom, in that it could form sp, sp2, and sp3 molecular orbitals to bind with other atoms. This allows carbon to form linear molecules using sp bonds, triangular sheet-like molecules using sp2 bonds, and tetrahedral-shaped molecules using sp3 bonds with two, three, or four other atoms. Similarly, silicon Si can combine with oxygen O to form a tetrahedral-shaped ionic group called silicate SiO4, which has a charge of negative four (-4), and which looks very much like the methane molecule shown in Figure 6 of SoftwareChemistry, only with silicon Si at the center bound to four oxygen O atoms. The negatively charged SiO4 tetrahedrons combine with positively charged cations of iron, magnesium, aluminum, calcium, and potassium to form silicate minerals which form about 90% of the Earth’s crust. Just like carbon, these silicate minerals can form very complicated 3-dimensional structures of repeating tetrahedra that form single chains, double chains, rings and sheets. Some have suggested that it might be possible for alien forms of life to be silicon-based, rather than carbon-based, but because of the high binding energy of silicate minerals, silicon-based chemical reactions are probably much too slow for silicon-based life forms to have evolved. However, in the Software Universe, we do find that silicon-based life has arisen, not based upon the chemical characteristics of silicon, but based upon the semiconductor characteristics of silicon.
Biological hardware uses 4 classes of organic molecules – carbohydrates, lipids, proteins, and nucleic acids.
Carbohydrates - These molecules are rich in energy and provide the energy required to overcome the second law of thermodynamics in biological processes. Living things degrade the high grade, low entropy, chemical energy stored in carbohydrates into high entropy heat energy in the process of building large complex molecules to carry out biological processes. The end result is the dumping of structural entropy into heat entropy. Similarly, computer hardware degrades high grade, low entropy, electrical energy into high entropy heat energy in order to process information.
Lipids - These molecules are also rich in energy, but certain kinds of lipids perform an even more important role. We saw in SoftwareChemistry that the water molecule H2O is a polar molecule, in that the oxygen atom O has a stronger attraction to the shared electrons of the molecule than the hydrogen atoms H, consequently, the O side of the molecule has a net negative charge, while the H side of the molecule has a net positive charge. Again, this all goes back to quantum mechanics and QED with the exchange of virtual photons that carry out the electromagnetic force. Similarly, the phosphate end of a phospholipid is also charged, and consequently, is attracted to the charged polar molecules of water, while the fatty tails of a phospholipid that carry no net charge, are not attracted to the charged polar molecules of water.
O <- Phosphate end of a phospholipid has a net electrical charge
|| <- The tails of phospholipid do not have a net electrical charge
This causes phospholipids to naturally form membranes, which are used to segregate biologically active molecules within cells and to isolate cells from their environment. Lipids perform a function similar to electrical insulation in computer hardware, or the class definition of an object in Java or C++. What happens is that the phospholipids form a bilayer with the electrically neutral tails facing each other, while the electrically charged phosphate ends are attracted to the polar water molecules inside and outside of the membrane. This very thin bilayer, only two molecules thick, is like a self-sealing soap bubble that naturally takes on a cell-like spherical shape. In fact, soap bubbles form from a similar configuration of molecules, but in the case of soap bubbles, the electrically neutral tails face outward towards the air inside and outside of the soap bubble, while the electrically charged ends face inwards towards a thin film of water. In both cases these configurations minimize the free energy of the molecules. Remember, free energy is the energy available to do work, and one expression of the second law of thermodynamics is that systems always try to minimize their free energy. That’s why a ball rolls down a hill; it is seeking a lower level of free energy. This is a key insight. Living things have learned through natural selection not to fight the second law of thermodynamics, as IT professionals routinely do. Instead, living things use the second law of thermodynamics to their advantage, by letting it construct complex structures by simply having molecules seek a configuration that minimizes their free energy. Biochemical reactions are like a rollercoaster ride. Instead of constantly driving a rollercoaster car around a hilly track, rollercoasters simply pump up the potential energy of the car by first dragging it up a large hill, and then they just let the second law of thermodynamics take over. The car simply rolls downhill to minimize its free energy.
Every so often, the rollercoaster might include some booster hills, where the car is again dragged up a hill to pump it up with some additional potential energy. Biochemical reactions do the same thing. Periodically they pump up the reaction with energy from degrading ATP into ADP. The ATP is created in the Krebs cycle displayed in Figure 10 of SoftwareChemistry. Then the biochemical reactions just let nature take its course via the second law of thermodynamics, to let things happen by minimizing the free energy of the organic molecules. Now that is the smart way to do things!
Below is a depiction of a biological membrane constructed of phospholipids that essentially builds itself when you throw a phospholipid into water. For example, if you throw the phospholipid lecithin, the stuff used to make chocolate creamy, into water it will naturally form spherical cell-like liposomes, consisting of a bilayer of lecithin molecules with water inside the liposome and outside as well.
Outside of a cell membrane there are polar water molecules “+” that attract the phosphate ends of phospholipids “O”:
Inside of a cell membrane there also are polar water molecules “+” that attract the phosphate ends of phospholipids “O”, resulting in a bilayer.
Proteins - These are the real workhorse molecules that perform most of the functions in living things. Proteins are large molecules, which are made by chaining together several hundred smaller molecules called amino acids. An amino acid contains an amine group on the left containing nitrogen and a carboxyl group COOH on the right. Attached to each amino acid is a side chain “R” which determines the properties of the amino acid. The properties of the side chain “R” depend upon the charge distributions of the atoms in the side chain.
    H  H  O
    |  |  ||
  H-N--C--C-OH
       |
       R
The amino acids can chain together into very long polypeptide chains by forming peptide bonds between the amine group of one amino acid and the carboxyl group of its neighbor, releasing a water molecule in the process.
There are 20 amino acids used by all living things to make proteins.
There are four types of proteins:
1. Structural Proteins - These are proteins used for building cell structures such as cell membranes or the keratins found in finger nails, hooves, scales, horns, beaks, and feathers. Structural proteins are similar to webserver static content, such as static HTML pages and .jpg and .gif graphics files that provide structure to a website, but do not perform dynamic processing of information. However, we shall see that some structural proteins in membranes do perform some logical operations.
2. Enzymes - These are proteins that do things. Enzymes perform catalytic functions that dramatically speed up the chemical processes of life. Enzymes are like message-driven or entity EJBs that perform information-based processes. Enzymes are formed from long polypeptide chains of amino acids, but when they fold up into a protein, they usually only have a very small active site, where the side chains of a handful of amino acids in the chain do all the work. As we saw in Figure 9 of SoftwareChemistry, the charge distribution of the amino acids in the active site can match up with the charge distribution of other specific organic molecules, forming a lock and key configuration, to either bust the organic molecules up into smaller molecules or paste them together into a larger organic molecule. Thus, you can think of enzymes as a text editor that can do “cut and paste” operations on source code. These “cut and paste” operations can proceed at rates as high as 500,000 transactions per second for each enzyme molecule.
3. Hormones - These are control proteins that perform logical operations. For example, the protein hormone insulin controls how your cells use glucose. Not all hormones are proteins, however; testosterone and estrogen, which in large doses dramatically change your physical appearance, are steroids built from lipids. Hormones are like session EJBs that provide logical operations.
4. Antibodies - These are security proteins which protect against invading organisms. These proteins can learn to bind to antigens, proteins embedded in the cell membranes of invading organisms, and then to destroy the invaders. Antibodies are like SSL or LDAP security software.
Now let us assemble some of the above components into a cell membrane. Please excuse the line printer graphics. I was brought up on line printer graphics, and it is hard to break old habits. Plus, it saves on bandwidth and decreases webpage load times. First we insert a structural protein “@” into the phospholipid bilayer. This structural protein is a receptor protein that has a special receptor socket that other proteins called ligands can plug into.
Below we see a ligand protein “L” approach the membrane receptor protein, and we also see an inactive enzyme “I” in the cytoplasm inside of the cell membrane:
When the ligand protein “L” plugs into the socket of the membrane protein receptor “@”, it causes the protein receptor “@” to change shape and open a socket on the inside of the membrane wall. The inactive enzyme plugs into the new socket and becomes an active enzyme “A”:
The activated enzyme “A” is then released to perform some task. This is how a message can be sent to a cell object by calling an exposed method() via the membrane protein receptor “@”. Again this is all accomplished via the electromagnetic interaction of the Standard Model with the exchange of virtual photons between all the polar molecules. In all cases, the second law of thermodynamics comes into play, as the ligand, membrane receptor protein, and inactive enzyme go through processes that minimize their free energy. When the ligand plugs into the membrane receptor protein, it causes the receptor protein to change its shape to minimize its free energy, opening a socket on the interior of the cell membrane. It’s like entering an elevator and watching all the people repositioning themselves to minimize their proximity to others.
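The receptor sequence above maps naturally onto an object with one public method. The following is a minimal sketch of that analogy only; the class `Cell`, its method names, and the single-character ligand codes are hypothetical and make no attempt to model real receptor chemistry.

```python
class Cell:
    """A cell as an object whose membrane receptor is its one exposed method."""

    def __init__(self):
        self._enzyme_active = False   # the inactive enzyme "I" in the cytoplasm
        self.tasks_done = []

    def receptor(self, ligand: str) -> bool:
        """Exposed method: only the matching ligand "L" fits the socket."""
        if ligand != "L":
            return False              # wrong shape, so nothing happens
        self._enzyme_active = True    # receptor changes shape, enzyme becomes "A"
        self._release_enzyme()
        return True

    def _release_enzyme(self):
        """Internal step hidden behind the membrane."""
        if self._enzyme_active:
            self.tasks_done.append("task performed by enzyme A")
            self._enzyme_active = False


cell = Cell()
accepted = cell.receptor("L")   # the matching ligand triggers the internal work
rejected = cell.receptor("X")   # any other molecule is ignored
```

As with the membrane, callers never touch the internal state directly; they can only present a ligand to the receptor and let the cell do the rest.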
Another important role of membranes is their ability to selectively allow certain ions to pass into or out of cells, or even to pump certain ions in or out of a cell against an electrical potential gradient. This is very important in establishing the electrical potential across the membrane of a neuron, which allows the neuron to transmit information. For example, neurons use K+ and Na+ pumps to selectively control the amount of K+ or Na+ within. Below is depicted a K+ channel in the closed position, with the embedded gating proteins tightly shut, preventing K+ ions from entering the neuron:
When the K+ channel is opened, K+ ions can be pumped from the outside into the neuron:
+++++++++++++++++++@@ K @@++++++++++++++++++++++++++
|||||||||||||||||||||||||||||||||||@@ K @@||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||@@ K @@||||||||||||||||||||||||||||||||||||||||||||||||
+++++++++++++++++++@@ K @@++++++++++++++++++++++++++
So where does the information required to build these proteins come from?
Nucleic Acid - All genetic information on the planet is currently stored in nucleic acid. Nucleic acid is the Data Base Management System (DBMS) hardware used by all living things. Nucleic acid is used to store the instructions necessary to build proteins and comes in two varieties, DNA and RNA. DNA is the true data store of life, used to persist genetic data, while RNA is usually only used as a temporary I/O (Input/Output) buffer. However, there are some viruses, called retroviruses, that actually use RNA to persist genetic data too.
DNA is a very long molecule, much like a twisted ladder. The vertical sides of the ladder are composed of sugar and phosphate molecules and bound to the sides are rungs composed of bases. The bases are called Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Because the bases have different shapes, Adenine (A) can only pair with Thymine (T), and Cytosine (C) can only pair with Guanine (G) to form a base pair rung along the length of the DNA molecule. Again, this is accomplished by forming lock and key configurations between the various 3-dimensional charge distributions on the bases, as shown in Figure 9 of SoftwareChemistry. So the beautiful double-helix structure of DNA all goes back to the strange quantum mechanical characteristics of carbon, with its four valence electrons that have p orbitals with angular momentum, allowing for complex 3-dimensional bonds to form between atoms. If the Universe did not contain carbon, or if electrons did not have angular momentum in their p orbitals, or if electrons did not have to have a unique set of quantum numbers and could all just pile up in the 1s orbital of carbon atoms, there would be no life in the Universe, and we would not be here contemplating the wonder of it all. Again, this is an example of the weak Anthropic Principle in action.
DNA is replicated in living things by unzipping the two sides of the ladder and using each side of the ladder as a template to form its mirror image. This is all done with enzyme proteins that dramatically speed up the process to form two strands of DNA from the original copy. In bacteria, DNA can be replicated at about 1000 base pairs per second, while in human cells the replication rate is only about 50 base pairs per second, but replication can proceed at multiple forks simultaneously.
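The unzip-and-template idea can be sketched directly from the pairing rules. This is a simplified illustration: the function names and the sample strand are hypothetical, and the sketch treats the two sides position by position, ignoring the fact that real DNA strands run in opposite chemical directions.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Build the partner strand base by base using the A-T and C-G pairing rules."""
    return "".join(PAIR[base] for base in strand)


def replicate(ladder):
    """Unzip a (left, right) ladder and rebuild a partner for each side."""
    left, right = ladder
    return (left, complement(left)), (complement(right), right)


# Hypothetical 8-rung ladder: each side serves as the template for a new copy.
original = ("CGTTCGAA", complement("CGTTCGAA"))
copy_one, copy_two = replicate(original)
```

Because each side fully determines the other, both daughter ladders come out identical to the original.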
RNA is very similar to DNA in structure, except that RNA uses a substitute base called Uracil (U) in place of Thymine (T) to bind to Adenine (A). Also, RNA is usually only a half ladder; it does not have a complementary side as DNA does. However, the bases along a strand of RNA can bind to each other to form complex tangled structures, leaving some bases still exposed to bind to other molecules. This allows RNA to form complicated structures with some of the properties of enzymes. Thus RNA can form a tangled structure called transfer RNA (tRNA) and combine with small proteins to form ribosomes (rRNA). Ribosomes and tRNA are used in the process that translates the information copied from DNA into a protein, which will be described shortly. So RNA is a much more dynamic molecule than DNA because RNA can both store genetic information and perform limited dynamical operations on organic molecules in a manner similar to enzymes. We shall see shortly that the chief advantage of DNA is that it is a ladder with two sides, which allows for data to be persisted more accurately and allows for data recovery when there is a hardware failure within the molecule.
Computer hardware is built with semiconductors using base 2 or binary logic. The voltages on a transistor can be set so that the transistor is either conducting or nonconducting. These two states can be used to store a "1" or a "0", which we define as a bit. To define a unique bit map for all the characters and numbers we want to store, we use 8 bits, which we define as a byte. This yields 256 bit maps which can be used to code for characters. For example, in ASCII:
Character = Bits
A = 01000001
B = 01000010
C = 01000011
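The bit maps can be checked with a couple of lines of Python. This is just a small sketch; the helper name `to_bits` is made up for illustration.

```python
def to_bits(character: str) -> str:
    """Return the 8-bit pattern for a single character's numeric code."""
    return format(ord(character), "08b")


for character in "ABC":
    print(character, "=", to_bits(character))
```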
ASCII stands for American Standard Code for Information Interchange. Since computers can only understand numbers, an ASCII code is the numerical representation of a character such as 'A' or '@' or an action of some sort. ASCII was developed a long time ago, and now the non-printing characters 127 - 255 are rarely used for their original purpose. Below is the ASCII character set for values 0 - 127, and this includes descriptions for the first 32 non-printing characters. ASCII was actually designed for use with teletypes, like the one I used on my old DEC PDP 8/e minicomputer, and so the descriptions of the first 32 non-printing characters are now somewhat obscure, except for ACK, SYN, and NAK, still used in the TCP/IP protocol. Below are the meanings of some of the first 32 non-printing characters:
SOH – Start of Heading
STX – Start of Text
ETX – End of Text
EOT – End of Transmission
ENQ – Enquiry
ACK – Acknowledge
NAK – Negative Acknowledge
BEL – Sound the bell
BS – Backspace the printer carriage
TAB – tab the printer carriage
LF – Line Feed
CR – Carriage Return (return the printing carriage back to the left)
SYN – Synchronous Idle
ESC – Escape
SPACE – Print a blank space
The ASCII Code
Biological hardware uses base 4 logic because a DNA base can be in one of four states: A, C, G, or T. A unique bit map is required for each of the 20 amino acids, and all living things use this same bit map. A two-bit byte could code for 4 x 4 = 16 amino acids, which is not enough. A three-bit byte can code for 4 x 4 x 4 = 64 possible amino acids. A three-bit byte is, therefore, the smallest possible byte, and this is exactly what nature uses. Three DNA base pairs define a biological byte, which is the code for a particular amino acid. Biologists call these three-bit bytes codons. In the examples below, each codon is written as it appears on the DNA strand that gets copied into RNA, so the RNA copy carries the complementary bases.
SER = TCA
VAL = CAG
GLY = CCT
As with ASCII, there are 3 special “non-printing” bytes, or codons, used for control purposes, called stop codons – ATT, ACT, and ATC. The functions of these stop codons will be explained later.
The Genetic Code
As you can see, since we only need to code for 20 amino acids, but have 64 possible bit maps, there is quite a bit of redundancy, in that many of the three-bit bytes code for the same amino acid. Because living things have to deal with the second law of thermodynamics and nonlinearity, just as you do as a programmer, the coding map has evolved through natural selection, so that when likely errors do occur, the same amino acid still gets mapped. So if one of the three-bit bytes coding for a protein gets fat-fingered, you still end up with the same amino acid being loaded into the polypeptide chain for the protein under construction, like Word correcting your spelling on the fly when you fat-finger some text! For example, the byte GAA coding for the amino acid leucine LEU, might suffer a mutation to the byte GAT, but the new GAT byte still maps to the amino acid leucine LEU, so no harm is done.
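Both the byte-size argument and the redundancy of the coding map can be checked in code. The sketch below assumes that the triplets quoted in this posting are read off the DNA template strand, so each is complemented base by base before being looked up in the standard genetic code; the function names are hypothetical, and amino acids are given as one-letter codes (L = leucine, * = stop).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed for coding-strand codons TTT, TTC, TTA, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def decode_template_byte(triplet: str) -> str:
    """Map a template-strand triplet to its one-letter amino acid ('*' = stop)."""
    coding = "".join(COMPLEMENT[base] for base in triplet)
    return CODON_TABLE[coding]


def smallest_byte_size(symbols_needed: int, states_per_bit: int = 4) -> int:
    """Smallest number of base-4 'bits' able to label the required symbols."""
    size = 1
    while states_per_bit ** size < symbols_needed:
        size += 1
    return size


# The fat-fingered byte from the text: GAA mutating to GAT.
before = decode_template_byte("GAA")
after = decode_template_byte("GAT")
```

Under that template-strand assumption, `before` and `after` name the same amino acid, which is the error tolerance described above.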
DNA is really a very long sequential file, which is subdivided into genes. A gene is several thousand base pairs or bits long and defines the amino acid sequence of a particular protein. To build a protein, the DNA file is opened (the molecule literally opens along the base pair bonds), and the DNA bits are transcribed to an I/O buffer called messenger RNA (mRNA). The mRNA bases bind to the DNA bases, forming a complementary mirror image, according to the pairing rules A-U and C-G. This proceeds until the end of the gene is reached, which ends the transcription process. The mRNA then carries the code of the DNA molecule into the cytoplasm for further processing by spherical clumps of nucleic acid and protein called ribosomes. Ribosomes perform a function similar to the read/write head of a Turing machine. The ribosomes sequentially read the bytes along an mRNA strand and output a polypeptide chain (protein), halting when they reach one of the previously mentioned stop codons. This is accomplished by sequentially aligning amino acid charged transfer RNA (tRNA) according to the sequence on the mRNA strand. Parallel processing is achieved by pipelining multiple ribosomes simultaneously down the same strand of mRNA. The entire process is conducted by a set of specific enzymes at a rate of about 30 amino acids per second. The end result is a polypeptide chain of amino acids.
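The two-step pipeline, DNA to mRNA buffer to polypeptide, can be sketched as two small functions. This is a simplified illustration: the function names and the sample gene fragment are hypothetical, the template is complemented position by position, and the sketch ignores start codons and the other machinery a real ribosome needs.

```python
from itertools import product

DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}
# Standard genetic code for mRNA codons UUU, UUC, UUA, ... GGG ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
MRNA_TABLE = {"".join(c): aa
              for c, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)}


def transcribe(template: str) -> str:
    """Copy a DNA template into its complementary mRNA buffer."""
    return "".join(DNA_TO_MRNA[base] for base in template)


def translate(mrna: str) -> str:
    """Ribosome pass: read one codon at a time until a stop codon appears."""
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = MRNA_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        chain.append(amino_acid)
    return "".join(chain)


template = "TCACAGGAAATT"   # hypothetical fragment ending in the stop byte ATT
mrna = transcribe(template)
polypeptide = translate(mrna)   # one-letter amino acid codes
```

The loop in `translate` plays the role of the read head: it advances three bases at a time and stops writing output as soon as it reads a stop byte.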
The polypeptide chain then folds itself up into a protein, as it tries to minimize its free energy in accordance with the second law of thermodynamics. For example, an enzyme protein might fold itself up into a wrench shaped molecule, with an active site exposed for other molecules to plug into. The electrical charge distribution of the “R” side chains of the amino acids in the polypeptide chain at the active site will do all the work. Once again, see how living things do not fight the second law of thermodynamics, but use it to self-assemble parts. It’s like throwing a bunch of iron filings into a bag and having them self-assemble into a wrench, or throwing a bunch of scrabble tiles on the floor and having them self-assemble into perfect source code!
As you might expect, DNA is very sensitive to the second law of thermodynamics. Sometimes a DNA molecule will drop or add a base pair bit within a gene. This is usually fatal to the host cell because the word alignment of the gene bytes gets shifted, and all of the amino acids coded by the rest of the gene are incorrect. A usually less fatal mutation occurs when a bit changes from one state to another. In this case, the coded protein will be produced with only one amino acid in error. The shape of the erroneous protein may be close enough to the correct protein that it can serve the same function. However, such a substitution mutation can also have serious consequences. In each cell in your bladder, there is a gene with a byte containing the coding sequence below. It has been found that the alteration of one bit from G to T in the byte will produce a protein which causes the host cell to become malignant. Genes with this property are known as oncogenes.
NORMAL BYTE OR CODON
CANCER BYTE OR CODON
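The difference between dropping a bit and changing a bit is easy to see by re-splitting a gene into bytes. The gene fragment below is hypothetical and chosen only for illustration, as is the helper name `to_bytes`.

```python
def to_bytes(bits: str) -> list:
    """Split a string of bases into consecutive three-base bytes (codons)."""
    return [bits[i:i + 3] for i in range(0, len(bits) - 2, 3)]


gene = "TCACAGGAACCT"                      # hypothetical 4-byte gene fragment
dropped = gene[:4] + gene[5:]              # deletion: lose the base at index 4
substituted = gene[:4] + "T" + gene[5:]    # substitution: change that base to T

# Count how many bytes differ from the original, position by position.
changed_by_deletion = sum(
    a != b for a, b in zip(to_bytes(gene), to_bytes(dropped)))
changed_by_substitution = sum(
    a != b for a, b in zip(to_bytes(gene), to_bytes(substituted)))
```

After the deletion, every byte downstream of the lost base is misaligned, while the substitution corrupts only the single byte that contains it.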
Next we need to discuss how the information encoded in DNA genes is transcribed into a protein in greater detail. The DNA bases in a gene, used to code for the sequence of amino acids in a protein, are called the operon of the gene. Just upstream from the operon is a section of DNA which is used to initiate mRNA transcription and to identify the gene. These leading bases are called the promoter of the gene, and their purpose is to signal the onset of the gene. The enzyme which builds the mirror image mRNA strand from the DNA template is called RNA polymerase. The promoter consists of a sequence of bases, which provides RNA polymerase a section of DNA to grip onto and initiate transcription. For example, the promoter for E. coli bacteria consists of two strings, "TTGACA" and "TATAAT". The first string is located about 35 bits upstream of the codon portion of E. coli genes, while the second string is located about 10 bits upstream. Just past the promoter for the gene is a section of DNA called the operator. The operator is several bits long and its sequence of bases uniquely identifies the gene, like the key field on a record in a sequential file. When E. coli bacteria have enough of a particular protein, a repressor protein will begin to grab onto the operator for the gene of the protein no longer needed. The repressor protein derails the RNA polymerase before transcription can begin and essentially turns the gene off. Trailing the codon portion of genes is a sequence of DNA bases which signals the end of the gene, a stop codon. The DNA bases between the end of gene sequence of one gene and the promoter for the next gene consist of "junk" spacer DNA, which apparently does not encode information.
As we have seen, DNA is very fragile to mutation. One bad bit can be fatal to the host organism. Nature uses several techniques to protect the vital DNA, which are similar to techniques independently developed to protect computer data. To illustrate this, we shall compare the storage of data in DNA with standard magnetic tape. So let us take a nostalgic trip back to 1964 and the IBM System/360 (Figures 4 and 5), which introduced standard 9 track magnetic tape. These tapes were ½ inch wide and 2400 feet long and were stored on a reel. The tape had 9 separate tracks, 8 tracks were used to store data and one track, the parity track, was used as a check. The 8 bits of a byte were stored across the eight data tracks of the tape. Odd parity was used to check the 8 bits of a byte. If the 8 bits added up to an even number of 1s, then the parity bit was set to "1" which was the odd complement of the even number of 1s. If the 8 bits added up to an odd number of 1s, then the parity bit was set to "0". This allowed the computer to determine if one of the bits accidentally changed states from a "1" to "0" or vice versa.
|101101100| <- 8 data tracks followed by one parity track on far right
|011101100| the sum of all the 1s across the 9 tracks is an odd number
|100011101| <- 8 bits = 1 byte
|101001000| <- Multiple Blocked Records
|011001101| <- End of Block
| | <- Inter Record Gap = 0.60 inches
|101101100| <- Start of Next Block (Begins with the Record Key Field like a Social Security Number)
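The odd parity rule described above takes only a few lines to express. This is a minimal sketch with hypothetical function names, using the first row of the tape diagram as test data.

```python
def odd_parity_bit(data_bits: str) -> str:
    """Return the parity bit that makes the total count of 1s odd."""
    return "1" if data_bits.count("1") % 2 == 0 else "0"


def frame_is_valid(frame: str) -> bool:
    """Check a 9-track frame: 8 data bits followed by 1 parity bit."""
    return len(frame) == 9 and frame.count("1") % 2 == 1


frame = "10110110" + odd_parity_bit("10110110")   # 8 data bits plus parity
corrupted = "0" + frame[1:]                        # one bit flips from 1 to 0
```

A single flipped bit always turns the odd count into an even one, so `frame_is_valid(corrupted)` is False. Two flipped bits in the same frame would cancel out and go undetected, which is the known limit of a single parity track.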
Computer data was normally stored as records, a collection of data bytes coding for characters which had some particular meaning, such as the payroll data about a particular employee. To improve I/O, the records were normally blocked together into blocks of records, which could be read and written with one I/O operation by the tape drive. Between the blocks of records on a tape there was a 0.6 inch inter-record gap, which segregated blocks of records. The tapes could be read forwards or backwards, under program control, to allow the programmer to get to particular groups of records, as can be seen in old science fiction movies of computers with blinking lights and tape drives rapidly spinning back and forth in the background. To identify a particular record, a unique key field was frequently used for the first few bytes of each record. For example, the first 9 bytes of a payroll record might be an employee's social security number. This key field would uniquely identify the data on the record as belonging to a particular employee.
Typically, a programmer could access a tape with the following JCL (Job Control Language) statements.
//SYSUT1 DD DSN=DECEMBER.PAYROLL.MASTER,VOL=SER=111111,
//          UNIT=TAPE,DISP=OLD,
//          DCB=(RECFM=FB,LRECL=80,BLKSIZE=8000)
This JCL explains that the DSN or DataSet Name of the data on the tape is the DECEMBER.PAYROLL.MASTER and that the Volume Serial number that appears on the outside of the tape reel is “111111”, which allows the computer operator to fetch the proper tape. The UNIT=TAPE lets everybody know this is a tape file and not a file on a disk drive. The DISP=OLD means that the data on the tape already exists, and we do not need a fresh blank tape to write on. The RECFM=FB means that this tape is blocked. The LRECL record length of individual records in a block is 80 bytes. The block size BLKSIZE is 8000 bytes, so that means there are 100 records in each block of data on the tape. The story goes that JCL was developed by IBM one weekend as a temporary job control language, until a “real” job control language could be written for the new System/360 operating system. However, the above JCL will still work just fine at any datacenter running the IBM z/OS operating system on a mainframe.
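The relationship between LRECL and BLKSIZE can be illustrated with a short sketch of deblocking, the step that splits one physical block back into logical records. The function name `deblock` and the fake payroll records are hypothetical; real access methods on the mainframe do this work for the programmer.

```python
LRECL = 80        # logical record length in bytes
BLKSIZE = 8000    # block size in bytes

records_per_block = BLKSIZE // LRECL


def deblock(block: bytes, lrecl: int = LRECL) -> list:
    """Split one fixed-blocked (RECFM=FB) block into its logical records."""
    if len(block) % lrecl != 0:
        raise ValueError("block length must be a multiple of LRECL")
    return [block[i:i + lrecl] for i in range(0, len(block), lrecl)]


# Build one fake block: each record starts with a 9-byte key field
# and is padded with blanks out to 80 bytes.
block = b"".join(f"{n:09d}".encode().ljust(LRECL)
                 for n in range(records_per_block))
records = deblock(block)
```

One read of the tape drive therefore delivers 100 records at once, which is the whole point of blocking.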
Originally, 9 track tapes had a density of 1600 bytes/inch of tape, with a data transfer rate of 15,000 bytes/second. Later, 6250 bytes/inch tape drives became available, with a maximum data capacity of 170 megabytes for a 2400 ft reel blocked at 32,767 bytes per block. Typically, much smaller block sizes, such as 4K (4,096 bytes) were used, in which case the storage capacity of the tape was reduced by 33% to 113 megabytes. Not too good, considering that you can now buy a PC disk drive that can hold 2,000 times as much data for about $100.
Data storage in DNA is very similar to data storage on magnetic tape. DNA uses two tracks, one data track and one parity track. Every three biological bits (bases) along the data track form a biological byte, or codon, which is the code for a particular amino acid. The biological bits are separated by 3.4 Angstroms along the DNA molecule, yielding a data density of about 25 megabytes/inch. Using this technology, human cells store 2 billion bytes of data on 7 feet of DNA. The 100 trillion cells in the human body manage to store and process 200,000 billion gigabytes of data stored on 700 trillion feet of DNA, enough to circle the Earth 5.3 million times.
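The figures above can be checked with a few lines of arithmetic. The 3.4 Angstrom spacing, the 3 bases per biological byte, the 7 feet of DNA per cell, and the 100 trillion cells are from the text; the Earth circumference of 24,901 miles is a value I am assuming for the equator.

```python
ANGSTROMS_PER_INCH = 2.54e8          # 1 inch = 2.54 cm = 2.54e8 Angstroms
BASE_SPACING_ANGSTROMS = 3.4         # spacing between bases, from the text
BASES_PER_CODON = 3                  # 3 biological bits = 1 biological byte
FEET_OF_DNA_PER_CELL = 7             # from the text
CELLS_PER_BODY = 100e12              # from the text
EARTH_CIRCUMFERENCE_MILES = 24_901   # assumed equatorial value
FEET_PER_MILE = 5280

bases_per_inch = ANGSTROMS_PER_INCH / BASE_SPACING_ANGSTROMS
codons_per_inch = bases_per_inch / BASES_PER_CODON            # "bytes" per inch
bytes_per_cell = codons_per_inch * FEET_OF_DNA_PER_CELL * 12
gigabytes_per_body = bytes_per_cell * CELLS_PER_BODY / 1e9
feet_per_body = FEET_OF_DNA_PER_CELL * CELLS_PER_BODY
laps_around_earth = feet_per_body / FEET_PER_MILE / EARTH_CIRCUMFERENCE_MILES

if __name__ == "__main__":
    print(f"density:        {codons_per_inch / 1e6:.1f} megabytes/inch")
    print(f"per cell:       {bytes_per_cell / 1e9:.2f} billion bytes")
    print(f"per body:       {gigabytes_per_body:.2e} gigabytes")
    print(f"laps of Earth:  {laps_around_earth / 1e6:.1f} million")
```

Under these assumptions the numbers come out close to the rounded values quoted in the paragraph.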
|-CG-| <- One Data Track on Left, With Corresponding Parity Track on Right
|-GC-| | 3 Bits = 1 Byte
|-TA-| <- Multiple blocked genes
|-TA-| <- End of Gene Block
|-CG-| <- Inter Gene Gap of Junk DNA
|-GC-| <- Start of Next Gene Block
|-AT-| <- Gene Block Key Field (Promoter and Operator)
|-AT-| <- First Gene of Block (Operon)
The higher forms of life (eukaryotes) store DNA in the nucleus of the cell folded tightly together with protective structural proteins called histones to form a material known as chromatin out of which the structures known as chromosomes form. There are 6 histone proteins and DNA is wound around these proteins like magnetic tape was wound around tape spools in the 1960s and 1970s. These histone proteins are so critical that all eukaryotic forms of life, from simple yeasts to peas, cows, dogs, and people, use essentially the same histone proteins to wind up DNA into chromatin. The simpler forms of life, such as bacteria, are known as prokaryotes. Prokaryotes do not have cell nuclei; their DNA is free to float in the cell cytoplasm unrestrained. Because the simple prokaryotes are basically on their own and have to react quickly to an environment beyond their control, they generally block genes together based upon functional requirements to decrease I/O time. For example, if it requires 5 enzymes to digest a particular form of sugar, bacteria would block all 5 genes together to form an operon which can be transcribed to mRNA with one I/O operation. The genes along the operon gene block are variable in length and are separated by stop codons or end-of-gene bytes. The stop codons cause the separate proteins to peel off as the ribosome read/write head passes over the stop codons on the mRNA.
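The idea of a ribosome read/write head peeling off one protein per stop codon can be sketched in Python. The codon table below is the standard genetic code; the function name, the sample mRNA string, and the rule of simply scanning for the next AUG start codon are my own simplifications. Real bacterial translation also relies on ribosome binding sites, which this sketch ignores.

```python
from typing import List

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_operon(mrna: str) -> List[str]:
    """Return one amino acid string per gene found on a blocked mRNA."""
    proteins = []
    pos = 0
    while True:
        start = mrna.find("AUG", pos)   # look for the next start codon
        if start == -1:
            break
        protein = []
        stopped = False
        i = start
        while i + 3 <= len(mrna):
            amino = CODON_TABLE[mrna[i:i + 3]]
            i += 3
            if amino == "*":            # end-of-gene byte: protein peels off
                stopped = True
                break
            protein.append(amino)
        if stopped:                     # ignore a gene with no stop codon
            proteins.append("".join(protein))
        pos = i
    return proteins


if __name__ == "__main__":
    # Hypothetical two-gene operon read in a single pass.
    print(translate_operon("GGAUGAAAUUUUAACCAUGGGCUGGUAGCC"))
```

One pass over the blocked mRNA yields both proteins, which is the biological equivalent of reading a whole block of records with one I/O operation.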
Eukaryotic cells, on the other hand, generally perform specific functions within a larger organism. They exist in a more controlled environment and are therefore less I/O intensive than the simpler prokaryotes. For this reason, eukaryotes generally use unblocked genes. For both eukaryotes and prokaryotes, each gene or block of genes is separated by an inter-gene gap, composed of nonsense DNA, similar to the inter-record gap of a 9 track tape. Whether the operon block consists of one gene or several genes, it is usually identified by a unique key field called the operator (recall that the operator lies just past the promoter for the gene), just as the records on a 9 track tape use a key field, like a social security number, to uniquely identify a record on the tape. The operator is several bytes long and its sequence of bases uniquely identifies the gene. To access genes, cells attach or detach a repressor molecule to the operator key field to turn the gene on or off.
Nature also uses odd parity as a data check. If a bit on the data track is an A, then the parity bit will be a T, the complement of A. If the data bit is a G, then the parity bit will be a C. Using only two tracks decreases the data density of DNA, but it also improves the detection and correction of parity errors. In computers, when a parity error is detected, the data cannot be restored because one cannot tell which of the 8 data bits was mutated. Also, if two bits should change, a parity error will not even be detected. By using only one data track and an accompanying parity track, parity errors are more reliably detected, and the data track can even be repaired based upon the information in the parity track. For example, during DNA replication, three enzymes are used to check, double check, and triple check for parity errors. Detected parity errors are corrected at the time of replication by these special enzymes. Similarly, if the data track of a DNA molecule should get damaged by radiation from a cosmic ray, the DNA molecule can be repaired by fixing the data track based on the information contained on the parity track. There are enzyme proteins which are constantly running up and down the DNA molecule checking for parity errors. When an error is detected, the error is repaired using the information from the undamaged side of the molecule. This works well if the error rates remain small. When massive doses of radiation are administered, the DNA cannot be repaired fast enough, and the organism dies.
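The contrast between a single parity bit per byte and a full parity track can be sketched as follows. The helper names and sample data are assumptions for illustration. The repair function also assumes we already know that the parity track is the undamaged side, which in a real cell is a separate problem.

```python
from typing import List

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def odd_parity_bit(data_bits: List[int]) -> int:
    """Parity bit that makes the total number of 1s odd."""
    return 0 if sum(data_bits) % 2 == 1 else 1


def parity_ok(data_bits: List[int], parity_bit: int) -> bool:
    """True when the data bits plus parity bit hold an odd number of 1s."""
    return (sum(data_bits) + parity_bit) % 2 == 1


def parity_track(data_track: str) -> str:
    """Build the complementary DNA strand, one parity base per data base."""
    return "".join(COMPLEMENT[base] for base in data_track)


def find_mismatches(data_track: str, parity: str) -> List[int]:
    """Positions where the data base is not the complement of the parity base."""
    return [
        i for i, (d, p) in enumerate(zip(data_track, parity))
        if COMPLEMENT.get(d) != p
    ]


def repair_data_track(data_track: str, parity: str) -> str:
    """Fix mismatched data bases, assuming the parity track is undamaged."""
    repaired = list(data_track)
    for i in find_mismatches(data_track, parity):
        repaired[i] = COMPLEMENT[parity[i]]
    return "".join(repaired)


if __name__ == "__main__":
    byte = [1, 0, 1, 1, 0, 0, 1, 0]
    p = odd_parity_bit(byte)
    byte[0] ^= 1
    print("one flipped bit detected:", not parity_ok(byte, p))
    byte[1] ^= 1
    print("two flipped bits detected:", not parity_ok(byte, p))

    good = "ACGTTAGC"
    parity = parity_track(good)
    damaged = "ACGATAGC"
    print("mismatch positions:", find_mismatches(damaged, parity))
    print("repaired:", repair_data_track(damaged, parity))
```

The single parity bit can only say that something in the byte is wrong, while the parity track says exactly which position is wrong and what it should be.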
Molecular biologists are currently troubled by the complexity of DNA in eukaryotic cells. As we have seen, prokaryotic cells usually block their genes and have a unique key field (operator) for each gene block (operon). Eukaryotic DNA data organization is apparently much more complicated. Eukaryotes do not usually block genes. They generally locate each gene on a separate block, separated by inter-gene gaps. The internal structure of the eukaryotic gene is also peculiar. Eukaryotic genes are composed of executable DNA base sequences (exons), used to code for amino acid sequences, which are interrupted by non-executable base sequences of "nonsense DNA" called introns. Introns range from 65 bits to 100,000 bits in length, and a single gene may contain as many as 50 introns. In fact, many genes have been found to contain more intron bases than exon bases. This explains why eukaryotic genes are so much longer than prokaryotic genes. Much of the DNA in eukaryotic genes is actually intron DNA, which does not encode information for protein synthesis.
When a eukaryotic gene is transcribed, both the exons and introns of the gene are copied to a strand of pre-mRNA. The beginning of an embedded intron is signaled by a 17 bit string of bases, while the end of the intron is signaled by a 15 bit string. These sections of intron mRNA must be edited out of the pre-mRNA, before the mRNA can be used to synthesize a protein. The editing is accomplished in the nucleus by a structure called a spliceosome. The spliceosome is composed of snRNP's (pronounced "SNURPS"), whose function is to edit out the introns and splice the exons together to form a strand of mRNA suitable for protein synthesis. Once the editing process is complete, the spliceosome releases the mRNA, which slips through a nuclear pore into the cell cytoplasm where the ribosomes are located.
+ snRNP's in Spliceosome ----> mRNA with Introns Removed
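The editing job done by the spliceosome resembles a text substitution that deletes everything between a start marker and an end marker. In the sketch below the two marker strings are short placeholders I made up for illustration; they are not the actual 17 bit and 15 bit signal sequences mentioned above, and the function name is likewise hypothetical.

```python
import re

# Placeholder intron boundary signals (illustrative assumptions only).
INTRON_START = "GUAAGU"
INTRON_END = "UUUCAG"


def splice(pre_mrna: str,
           intron_start: str = INTRON_START,
           intron_end: str = INTRON_END) -> str:
    """Remove every intron, markers included, and join the exons."""
    pattern = re.escape(intron_start) + ".*?" + re.escape(intron_end)
    return re.sub(pattern, "", pre_mrna)


if __name__ == "__main__":
    exons = ["AUGGCC", "AAAUGG", "UAA"]
    introns = [INTRON_START + "CCCCC" + INTRON_END,
               INTRON_START + "AC" + INTRON_END]
    pre_mrna = exons[0] + introns[0] + exons[1] + introns[1] + exons[2]
    print("pre-mRNA:", pre_mrna)
    print("mRNA:    ", splice(pre_mrna))
```

The non-greedy match stops at the first end marker after each start marker, so each intron is removed separately and the exons between them survive.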
The following section is from an internal paper I wrote at Amoco in March of 1987 and is unchanged from the original paper. Some of these ideas have proven true in the opinion of some molecular biologists, but the role of introns still remains a mystery.
Molecular biologists are puzzled as to why perhaps 50% of the DNA in a eukaryotic gene is composed of intron "nonsense DNA". I find it hard to believe that such an inefficient data storage method would be so prevalent. Perhaps some clues from classical data processing would be of assistance. Programmers frequently include non-executable statements in programs to convey information about the programs. These statements are called comment statements and are usually bracketed by some defining characters such as /* */. These characters tell the compiler or interpreter to edit out, or skip over, the comment statements since they are non-executable. The purpose of the comment statements is to convey information to programmers about the program, while normal program statements are used to instruct the performance of the computer. Another example of non-executable information is the control information found within relational tables. When one looks at the dump of a relational table, one can easily see the data stored by the table in the form of names, locations, quantities, etc. in the dump. One can also see a great deal of "nonsense data" mixed in with the real data. The "nonsense data" is control information which is used by the relational DBMS to locate and manage the information within the relational table. Even though the control information does not contain information about mailing lists or checking accounts, it is not "nonsense data". It only appears to be "nonsense data" to programmers who do not understand its purpose. When molecular biologists examine a "dump" of the bases in a eukaryotic gene, they also can easily identify the bases used to code for the sequence of amino acids in a protein. Perhaps the "nonsense" bases they find within the introns in the "dump" might really represent biological control information for the gene. For example, the introns might store information about how to replicate the gene or how to fold it up to become part of a chromosome.
Programmers also use comment statements to hide old code from the compiler or interpreter. For example, if a programmer changes a program, but does not want to throw away the old code, he will frequently leave the code in the program, but surround it with comment characters so that it will be skipped over by the compiler or interpreter. Perhaps introns are nature's way of preserving hard fought for genetic information, which is not currently needed, but might prove handy in the future. Or perhaps introns represent code which may be needed at a later stage of development and are involved with the aging process. The concept that introns represent non-executable control information, which is not used for protein synthesis, but is used for something we are currently unaware of seems to make more sense from the standpoint of natural selection, than does the concept of useless "nonsense DNA".
------------------------/*Keep this code in case we need it later*/------------------------------------------
One Final Very Disturbing Thought
Let me conclude our IT analysis of softwarebiology with one troubling problem. There is something definitely wrong with the data organization of complex living things, like you and me, that are composed of eukaryotic cells. Prokaryotic bacterial cells are the epitome of efficiency. As we have seen, prokaryotic bacteria generally block their genes, so that the genes can be read with a single I/O operation, while eukaryotic cells do not. Also, eukaryotic cells have genes containing large stretches of intron DNA, which sometimes is used for regulation purposes, but is not fully understood at this point. Eukaryotic cells also contain a great deal of junk DNA between the protein-coding genes, which is not used to code for proteins at all. Each human cell contains about 3 billion base pairs that code for about 20,000 – 25,000 proteins, but fully 98% of these 3 billion base pairs apparently code for nothing at all. It is just junk DNA in introns and in very long gaps between the protein-coding genes. Frequently, you will find the same seemingly mindless repeating sequences of DNA base pairs over and over in the junk DNA, going on and on, for hundreds of thousands of base pairs. Prokaryotic bacteria, on the other hand, are just the opposite. Every base pair of DNA in bacteria has a purpose. Now it takes a lot of time and energy to replicate all that junk DNA when a human cell divides and replicates; in fact it takes many hours to replicate, and it is the limiting factor in determining how quickly human cells can actually replicate. On the other hand, prokaryotic bacteria are champs at replicating. For example, E. coli bacteria come in several strains that have about 5 million base pairs of DNA, containing about 5,000 genes coding for proteins. As with all bacteria, the DNA in E. coli is just one big loop of DNA, and at top speed it takes about 30 minutes to replicate the 5 million base pairs of DNA. But E. coli can replicate in 20 minutes. How can they possibly do that? 
Well, when an E. coli begins to replicate its loop of DNA into two daughter loops, each of the two daughter loops also begin to replicate themselves before the mother E. coli even has a chance to finish dividing! That is how they can compress a 30 minute process into a 20 minute window. Now try that trick during your next tight IT maintenance window!
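This overlap is essentially pipelining, and a small timeline model shows how it works. The 30 minute replication time and the 20 minute division time are from the text; the assumption that a new round of replication starts exactly once per division interval is mine, as is the function name.

```python
REPLICATION_MINUTES = 30  # time to copy the whole loop of DNA, from the text
DIVISION_MINUTES = 20     # time between cell divisions, from the text


def rounds_in_progress(minute: int,
                       replication: int = REPLICATION_MINUTES,
                       division: int = DIVISION_MINUTES) -> int:
    """Count replication rounds running at a given minute.

    Assumes round k starts at k * division and runs for `replication` minutes.
    """
    count = 0
    start = 0
    while start <= minute:
        if minute < start + replication:
            count += 1
        start += division
    return count


if __name__ == "__main__":
    for minute in range(0, 61, 5):
        print(f"minute {minute:>2}: {rounds_in_progress(minute)} round(s) running")
```

In this model a second round is always under way during the last 10 minutes of the previous one, so a finished copy of the DNA comes off the line every 20 minutes even though each copy takes 30 minutes to make.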
So here is the dilemma. Simple prokaryotic bacteria are the epitome of good IT database design, while the eukaryotic cells used by the “higher” forms of life, like you and me, are an absolute disaster from a database design perspective! They certainly would never pass a structured database design review. The question would constantly come up, “Why in the world would anyone possibly want to do that???”. But as Ayn Rand cautioned, when things do not seem to make sense, be sure to “check your premises”. The problem is that you are looking at this entirely from an anthropocentric point of view. In school you were taught that your body consists of 100 trillion cells, and that these cells use DNA to create proteins that you need to replicate and operate your cells. But as Richard Dawkins explains in The Selfish Gene (1976), this is totally backwards. We do not use genes to protect and replicate our bodies; genes use our bodies to protect and replicate genes! We are DNA survival machines! Darwin taught us that natural selection was driven by survival of the fittest. But survival of the fittest what? Is it survival of the fittest species, species variety, or possibly the fittest individuals within a species? Dawkins notes that none of these things actually replicate, not even individuals. All individuals are genetically unique, so they never truly replicate. What does replicate are genes, so for Dawkins, natural selection operates at the level of the gene. These genes have evolved over time to team up with other genes to form bodies or DNA survival machines, that protect and replicate DNA, and that is why the higher forms of life are so “inefficient” when it comes to DNA. The DNA in higher forms of life is not trying to be “efficient”; it is trying to protect and replicate as much DNA as possible. Prokaryotic bacteria are small DNA survival machines that cannot afford the luxury of taking on any “passenger” junk DNA.
Only large multicellular cruise ships like us can afford that extravagance. If you have ever been a “guest” on a small sailing boat, you know that there is no such thing as a “passenger” on a small sailboat; it's always "all hands on deck" - and that includes the "guests"! Individual genes have been selected for one overriding trait, the ability to replicate, and they will do just about anything required to do so, like seeking out other DNA survival machines to mate with and rear new DNA survival machines. In Blowin’ in the Wind, Bob Dylan asked the question, “How many years can a mountain exist; Before it's washed to the sea?” Well, the answer is a few hundred million years. But some of the genes in your body are billions of years old, and as they skip down through the generations largely unscathed by time, they spend about half their time in female bodies and the other half in male bodies. If you think about it, all your physical needs and desires are geared to ensuring that your DNA survives and gets passed on, with little regard for you as a disposable DNA survival machine - truly one of those crescent Moon epiphanies! I strongly recommend that all IT professionals read The Selfish Gene, for me the most significant book of the 20th century, because it explains so much. For a book written in 1976, it makes many references to computers and data processing that you will find extremely interesting. Dawkins has written about a dozen fascinating books, and I have read them all, many of them several times over. He definitely goes on the same shelf as Copernicus, Galileo, Newton, and Darwin for me.
Because of all the similarities we have seen between biological and computer software, resulting from their common problems with the second law of thermodynamics and nonlinearity and their similar convergent historical evolutionary paths to solve those problems with the same techniques, we need to digress a bit before proceeding and ask the question - is there something more profound afoot? Why are biological and computer software so similar? Could they both belong to a higher classification of entities that face a commonality of problems with the second law of thermodynamics and nonlinearity? That will be the subject of the next posting, which will deal with the concept of self-replicating information. Self-replicating information is information that persists through time by making copies of itself or by enlisting the support of other things to ensure that copies of itself are made. We will find that the DNA in living things, Richard Dawkins’ memes, and computer software are all examples of self-replicating information, and that the fundamental problem of software, as outlined in the three laws of software mayhem, might just be the fundamental problem of everything.
Comments are welcome at firstname.lastname@example.org
To see all posts on softwarephysics in reverse order go to: