The next few postings may do some damage to your anthropocentric inclinations, so at all times please keep in mind that you are an intelligent being in a universe that has become self-aware. What a tremendous privilege! Indeed, this is a precious and unique privilege that should be cherished each day and treated with reverence and respect and never squandered in any way. In this posting, we will begin to steer away from the topics in softwarephysics that rely mainly on physics and begin to move in the direction of the biological aspects of softwarephysics instead. It will basically be a review of high school biology, but from an IT perspective.
Living things are very complicated and highly improbable things, so they are clearly very low in entropy (disorder) and rich in information. How can this be in a Universe subject to the second law of thermodynamics, which requires that total entropy always increase whenever a change is made? The answer is that living things are great masters at excreting entropy, dumping their disorder into heat energy, and they learned how to do this through Charles Darwin’s marvelous effective theory of evolution, which is based upon the concepts of genetic variation and natural selection. It may seem ironic, but we shall soon see that low entropy living things could not even exist if there were no second law of thermodynamics!
How to Dump Entropy
To see how this can be accomplished let us return to our poker game analogy and watch the interplay of entropy, information, and heat energy. Again, I am exclusively using Leon Brillouin’s concept of information that defines a change in information as the difference between the initial and final entropies of a system after a change has been made:
∆I = Si - Sf
Si = initial entropy
Sf = final entropy
and the entropy S of a system is defined in terms of the number of microstates N that make up a given macrostate of the system:
S = k ln(N)
The entropy S of a system is a measure of the amount of disorder in the system. We can arbitrarily set k = 1 for the sake of simplicity for our IT analysis, because we are not comparing our units of entropy S with the entropy of chemical or physical reactions, so we don’t need to use Boltzmann’s constant k to make things come out right on both sides of the equation.
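To make the macrostate/microstate distinction concrete before we return to poker, here is a little Python sketch of my own (an illustration I am adding, not part of the original discussion) that computes S = ln(N) for a few macrostates of a pair of dice, with k = 1 as above:

```python
import math

# Macrostate = the sum of two dice; microstate = the ordered pair rolled.
# With k = 1, the entropy of a macrostate is S = ln(N), where N is the
# number of microstates that yield that macrostate.
microstate_counts = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        total = d1 + d2
        microstate_counts[total] = microstate_counts.get(total, 0) + 1

for total in (2, 7, 12):
    n = microstate_counts[total]
    print(f"sum = {total:2d}: N = {n}, S = ln({n}) = {math.log(n):.4f}")
```

A sum of 7 can be rolled six different ways, so it is the high-entropy, disordered macrostate, while snake eyes has only one microstate and an entropy of exactly zero.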
Now in The Demon of Software we used a slightly modified version of poker to illustrate the above concepts. Specifically, I modified poker so that all hands of a certain rank carried the same value and thus represented a single macrostate, and each particular hand represented a single microstate. For example, there are 54,912 possible microstates, or hands, that constitute the macrostate of holding three of a kind. So we can calculate the amount of information in a three of a kind as:
There are a total of 2,598,960 possible poker hands, yielding an initial entropy of:
Si = ln(2,598,960) = 14.7706
The entropy of three of a kind is:
Sf = ln(54,912) = 10.9135
So the net information contained in a three of a kind is:
∆I = Si - Sf
∆I = 14.7706 - 10.9135 = 3.8571
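If you would like to check these figures for yourself, the hand counts and entropies fall out of a little combinatorics (note that choosing 5 cards from 52 works out to exactly 2,598,960 hands). A short Python sketch:

```python
from math import comb, log

# Total 5-card hands from a 52-card deck.
total_hands = comb(52, 5)                                 # 2,598,960

# Three of a kind: pick the rank (13 ways), pick 3 of its 4 suits,
# pick 2 distinct kicker ranks from the remaining 12, one suit each.
three_of_a_kind = 13 * comb(4, 3) * comb(12, 2) * 4 * 4   # 54,912

S_initial = log(total_hands)       # entropy before the deal, with k = 1
S_final = log(three_of_a_kind)     # entropy of the three-of-a-kind macrostate
delta_I = S_initial - S_final      # Brillouin's change in information

print(f"Si = {S_initial:.4f}, Sf = {S_final:.4f}, dI = {delta_I:.4f}")
```

The decrease in entropy between being dealt some unknown hand and holding three of a kind is exactly the information you gained.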
Now let me bend the rules of poker a little more. After you are dealt five cards by the dealer, I am going to let you rifle through the dealer’s deck, while his back is turned, to look for the cards that you need. Suppose that your first five cards are worthless garbage, but before the dealer catches you, you manage to swap three cards with him, so that you now have three of a kind. What have you done? Notice that you managed to decrease the entropy of your hand and increase its information content at the same time, in apparent violation of the second law of thermodynamics – just like Maxwell’s Demon. Of course, you did not really violate the second law of thermodynamics; you merely decreased the entropy and increased the information content of your hand by dumping some entropy elsewhere. First of all, quickly swapping three cards while the dealer’s back was turned took some thought on your part and required the degradation of some low entropy chemical energy in your brain into high entropy heat energy, and if you did this at a Las Vegas gaming table, the stress alone would have generated quite a bit of heat energy in the process! Also, there was the motion of your hands across the table and the friction of the cards sliding by, all of which created heat and contributed to a net overall entropy increase of the Universe. So with the availability of a low entropy fuel, it is always possible to decrease the entropy of a system and increase its information content at the same time by simply dumping entropy into heat. Living things manage to do this with great expertise indeed: they must assemble large complex organic molecules from atoms or from small molecules known as monomers. So in a sense, the heat radiating from your body is your way of excreting entropy, which allows your body to decrease its internal entropy and increase its internal information content.
The moment you die, you begin to cool off, and you begin to disintegrate.
Darwin’s Theory of Evolution
It seems that Darwin’s theory of evolution is one of those things that most people know of, but few people seem to understand. This is rather surprising, because Darwin’s theory is so simple and yet so elegant that it is definitely one of those “Now why didn’t I think of that!” things in science. Also, as we shall soon see, most people on Earth constantly use Darwin’s theory in their everyday lives; you certainly do if you live in the United States. First of all, Darwin’s theory is not a proposition that the current biosphere of this planet arose from simpler forms of life. That is an observed fact, as we shall soon see. What Darwin’s theory does is offer a brilliant explanation of how simpler forms of life could have evolved into the current biosphere we see today. At least in the United States, this idea seems to cause many religious people a great deal of concern. However, if you promise to stay with me through this section, I will try to demonstrate that it really should not.
So we have seen that living things are very complicated things rich in information and very low in entropy, in apparent contradiction to the second law of thermodynamics. So how did all this come about? I will save the origin of life for the next posting on self-replicating information, so let us begin with a very simple bacterial form of life as a starting point. How could it have evolved into more complicated things? In On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life (1859), usually abbreviated to On the Origin of Species, Darwin proposed:
1. Populations of living things tend to grow geometrically, while resources grow arithmetically or not at all, so populations have a tendency to outstrip their resource base, causing a "struggle for existence" amongst individuals. This idea and terminology were borrowed from An Essay on the Principle of Population (1798) by the economist Thomas Malthus, which Darwin had previously read.
2. There is always some genetic variation of characteristics within a population of individuals. Darwin noted this for many of the species that he observed on his five-year voyage aboard HMS Beagle (1831 – 1836), where he served as the onboard naturalist.
3. Individuals within a population, who have characteristics that are better adapted to the environment, will have a greater chance of surviving and passing these traits on to their offspring.
4. Over time, the frequency of these beneficial traits will increase within a population and come to dominate.
5. As the above effects continue, new species slowly emerge from existing species through the accumulation of small incremental changes that always enhance survival.
6. Most species eventually go extinct, as they are squeezed out of ecological niches by better adapted species.
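For the programmers out there, the mechanism in the points above is simple enough to capture in a few lines of code. Below is a minimal toy simulation of my own devising – nothing from Darwin, and with an arbitrary 5% survival advantage and mutation rate that I have simply made up – showing how a beneficial trait spreads through a resource-limited population of fixed size:

```python
import random

random.seed(42)   # reproducible run

# A population of fixed size - limited resources cap growth, per Malthus.
# Each individual either carries a beneficial trait (True) or not (False).
POP_SIZE = 1000
ADVANTAGE = 1.05                    # 5% better chance of leaving offspring
MUTATION_RATE = 0.001               # rare random flips: genetic variation
population = [False] * 900 + [True] * 100   # the trait starts at 10%

def next_generation(pop):
    # Natural selection: parents are drawn in proportion to their fitness.
    weights = [ADVANTAGE if trait else 1.0 for trait in pop]
    children = random.choices(pop, weights=weights, k=POP_SIZE)
    # Genetic variation: occasional mutations flip the trait either way.
    return [not c if random.random() < MUTATION_RATE else c for c in children]

for _ in range(300):
    population = next_generation(population)

frequency = sum(population) / POP_SIZE
print(f"Beneficial trait frequency after 300 generations: {frequency:.2f}")
```

Even a tiny per-generation advantage compounds relentlessly, so over a few hundred generations the trait sweeps to near fixation, held just short of 100% by the steady trickle of mutations.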
Darwin developed the above ideas in 1838, upon his return from the voyage of the Beagle. He called these effects “natural selection”, as opposed to the artificial selection that dog breeders conducted. Darwin noted that dog breeders could quickly create new dog breeds in just about a hundred years by crossbreeding dogs with certain desired traits. My wife and I frequently watch the annual Westminster Kennel Club dog show, and I too am amazed that all those very different looking dogs were basically bred from one original line of dog. To my untrained eye, these dogs all look like different species, and if some charlatan confidently presented a raccoon with a dye job on the competition floor, I would probably be fooled into thinking it really was some new exotic breed of dog.
Darwin’s concepts of innovation and natural selection were also seemingly borrowed from another economist that Darwin had previously read. In An Inquiry into the Nature and Causes of the Wealth of Nations (1776), Adam Smith proposed that if governments did not interfere with free trade, an “invisible hand” would step in and create and expand a complicated national economy seemingly out of nothing. So you see, Darwin’s theory of evolution is really just an application of the 18th century idea of capitalism to biology. In a capitalistic economy there is always a struggle for existence amongst businesses caused by the competition for resources. When a business innovates, it sometimes gains an economic advantage over its competitors, but most times the innovation fails, and the business might even go bankrupt. Indeed, about 90% of new businesses do go bankrupt and become extinct. However, if the innovation is successful, the business will thrive and expand in the marketplace. As businesses continue to innovate, slowly adding new innovations that enhance their survival, they slowly change, until eventually you cannot even recognize the original business – like IBM, which began in 1896 as the Tabulating Machine Company building punch card tabulators, and which later merged in 1911 with the International Time Recording Company, which built punch card time clocks for factories. These businesses essentially evolved into a new species. At other times, the original business is so successful that it continues on for hundreds of years nearly unchanged, like a shark. So if you live in a capitalistic economy, like the United States, you see Darwin’s theory of evolution in action every day in your economic life. Have you ever wondered who designed the very complicated, information rich, and low entropy economy of the United States?
As Adam Smith noted, it is truly amazing that no matter what you need or want, there is somebody out there more than willing to supply your needs and wants, so long as you are willing to pay for them. Who designed that? Of course, nobody designed it. Given the second law of thermodynamics in a nonlinear Universe, it really is impossible to design such a complicated economic system that meets all the needs of a huge marketplace. It is an irony of history that the atheistic communist states of the 20th century, which believed in an intelligently designed economy, produced economies that failed, while many very conservative and religious capitalists of the 20th century advocated Darwin’s approach of innovation and natural selection to allow economies to design themselves. So the second law of thermodynamics is essential for evolution to take place: first, because it is largely responsible for the genetic mutations that cause genetic variation, and second, because it limits the resources available to living things. If food and shelter spontaneously arose out of nothing and DNA replicated perfectly at all times, we would still be very fat and happy bacteria.
A few years ago, I set myself the task of reading some of the original great works of science, like Copernicus’s On the Revolutions of the Celestial Spheres (1543), Galileo’s The Starry Messenger (1610) and Dialogue Concerning the Two Chief World Systems (1632), Newton’s Principia (1687), and Darwin’s On the Origin of Species (1859). In each of these there are moments of sheer genius that still cause the 21st century mind to take pause. For example, in the Dialogue Concerning the Two Chief World Systems, Galileo observes that when you see a very narrow crescent Moon just after sunset, when the angle between the Sun and Moon is quite small, the portion of the Moon that is not lit directly by the Sun is much brighter than when the Moon is closer to being a full Moon. Galileo suggested that this was because, when the Moon is a crescent for us, the Earth would be a full Earth when viewed from the Moon, and the light from the full Earth would light up the lunar landscape quite brightly. This, at a time when nearly all of mankind thought that the Earth was the center of the Universe! Now I was an amateur astronomer as a teenager, having ground and polished the mirrors for two 6-inch homemade reflecting telescopes, and I must have observed hundreds of crescent Moons, but I never once made that connection! Similarly, in On the Origin of Species, Darwin suggests that if all the plants and animals on the Earth were suddenly created in one shot, how could you possibly stop evolution from immediately commencing? Why, the very next moment, they would all be chomping on each other and trying to grab sunlight from their neighbors. It would be like the commotion that follows one of those moments of silence in the trading pits of the Chicago Board of Trade.
Now for those of you with an intelligent design bent, which according to surveys includes about 2/3 of Americans, think of it this way. Although nobody designed the national economy of the United States, somebody did design the legal system that makes capitalism possible. Somebody set up the laws that allow for the private ownership of property, the right to freely enter into and enforce contracts, and the legal right to transfer property from one person to another. Intelligent beings also set up laws which rein in some of the harsher strategies of capitalism, such as killing your competitors and stealing all their inventory, or dumping industrial wastes on your competitor’s land – and hopefully someday, laws against freely dumping carbon dioxide into the atmosphere owned by everybody. Complex capitalistic economies did not spontaneously arise in the Soviet Union in the 1930s or in feudal Europe in the 10th century because their legal systems could not sustain capitalism. So if you really want to pursue the intelligent design thing, I would personally go after the physical laws of the Universe. In cosmology this is known as the strong Anthropic Principle. You see, what is driving everybody crazy just now is that the physical laws of the Universe appear as though they were designed for intelligent life! As I mentioned previously, if you change any of the 20+ constants of the Standard Model of particle physics by just a few percent, you end up with a Universe incapable of supporting life. Astrophysicist Brandon Carter first coined the term “anthropic principle” in 1973 and came up with two versions – the weak and strong Anthropic Principles. The weak version states that intelligent beings will only find themselves in a universe capable of supporting intelligent life, while the strong version states that the Universe must contain intelligent life.
The weak version might sound like a “no-brainer” – like saying that living things will not find themselves evolving on planets incapable of supporting life, and that is why there is nobody on Mercury contemplating the Universe – but it does have some predictive capabilities too. There was an old IT tactic from the 1960s that you could use whenever you heard a really stupid idea: all you had to say was, “Sure, you could do that, but then Payroll would not run”. Here is a similar example from the physical Universe. About 3 minutes after the Big Bang, 14 billion years ago, the Universe consisted of about 75% hydrogen nuclei (protons) and 25% helium-4 nuclei (two protons and two neutrons) by mass. The heavier atoms like carbon, oxygen, and nitrogen that living things are made of began to form in the cores of stars about a billion years later, as stars fused hydrogen and helium into increasingly heavier elements. When these stars blew themselves apart in supernovae, they spewed the heavier elements that we are made of out into the interstellar medium, out of which our Solar System later formed. The problem is that there is no stable element with an atomic weight of 5, so you cannot fuse a proton and a helium-4 nucleus together. However, you can fuse two helium-4 nuclei into a beryllium-8 nucleus. The beryllium-8 can then collide with another helium-4 nucleus to form carbon-12. This is all accomplished by the strong interaction of particle physics that we previously studied. The problem is that beryllium-8 is very unstable, and it should break apart in a collision with another helium-4 nucleus. But in 1954, astrophysicist Fred Hoyle predicted that carbon-12 must have a resonance near the likely collision energy of helium-4 and beryllium-8 to absorb the energy of the collision. This resonance would allow the newly formed carbon-12 nucleus to temporarily exist in an excited quantum state before it radiated the excess collision energy away.
Otherwise, there would be no carbon, and we would not be here worrying about how carbon came into being. When skeptical nuclear physicists looked, sure enough, they found that carbon-12 had a resonance just 4% above the rest mass energy of helium-4 and beryllium-8 to absorb the collision energy, just as Hoyle had predicted. Now it gets even stranger. It turns out that oxygen-16 also has a resonance which is 1% below the rest mass energy of carbon-12 and helium-4. Because there is no resonance above the rest mass energy of carbon-12 and helium-4 for the oxygen-16 nucleus to absorb the collision energy of carbon-12 and helium-4, it is a rare interaction, and that is why all the carbon-12 in the Universe has not been turned into oxygen-16! If that had happened, again we would not be here worrying about carbon. So something very strange indeed seems to be going on! Here are a few suggested explanations, all of which have about the same amount of supporting evidence:
1. There are an infinite number of universes forming a multiverse and intelligent beings only find themselves in universes capable of supporting intelligent life. This is the explanation that most of the scientific community seems to be gravitating towards, especially the cosmologists and string theorists, because it is beginning to look like you can construct an infinite number of universes out of the vibrating strings and membranes of string theory. See Leonard Susskind’s The Cosmic Landscape: String Theory and the Illusion of Intelligent Design (2005).
2. Lee Smolin’s Darwinian explanation, found in his book The Life of the Cosmos (1997). Universes give birth to new universes in the centers of black holes. It is postulated that universes can pass on a variation of their laws to their children universes, so universes that can produce black holes will produce lots of children that can also produce black holes, and will soon outcompete universes with laws that do not produce lots of black holes. Thus, universes that can produce lots of black holes will come to dominate the multiverse. Since it takes a long time and just the right nuclear chemistry to produce black holes, intelligent life arises as a byproduct of black hole creation.
3. The physical Universe is a computer simulation created by other intelligent beings. See Konrad Zuse’s Calculating Space (1967), Nick Bostrom’s Are You Living in a Computer Simulation? (2002), or Paul Davies' Cosmic Jackpot: Why Our Universe Is Just Right for Life (2007).
4. There really is a supreme being that created the physical laws of the Universe. This is a perfectly acceptable explanation and is favored by about 90% of Americans.
5. There is some other strange explanation that we cannot even imagine. As Sir Arthur Eddington noted, “the universe is not only stranger than we imagine, it is stranger than we can imagine.”
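Before moving on, Hoyle’s 4% figure from above can be roughly checked with a few lines of Python, using standard tabulated nuclear mass excesses (the values below are approximate textbook numbers that I am supplying for illustration; by convention, carbon-12 defines the zero of the mass excess scale):

```python
# Approximate atomic mass excesses in MeV (carbon-12 defines zero).
HE4_MASS_EXCESS = 2.4249           # helium-4
BE8_MASS_EXCESS = 4.9416           # beryllium-8
C12_HOYLE_STATE = 7.6542           # excitation energy of the Hoyle state, MeV

# Combined rest mass energy of Be-8 + He-4, relative to the C-12 ground state:
rest_mass_energy = BE8_MASS_EXCESS + HE4_MASS_EXCESS    # about 7.37 MeV

# The Hoyle resonance sits only slightly above that rest mass energy:
excess = C12_HOYLE_STATE - rest_mass_energy             # about 0.29 MeV
print(f"The resonance lies {100 * excess / rest_mass_energy:.1f}% above "
      f"the rest mass energy of Be-8 + He-4")
```

A difference of roughly 0.3 MeV out of about 7.4 MeV – a remarkably fine margin for the existence of all the carbon in the Universe to hang upon.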
As an 18th century liberal and 20th century conservative, I think it is very important to keep an open mind for all explanations of the Anthropic Principle. I am very much appalled by the political correctness of both 21st century liberals and 21st century conservatives. Between the two, it is impossible these days to carry on a civil discussion of any idea whatsoever. For me, religions are just another set of effective theories that should stand on their own merit. You know me; I try not to believe in things, so I always try to assign a level of confidence to any effective theory, while at the same time not entirely ruling any of them out offhand.
It seems that some conservative religious people have difficulties with some of these issues because they put great store in certain written words. However, I would question if it is even possible to convey absolute truth symbolically, especially in words, since every word in the dictionary is defined in terms of other words in the dictionary. For example, mathematics is a marvelous way to symbolically convey information, but even mathematics has its limitations. As we have seen, all physicists agree upon the mathematics underlying quantum mechanics, but none of them seem to exactly agree on what the mathematics is trying to tell us. That is why we have the Copenhagen interpretation, the Many-Worlds interpretation, the Decoherent Histories interpretation, and John Cramer’s Transactional Interpretation of quantum mechanics. I think the same goes for written words. That is why religions frequently splinter into so many factions. Unfortunately, many wars have been fought over the interpretation of such words. After all, we are all just trying to figure out what this is all about. I suggest that we should try to do this together, with a respect for the opinions of others. You never know, they just might be “right”!
With that said, let’s get back to Darwin. It all began in 1666 when Niels Stensen, also known as Steno, received the carcass of a shark caught off the coast of Livorno in Italy. As Steno examined the shark, he was struck by how similar the shark’s teeth were to “tongue stones,” triangular pieces of rock that had been found in nearby cliffs. Steno reasoned that the local area must have been under water at one time and that the rocks of the cliffs had been deposited as horizontal layers in a shallow sea, and that is how the shark teeth got into the rocks. He also realized that the older rock layers must be near the bottom of the cliffs, while the younger layers were deposited on top of them. In geology, these ideas are now known as Steno’s Laws. Thanks to Steno’s observations, when we look at a layered outcrop along a roadside cut, we now know that the rocks near the bottom of the sequence are older than the rocks near the top. As with most of geology, this may seem like a real “no-brainer”, until you realize that before you heard of this idea, you had probably looked at hundreds of pictures of layered rock formations, but never made the connection yourself. As I mentioned in our discussion of global warming, sea level can change significantly as the Earth’s ice caps expand or contract. Also, plate tectonics can cause the uplift or subsidence of large portions of the Earth’s continental crust, and since much of this crust is very close to being at sea level, the oceans of the Earth can wash over much of the land when sea level rises. Geologists call the sea washing over the land a transgression and, when it recedes, a regression. During a transgression, sandy beach deposits may form which turn into sandstone, or deep water muddy deposits may be laid down to form shale. At intermediate depths, one might find a coral reef forming that ends up becoming limestone. 
During a regression, when the land is once again exposed to the air, some of these sediments will be eroded away. Thus a good geologist can look at an outcrop, and with the sole support of his trusty hand lens to magnify the sedimentary grains, come up with a complicated history of deposition and erosion, like a very good crime scene investigator – quite an impressive feat. The point is that when you look at an outcrop, you will find successive layers of sandstone, shale, and limestone, and each will contain an association of fossils that makes sense. You will find corals in the limestone reefs, broken shells from brachiopods in the sandstone deposits, and graptolite fossils in the deep water shale deposits.
Stratigraphy was further enhanced in the early 19th century, due to the excavation of canals on a large scale. The British engineer William Smith (1769 – 1839) was in charge of excavating the Somersetshire Coal Canal, which required Smith to do a great deal of surveying and mapping of the rock formations along the proposed canal. Smith observed that fossils did not appear at random throughout the stratigraphic column. Instead, he found that certain fossils were always found together in an association, and these associations changed from the bottom to the top of a stratigraphic section, as we just discussed. And this same ordering of fossil assemblages could also be seen in rock sections on the other side of England as well. As Smith described it,
“. . . each stratum contained organized fossils peculiar to itself, and might, in cases otherwise doubtful, be recognised and discriminated from others like it, but in a different part of the series, by examination of them.”
By mapping the succession of fossils found in different rock formations, Smith was able to show that living things appeared and disappeared throughout geologic time. Around the same time, Georges Cuvier and Alexandre Brongniart were mapping the geology of the Paris Basin. Cuvier noticed that the more ancient fossils were, the less they resembled their present day counterparts. Thus the idea that fossils showed change throughout geological time was well accepted by the 1840s.
Look at it this way: suppose you start digging a hole into your local landfill. As you dig into the pile of garbage, you will first come across beer cans with pop-tops that stay affixed to the top of the beer can, now known as stay-tabs in the beer industry. These cans will consist of two pieces, a top piece attached to an extruded bottom which forms both the walls and the bottom of the can as one piece. As you dig down a little deeper, the beer cans with stay-tabs will disappear (1975), and instead, you will only find beer cans with holes in their tops where pull-tabs had been removed. A little deeper, two species of beer cans will appear: one consisting of two pieces, as we have already seen, and a new species consisting of three parts – a top, a cylindrical wall section, and a bottom. As you dig deeper still, the two-piece beer cans will decline and the three-piece beer cans will increase in number. Below the layers of the early 1970s, the two-piece beer cans will totally disappear, and all you will find will be the three-piece species of beer cans. A little deeper yet, and you will find the beer cans with pull-tabs disappear (1963), to be replaced by beer cans with triangular puncture holes in their tops from “churchkey” can openers, and these cans will be made of steel and not aluminum. A little deeper still (1960), the flat-topped beer cans will diverge, and you will again find two species of beer cans: one with flat tops, and the other a strange looking beer can with a cone-shaped top, known in the industry as a cone-top. On rare occasions, you will actually find one of these strange cone-top beer cans with the bottle cap still affixed to the cone-shaped top. These strange beer cans look a lot more like beer bottles than beer cans and were, in fact, run down the same bottling line at breweries as bottles, allowing the breweries to experiment with the newfangled beer cans without having to invest in a new bottling line.
As you dig deeper still, the percentage of the strange cone-top beer cans will increase and the percentage with flat tops will dwindle, until eventually you find layers of garbage from the late 1940s that only contain the strange cone-top beer cans. Still deeper, you will reach a layer of garbage that does not contain any beer cans at all (1935), but you will notice that the number of beer bottles will have increased.
Figure 1 – A cone-top beer can with attached bottle cap from the cone-top period
Now if you dig similar holes into other neighboring landfills, you will always find the same sequence of changes in beer cans as you dig down deeper. In fact, you will find the same sequence in any landfill in the country. Knowing the evolutionary history of beer cans can be of great use in dating landfills. Recent landfills will only contain two-piece beer cans with stay-tabs, while very ancient landfills that were abandoned in the 1950s will only contain steel beer cans with puncture holes, and long-lived landfills will contain the entire sequence of beer cans. Knowing the evolution of beer cans also allows you to correlate the strata of garbage in one landfill with the strata of garbage in other landfills. If you first find pull-tab beer cans 100 feet below the surface in landfill A and 200 feet below the surface at landfill B, you know that these strata of garbage were deposited at the same time. It also allows you to date other garbage. Using beer can chronology, you can tell that hula hoops (1957) first appeared in the late steel beer can period, just before the aluminum pull-tab (1963) period. When I explored for oil, we used the same trick. By looking at the tiny marine fossils that came up with the drill bit cuttings, we could tell the age of the rock as we drilled down through the stratigraphic column, and this allowed us to correlate stratigraphic layers between wells. So now you know the real reason why they call oil a fossil fuel!
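The correlation trick is easy to automate. Here is a little sketch of my own that dates a garbage layer by intersecting the production date ranges of the "index fossils" found in it, just as a biostratigrapher would with marine microfossils (the date ranges follow the beer can chronology above; open-ended ranges simply run to the present day):

```python
# Production date ranges (start, end) for our landfill "index fossils",
# following the beer can chronology in the text. Items still in
# production simply run to the present (2025 here).
RANGES = {
    "cone-top can":       (1935, 1960),
    "flat-top steel can": (1935, 1963),
    "pull-tab can":       (1963, 1975),
    "stay-tab can":       (1975, 2025),
    "hula hoop":          (1957, 2025),
}

def date_layer(artifacts):
    """Date a garbage layer by intersecting the ranges of its artifacts."""
    start = max(RANGES[a][0] for a in artifacts)
    end = min(RANGES[a][1] for a in artifacts)
    if start > end:
        raise ValueError("these artifacts never coexisted - check the dig!")
    return start, end

# A layer containing hula hoops and flat-top steel cans, but no pull-tabs,
# must date from the narrow window where both were being made:
print(date_layer(["hula hoop", "flat-top steel can"]))
```

Each additional artifact can only narrow the window, which is exactly why assemblages of fossils date rock layers so much more precisely than any single fossil can.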
As you dig in landfills across the nation, you will also notice that the garbage in each layer makes sense as an ecological assemblage. You will not find any PCs in the layers from the 1960s, but you will begin to see them in the layers from the 1980s, and in those layers you will find an association of PC chassis, monitors, keyboards, floppy disks, mice, mouse pads, external disk drives, modems, and printers. You will not find these items randomly mixed throughout all the layers of garbage from top to bottom, but you will find that all these items gradually evolve over time as you examine garbage layers from the 1980s and 1990s. This is the same thing that William Smith noted about the fossils he uncovered while excavating the Somersetshire Coal Canal. The fact that both garbage and fossil bearing sediments seem to be laid down in ecological assemblages that make sense leads one to think that these layers were laid down in sequence over a great deal of time, following Steno’s Laws, with the older layers near the bottom and the younger layers on top. Now it is possible that somebody could have laid down 1,000 feet of garbage all at once, in just such a way as to make it appear as though it were laid down over many years, just as somebody could have laid down 30,000 feet of sediment in the Gulf of Mexico to make it appear as though it had been laid down over millions of years, complete with the diapiric salt domes, normal faults, and stratigraphic traps that oil companies look for in their quest for oil and gas. This would, at first, present a great mental challenge for the oil industry, but I guess they could just continue exploring for oil as if the 30,000 feet of sediments had been deposited over many millions of years and still find oil and gas. In science, we generally apply Occam's razor to cut away all the unnecessary assumptions and just go with the simplest explanation.
So what happened? Apparently, glass beer bottles evolved into steel beer cans with cone-shaped tops that could be capped with a standard bottle cap in 1935. These beer cans were lighter than beer bottles and did not have to be returned to the brewery for refilling, a very desirable feature for both consumers and brewers. So the steel beer cans, which looked a lot like glass beer bottles, began to invade the economic niche dominated by glass beer bottles. But stacking cone-top beer cans is not so easy, because the cone-shaped top has the same disadvantage as the cone-shaped tops of glass beer bottles. So one brewery, in the late 1940s, came up with the clever innovation of producing flat-topped beer cans. Of course, this required investment in a new bottling line and a whole new manufacturing process, but the risk proved worthwhile given the competitive advantage of stacking the cans during shipment, storage in warehouses and liquor stores, and even in the customer’s refrigerator. But these flat-top beer cans required the development of a special can opener, now known as a “churchkey” can opener, to puncture the steel can tops. Under competitive pressure throughout the 1950s, more and more breweries were forced to adopt the flat-top beer can, so by 1960 all of the cone-top beer cans had disappeared. The cone-top beer can essentially went extinct. Aluminum was first used for frozen orange juice cans in 1960, and in 1961, Reynolds Metals Co. did a marketing survey that showed that consumers preferred the lightweight aluminum cans over the heavy traditional steel orange juice cans. The lighter aluminum cans also reduced shipping costs, and aluminum cans could be more easily crushed with the customer’s bare hands, enhancing one of the most prevalent mating displays of Homo sapiens.
So breweries began to can beer in aluminum cans in the early 1960s, but the churchkey can openers did not really work very well for the soft aluminum cans, so the pull-tab top was invented in 1963. This caused the churchkey population to crash to near extinction, as is clearly demonstrated in our landfill stratigraphic sections. However, the churchkeys were able to temporarily find refuge in other economic niches, opening things like quart cans of oil, fruit juice cans, and cans of tomato juice, but as those niches also began to disappear, along with their associated churchkey habitat, the population of churchkeys dropped dangerously low. Fortunately, you can still find churchkeys today, thanks to their inclusion on one of the first endangered species lists, established during the environmentally aware 1970s. The same cannot be said for the traditional flat-top beer can without a pull-tab. It went extinct in the early 1960s. In the early 1970s, breweries began to abandon the traditional three-piece aluminum cans for the new two-piece cans, which were easier to fabricate. Now the pull-tab beer cans also had a problem. Customers would frequently drop the pull-tab into the beer can just before taking the first swig of beer. Under rare circumstances, a customer might end up swallowing the pull-tab! The first rule they teach you in the prestigious business schools is not to kill your customer with the product, at least not for a long time. So in 1975, the brewing industry came out with the modern two-piece beer can with a stay-tab top.
In our exploration of landfills we had one advantage over the early geologists of the 19th century. Mixed in with the garbage and the beer cans we also found discarded copies of Scientific American, the longest-running magazine in the United States, with publication dates on them that allowed us to know the absolute date that each layer of garbage was laid down, and this allowed us to assign absolute time ranges to each age of beer can. Using the Scientific American issues, we could tell when each type of beer can first appeared and when it went extinct. This same windfall happened for geologists early in the 20th century, when physicists working with radioactive elements discovered that radioactive atoms decay with a specific half-life. For example, uranium-238 has a half-life of 4.5 billion years. This means that if you have a pound of uranium-238 today, in 4.5 billion years you will have ½ pound and in 9.0 billion years ¼ pound, with the remainder having turned into lead. So by measuring the ratio of uranium-238 to lead in little crystals of zircon in a sample of volcanic rock, you can determine when it solidified. To nail down the relative ages of rocks in a stratigraphic section, you look for outcrops with periodic volcanic deposits that have sedimentary layers between them. This allows you to bracket the age of the sandwiched sedimentary layers and the fossils within them. Once you have a date range for the fossils, you can then date outcrops that do not have volcanic deposits by using the fossils alone.
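The decay arithmetic above is just repeated halving, and inverting it gives you the age of the rock. Here is a quick sketch in Python; the function names are mine, and the age calculation makes the simplifying assumption that all the lead in the zircon came from uranium-238 decay:

```python
import math

HALF_LIFE_U238 = 4.5e9  # years

def fraction_remaining(elapsed_years, half_life=HALF_LIFE_U238):
    """Fraction of a radioactive sample left after elapsed_years:
    N(t)/N0 = (1/2) ** (t / half_life)."""
    return 0.5 ** (elapsed_years / half_life)

def age_from_ratio(lead_to_uranium, half_life=HALF_LIFE_U238):
    """Invert the decay law: a zircon with equal parts lead and
    uranium-238 is exactly one half-life old (simplified model that
    assumes all the lead came from uranium-238 decay)."""
    return half_life * math.log2(1.0 + lead_to_uranium)

print(fraction_remaining(4.5e9))  # 0.5  -> 1/2 pound of the original pound
print(fraction_remaining(9.0e9))  # 0.25 -> 1/4 pound
print(age_from_ratio(1.0))        # 4.5 billion years: one half-life
```

A pound of uranium-238 thus becomes a half pound after one half-life and a quarter pound after two, exactly as described above.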
High School Biology From an IT Perspective
Now let’s see how Darwin’s theory of evolution and entropy dumping led to the complex biosphere we see today and its corresponding equivalent in the Software Universe. Of course, I cannot cover all of high school biology in one posting, but I will try to focus on some important biological themes from an IT perspective that should prove helpful to IT professionals. All living things are composed of cells, and as we shall see, cells are really little nanotechnology factories that build things one molecule at a time and require huge amounts of information to do so, information that is processed in parallel at tremendous transaction rates.
Like the physical Universe, the Software Universe is also populated by living things. In my youth, we called these living things "computer systems", but today we call them "Applications". The Applications exist by exchanging information with each other, and sadly, are parasitized by viruses and worms and must also struggle with the second law of thermodynamics and nonlinearity. Since the beginning of the Software Universe, the architecture of the Applications has evolved through a process of innovation and natural selection that has followed a path very similar to the path followed by living things on Earth. I believe this has been due to what evolutionary biologists call convergence. For example, as Richard Dawkins has pointed out, the surface of the Earth is awash in a sea of visible photons, and the concept of the eye has independently evolved more than 40 times on Earth over the past 600 million years to take advantage of them. An excellent treatment of the significance that convergence has played in the evolutionary history of life on Earth, and possibly beyond, can be found in Life’s Solution (2003) by Simon Conway Morris. Programmers and living things both have to deal with the second law of thermodynamics and nonlinearity, and there are only a few optimal solutions. Programmers try new development techniques, and the successful techniques tend to survive and spread throughout the IT community, while the less successful techniques are slowly discarded. Over time, the population distribution of software techniques changes.
As with the evolution of living things on Earth, the evolution of software has been greatly affected by the physical environment, or hardware, upon which it ran. Just as the Earth has not always been as it is today, the same goes for computing hardware. The evolution of software has been primarily affected by two things - CPU speed and memory size. As I mentioned in So You Want To Be A Computer Scientist?, the speed and memory size of computers have both increased by about a factor of a billion since Konrad Zuse built the Z3 in the spring of 1941, and the rapid advances in both and the dramatic drop in their costs have shaped the evolutionary history of software greatly.
The two major environmental factors affecting the evolution of living things on Earth have been the amount of solar energy arriving from the Sun and the atmospheric gases surrounding the Earth that hold that energy in. The size and distribution of the Earth’s continents and oceans have also had an influence on the Earth’s overall environmental characteristics, as the continents shuffle around the surface of the Earth, responding to the forces of plate tectonics. For example, billions of years ago the Sun was actually less bright than it is today. Our Sun is a star on the main sequence that is using the proton-proton reaction and the carbon-nitrogen-oxygen cycle in its core to turn hydrogen into helium-4, and consequently, turn matter into energy that is later radiated away from its surface, ultimately reaching the Earth. As a main-sequence star ages, it begins to shift from the proton-proton reaction to relying more on the carbon-nitrogen-oxygen cycle, which runs at a higher temperature. Thus, as a main-sequence star ages, its core heats up and it begins to radiate more energy at its surface. In fact, the Sun radiates about 30% more energy today than it did about 4.5 billion years ago, when it first formed and entered the main sequence. This increase in the Sun’s radiance has been offset by a corresponding drop in greenhouse gases, like carbon dioxide, over this same period of time; otherwise, the Earth’s oceans would have vaporized long ago, and the Earth would now have a climate more like Venus, which has a surface temperature hot enough to melt lead. Using some simple physics, you can quickly calculate that if the Earth did not have an atmosphere containing greenhouse gases like carbon dioxide, the surface of the Earth would on average be 27 °F cooler today and totally covered by ice.
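The “simple physics” here is the Stefan-Boltzmann radiative balance: the sunlight a greenhouse-free Earth absorbs, S(1 - a)/4, must equal the σT⁴ it radiates back to space. Below is a minimal sketch in Python; the solar constant and 0.3 albedo are standard textbook values, not figures from this posting, and the exact number of degrees of greenhouse warming you infer depends on the albedo you assume:

```python
def effective_temperature_k(solar_constant=1361.0, albedo=0.3):
    """Equilibrium temperature of an Earth with no greenhouse gases:
    absorbed sunlight S*(1 - a)/4 balances radiated sigma*T**4."""
    sigma = 5.670e-8  # Stefan-Boltzmann constant, W m^-2 K^-4
    return (solar_constant * (1.0 - albedo) / (4.0 * sigma)) ** 0.25

t_kelvin = effective_temperature_k()
t_fahrenheit = (t_kelvin - 273.15) * 9.0 / 5.0 + 32.0
print(t_kelvin, t_fahrenheit)  # roughly 255 K, well below freezing
```

With these assumptions the airless Earth sits near 255 K, far colder than today’s average surface temperature, which is the point of the greenhouse argument above.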
Thankfully, there has been a long-term decrease in the amount of carbon dioxide in the Earth’s atmosphere, principally caused by living things extracting carbon dioxide from the air to make carbon-based organic molecules, which later get deposited into sedimentary rocks that plunge back into the Earth at the many subduction zones around the world that result from plate tectonic activity. For example, hundreds of millions of years ago, the Earth’s atmosphere contained about 10 - 20 times as much carbon dioxide as it does today. So greenhouse gases like carbon dioxide play a critical role in keeping the Earth’s climate in balance and suitable for life.
The third factor that has greatly affected the course of evolution on Earth has been the occurrence of periodic mass extinctions. In our landfill exercise above, we saw how the extinction of certain beer can species could be used to mark the stratigraphic sections of landfills. Similarly, in 1860, John Philips, an English geologist, recognized three major geological eras based upon dramatic changes in fossils brought about by two mass extinctions. He called the eras the Paleozoic (Old Life), the Mesozoic (Middle Life), and the Cenozoic (New Life), defined by mass extinctions at the Paleozoic-Mesozoic and Mesozoic-Cenozoic boundaries:
Cenozoic 65 my – present
=================== <= Mass Extinction
Mesozoic 250 my – 65 my
=================== <= Mass Extinction
Paleozoic 541 my – 250 my
Of course, John Philips knew nothing of radiometric dating in 1860, so these geological eras only provided a means of relative dating of rock strata based upon fossil content, like our beer cans in the landfills. The absolute date ranges only came later in the 20th century, with the advent of radiometric dating of volcanic rock layers found between the layers of fossil-bearing sedimentary rock. It is now known that we have actually had five major mass extinctions since multicellular life first began to flourish about 541 million years ago, and there have been several lesser extinction events as well. The three geological eras have been further subdivided into geological periods, like the Cambrian at the base of the Paleozoic, the Permian at the top of the Paleozoic, the Triassic at the base of the Mesozoic, and the Cretaceous at the top of the Mesozoic. Figure 2 shows an “Expand All” of the current geological time scale now in use. Notice how small the Phanerozoic is, the eon comprising the Paleozoic, Mesozoic, and Cenozoic eras in which complex plant and animal life are found. Indeed, the Phanerozoic represents only the last 12% of the Earth’s history – the first 88% of the Earth’s history was dominated by simple single-celled forms of life like bacteria.
Figure 2 – The geological time scale (click to enlarge)
Currently, it is thought that these mass extinctions arise from two different sources. One type of mass extinction is caused by the impact of a large comet or asteroid and is familiar to the general public from the Cretaceous-Tertiary (K-T) mass extinction that wiped out the dinosaurs at the Mesozoic-Cenozoic boundary 65 million years ago. An impacting mass extinction is characterized by a rapid extinction of species followed by a correspondingly rapid recovery in a matter of a few million years. An impacting mass extinction is like turning off a light switch. Up until the day the impactor hits the Earth, everything is fine and the Earth has a rich biosphere. After the impactor hits the Earth, the light switch turns off and there is a dramatic loss of species diversity. However, the effects of the incoming comet or asteroid are geologically brief and the Earth’s environment returns to normal in a few decades or less, so within a few million years or so, new species rapidly evolve to replace those that were lost.
The other kind of mass extinction is thought to arise from an overabundance of greenhouse gases and a dramatic drop in oxygen levels, and is typified by the Permian-Triassic (P-T) mass extinction at the Paleozoic-Mesozoic boundary 250 million years ago. Greenhouse extinctions are thought to be caused by periodic flood basalts, like the Siberian Traps flood basalt of the late Permian. A flood basalt begins as a huge plume of magma several hundred miles below the surface of the Earth. The plume slowly rises and eventually breaks the surface of the Earth, forming a huge flood basalt that spills basaltic lava over an area of millions of square miles to a depth of several miles. Huge quantities of carbon dioxide bubble out of the magma over a period of several hundred thousand years and greatly increase the ability of the Earth’s atmosphere to trap heat from the Sun. For example, during the Permian-Triassic mass extinction, carbon dioxide levels may have reached a level as high as 3,000 ppm, much higher than the current 385 ppm. Most of the Earth warms to tropical levels with little temperature difference between the equator and the poles. This shuts down the thermohaline conveyor that drives the ocean currents. Currently, the thermohaline conveyor begins in the North Atlantic, where high winds and cold polar air reduce the temperature of ocean water through evaporation and concentrate its salinity, making the water very dense. The dense North Atlantic water, with lots of dissolved oxygen, then descends to the ocean depths and slowly winds its way around the entire Earth, until it ends up back on the surface in the North Atlantic several thousand years later. When this thermohaline conveyor stops for an extended period of time, the water at the bottom of the oceans is no longer supplied with oxygen, and only bacteria that can survive on sulfur compounds manage to survive in the anoxic conditions.
These sulfur-loving bacteria metabolize sulfur compounds to produce large quantities of highly toxic hydrogen sulfide gas, the distinctive component of the highly repulsive odor of rotten eggs, which has a severely damaging effect on both marine and terrestrial species. The hydrogen sulfide gas also erodes the ozone layer, while oxygen drops to a suffocating low of 12% of the atmosphere, compared to the current level of 21%, allowing damaging ultraviolet light to reach the Earth’s surface and beat down relentlessly upon animal life already gasping for breath in an oxygen-poor atmosphere, even at sea level, destroying its DNA. The combination of severe climate change, changes to atmospheric and oceanic oxygen levels and temperatures, the toxic effects of hydrogen sulfide gas, and the loss of the ozone layer causes a slow extinction of many species over a period of several hundred thousand years. And unlike an impacting mass extinction, a greenhouse mass extinction does not quickly reverse itself, but persists for millions of years until the high levels of carbon dioxide are flushed from the atmosphere and oxygen levels rise. In the stratigraphic section, this is seen as a thick section of rock with decreasing numbers of fossils and decreasing fossil diversity leading up to the mass extinction, and a thick layer of rock above the mass extinction level with very few fossils at all, representing the long recovery period of millions of years required to return the Earth’s environment to a more normal state. There are a few good books by Peter Ward that describe these mass extinctions more fully: The Life and Death of Planet Earth (2003), Gorgon – Paleontology, Obsession, and the Greatest Catastrophe in Earth’s History (2004), and Out of Thin Air (2006).
There is also the disturbing Under a Green Sky (2007), which posits that we might be initiating a human-induced greenhouse gas mass extinction by burning up all the fossil fuels that have been laid down over hundreds of millions of years in the Earth’s strata. In the last portion of Software Chaos, I described how you as an IT professional can help avert such a disaster.
In my next posting on Self-Replicating Information, we shall see that ideas also evolve with time. Over the past 30 years there has been a paradigm shift in paleontology, beginning with the discovery in 1980 by Luis Alvarez of a thin layer of iridium-rich clay at the Cretaceous-Tertiary (K-T) mass extinction boundary, a layer which has since been confirmed in deposits throughout the world. This discovery, along with the presence of shocked quartz grains in these same layers, convinced the paleontological community that the K-T mass extinction was the result of an asteroid or comet strike upon the Earth. Prior to the Alvarez discovery, most paleontologists thought that mass extinctions resulted from slow environmental changes that occurred over many millions of years. Within the past 10 years, there has been a similar shift in thinking for the mass extinctions that are not caused by impactors. Rather than ramping up over many millions of years, the greenhouse extinctions seem to unfold in a few hundred thousand years or less, which is a snap of the fingers in geological time.
In James Hutton’s Theory of the Earth (1785) and Charles Lyell’s Principles of Geology (1830), the principle of uniformitarianism was laid down. Uniformitarianism is a geological principle that states that the “present is key to the past”. If you want to figure out how a 100 million-year-old cross-bedded sandstone came to be, just dig into a point bar on a modern day river and take a look. Uniformitarianism contends that the Earth has been shaped by slow-acting geological processes that can still be observed at work today. Uniformitarianism replaced the catastrophism of the 18th century, which proposed that the geological structures of the Earth were caused by short-term catastrophic events like Noah’s flood. In fact, the names for the Tertiary and Quaternary geological periods actually come from those days! In the 18th century, it was thought that the water from Noah’s flood receded in four stages - Primary, Secondary, Tertiary and Quaternary, and each stage laid down different kinds of rock as it withdrew. Now since most paleontologists are really geologists who have specialized in studying fossils, the idea of uniformitarianism unconsciously crept into paleontology as well. Because uniformitarianism proposed that the rock formations of the Earth slowly changed over immense periods of time, so too must the Earth’s biosphere have slowly changed over long periods of time, and therefore, the mass extinctions must have been caused by slow-acting environmental changes occurring over many millions of years.
But now we have come full circle. Yes, uniformitarianism may be very good for describing the slow evolution of hard-as-nails rocks, but maybe not so good for the evolution of squishy living things that are much more sensitive to things like asteroid strikes or greenhouse gas emissions that mess with the Earth’s climate over geologically brief periods of time. Uniformitarianism may be the general rule for the biosphere, as Darwin’s mechanisms of innovation and natural selection slowly work upon the creatures of the Earth. But every 100 million years or so, something goes dreadfully wrong with the Earth’s climate and environment, and Darwin’s process of natural selection comes down hard upon the entire biosphere, winnowing out perhaps 70% - 90% of the species on Earth that cannot deal with the new geologically temporary conditions. This causes dramatic evolutionary effects. For example, the Permian-Triassic (P-T) mass extinction cleared the way for the surviving reptiles to evolve into the dinosaurs that ruled the Mesozoic, and the Cretaceous-Tertiary (K-T) mass extinction did the same for the rodent-like mammals that went on to conquer the Cenozoic, ultimately producing a species capable of producing software.
Similarly, the evolutionary history of software over the past 2.1 billion seconds (68 years) has also been greatly affected by a series of mass extinctions, which allow us to also subdivide the evolutionary history of software into several long computing eras, like the geological eras listed above. As with the evolution of the biosphere over the past 541 million years, we shall see that these mass extinctions of software have also been caused by several catastrophic events in IT that were separated by long periods of slow software evolution through uniformitarianism.
Unstructured Period (1941 – 1972)
During the Unstructured Period, programs were simple monolithic structures with lots of GOTO statements, no subroutines, no indentation of code, and very few comment statements. The machine code programs of the 1940s evolved into the assembler programs of the 1950s and the compiled programs of the 1960s, with FORTRAN appearing in 1956 and COBOL in 1958. These programs were very similar to the early prokaryotic bacteria that appeared over 4,000 million years ago on Earth and lacked internal structure. Bacteria essentially consist of a tough outer cell wall enclosing an inner cell membrane and contain a minimum of internal structure. The cell wall is composed of a tough molecule called peptidoglycan, which is composed of tightly bound amino sugars and amino acids. The cell membrane is composed of phospholipids and proteins, which will be described later in this posting. The DNA within bacteria generally floats freely as a large loop of DNA, and their ribosomes, used to help transcribe DNA into proteins, float freely as well and are not attached to membranes called the rough endoplasmic reticulum. The chief advantage of bacteria is their simple design and ability to thrive and rapidly reproduce even in very challenging environments, like little AK-47s that still manage to work in environments where modern tanks fail. Just as bacteria still flourish today, some unstructured programs are still in production.
Figure 3 – A simple prokaryotic bacterium with little internal structure (click to enlarge)
Below is a code snippet from a fossil FORTRAN program listed in a book published in 1969 showing little internal structure. Notice the use of GOTO statements to skip around in the code. Later this would become known as the infamous “spaghetti code” of the Unstructured Period that was such a joy to support.
30 DO 50 I=1,NPTS
31 IF (MODE) 32, 37, 39
32 IF (Y(I)) 35, 37, 33
33 WEIGHT(I) = 1. / Y(I)
GO TO 41
35 WEIGHT(I) = 1. / (-1*Y(I))
GO TO 41
37 WEIGHT(I) = 1.
GO TO 41
39 WEIGHT(I) = 1. / SIGMA(I)**2
41 SUM = SUM + WEIGHT(I)
YMEAN = WEIGHT(I) * FCTN(X, I, J, M)
DO 44 J = 1, NTERMS
44 XMEAN(J) = XMEAN(J) + WEIGHT(I) * FCTN(X, I, J, M)
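For readers who never had the pleasure, FORTRAN’s three-way arithmetic IF, `IF (expr) a, b, c`, jumps to label a, b, or c depending on whether expr is negative, zero, or positive. Untangled into a modern language, the weighting logic above looks roughly like this Python sketch; the interpretation of MODE as selecting statistical, uniform, or instrumental weighting is inferred from the code itself, so treat it as an assumption:

```python
def weights(y, sigma, mode):
    """Untangled sketch of the fossil FORTRAN weighting loop.

    mode < 0:  statistical weighting, WEIGHT(I) = 1/|Y(I)| (1 if Y is 0)
    mode == 0: uniform weighting, WEIGHT(I) = 1
    mode > 0:  instrumental weighting, WEIGHT(I) = 1/SIGMA(I)**2
    """
    weight = []
    total = 0.0
    for yi, si in zip(y, sigma):
        if mode < 0:                    # label 32: branch on sign of Y(I)
            weight.append(1.0 / abs(yi) if yi != 0.0 else 1.0)
        elif mode == 0:                 # label 37: uniform weighting
            weight.append(1.0)
        else:                           # label 39: instrumental weighting
            weight.append(1.0 / si ** 2)
        total += weight[-1]             # label 41: SUM = SUM + WEIGHT(I)
    return weight, total
```

The structured version needs no GOTOs at all, which is exactly the point the Structured Period was about to make.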
The primitive nature of software in the Unstructured Period was largely due to the primitive nature of the hardware upon which it ran. Figure 4 shows an IBM System/360 from 1964 – notice the operator at the teletype feeding commands to the nearby operator console, the distant tape drives, and the punch card reader in the mid-ground. Such a machine had about 1 MB of memory, less than 1/1000 of the memory of a current $500 PC, and a matching anemic processing speed. For non-IT readers let me remind all that:
1 K = 1 kilobyte = 2^10 = 1024 bytes or about 1,000 bytes
1 MB = 1 megabyte = 1024 x 1024 = 1,048,576 bytes or about a million bytes
1 GB = 1 gigabyte = 1024 x 1024 x 1024 = 1,073,741,824 bytes or about a billion bytes
One byte of memory can store one ASCII text character like an “A” and two bytes can store a small integer in the range of -32,768 to +32,767. When I first started programming in 1972 we thought in terms of kilobytes, then megabytes, and now gigabytes. Data warehousing people think in terms of terabytes - 1 TB = 1024 GB.
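These powers of two are easy to check for yourself; a trivial Python sketch of the unit arithmetic above (the variable names are mine):

```python
KB = 2 ** 10  # 1 kilobyte = 1,024 bytes
MB = 2 ** 20  # 1 megabyte = 1,048,576 bytes
GB = 2 ** 30  # 1 gigabyte = 1,073,741,824 bytes
TB = 2 ** 40  # 1 terabyte = 1,024 GB

# Two bytes hold a signed 16-bit integer, spanning -2**15 .. 2**15 - 1:
INT16_MIN, INT16_MAX = -2 ** 15, 2 ** 15 - 1
print(MB, GB, INT16_MIN, INT16_MAX)  # 1048576 1073741824 -32768 32767
```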
Software was input via punched cards and the output was printed on fan-fold paper. Compiled code could be stored on tape or very expensive disk drives if you could afford them, but any changes to code were always made via punched cards, and because you were only allowed perhaps 128K – 256K of memory for your job, programs had to be relatively small, so simple unstructured code ruled the day. Like the life cycle of a single-celled bacterium, the compiled and linked code for your program was loaded into the memory of the computer at execution time and did its thing in a batch mode, until it completed successfully or abended and died. At the end of the run, the computer’s memory was released for the next program to be run and your program ceased to exist.
However, one should not discount the great advances that were made by the early bacteria billions of years ago or by the unstructured code from the computer systems of the 1950s and 1960s. These were both very important formative periods in the evolution of life and of software on Earth, and examples of both can still be found in great quantities today. For example, it is estimated that about 50% of the Earth’s biomass is still composed of simple bacteria. Your body consists of about 100 trillion cells, but you also harbor about 10 times that number of bacterial cells that are in a parasitic/symbiotic relationship with the “other” cells of your body and perform many of the necessary biochemical functions required to keep you alive, such as aiding with the digestion of food. Your gut contains about 3.5 pounds of active bacteria and about 50% of the dry weight of your feces is bacteria, so in reality, we are all composed of about 90% bacteria with only 10% of our cells being “normal” cells.
All of the fundamental biochemical pathways used by living things to create large complex organic molecules from smaller monomers, or to break those large organic molecules back down into simple monomers, were first developed by bacteria billions of years ago. For example, bacteria were the first forms of life to develop the biochemical pathways that turn carbon dioxide, water, and the nitrogen in the air into the organic molecules necessary for life – sugars, lipids, amino acids, and the nucleotides that form RNA and DNA. They also developed the biochemical pathways to replicate DNA and transcribe DNA into proteins, and to form complex structures such as cell walls and cell membranes from sugars, amino acids, proteins, and phospholipids. Additionally, bacteria invented the Krebs cycle to break these large macromolecules back down into monomers for reuse and to release and store energy by transforming ADP into ATP. To expand upon this, we will see in Software Symbiogenesis how Lynn Margulis has proposed that all the innovations of large macroscopic forms of life have actually been acquired from the highly productive experiments of bacterial life forms.
Similarly, all of the fundamental coding techniques of IT at the line-of-code level were first developed in the Unstructured Period of the 1950s and 1960s, such as the use of complex variable names, arrays, nested loops, loop counters, if-then-else logic, list processing with pointers, I/O blocking, bubble sorts, etc. Now that I am in Middleware Operations, I do not do much coding anymore. However, I do write a large number of Unix shell scripts to help make my job easier. These Unix shell scripts are small unstructured programs in the range of 10 – 50 lines of code, and although they are quite primitive and easy to write, they have a huge economic pay-off for me. Many times, a simple 20-line Unix shell script that took less than an hour to write will provide as much value to me as the code behind the IBM Websphere Console, which I imagine probably cost IBM $10 - $100 million to develop and comes to several hundred thousand lines of code. So if you add up all the little unstructured Unix shell scripts, DOS .bat files, edit macros, Excel spreadsheet macros, Word macros, etc., I bet that at least 50% of the software in the Software Universe is still unstructured code.
Figure 4 – An IBM System/360 mainframe from 1964 (click to enlarge)
Figure 5 – A punch card from the Unstructured Period (click to enlarge)
Structured Period (1972 – 1992)
The increasing availability of computers with more memory and faster CPUs allowed for much larger programs to be written in the 1970s, but unstructured code became much harder to maintain as it grew in size, so the need for internal structure became readily apparent. Plus, around this time code began to be entered via terminals using full-screen editors, rather than on punched cards, which made it easier to view larger sections of code as you changed it.
Figure 6 – A mainframe with IBM 3278 CRT terminals attached (click to enlarge)
In 1972, Dahl, Dijkstra, and Hoare published Structured Programming, in which they suggested that computer programs should have complex internal structure with no GOTO statements, lots of subroutines, indented code, and many comment statements. During the Structured Period, these structured programming techniques were adopted by the IT community, and the GOTO statements were replaced by subroutines, also known as functions(), and indented code with lots of internal structure, like the eukaryotic structure of modern cells that appeared about 1,500 million years ago. Eukaryotic cells are found in the bodies of all complex organisms from single cell yeasts to you and me and divide up cell functions amongst a collection of organelles (subroutines), such as mitochondria, chloroplasts, Golgi bodies, and the endoplasmic reticulum.
Figures 3 and 7 compare the simple internal structure of a typical prokaryotic bacterium with the internal structure of eukaryotic plant and animal cells. These eukaryotic cells could be simple single-celled plants and animals or they could be found within a much larger multicellular organism consisting of trillions of eukaryotic cells. Figures 3 and 7 are a bit deceiving, in that eukaryotic cells are huge cells that are more than 20 times larger in diameter than a typical prokaryotic bacterium with about 10,000 times the volume. Because eukaryotic cells are so large, they have an internal cytoskeleton, composed of linear shaped proteins that form filaments that act like a collection of tent poles, to hold up the huge cell membrane encircling the cell.
Eukaryotic cells also have a great deal of internal structure, in the form of organelles, that are enclosed by internal cell membranes. Like the structured programs of the 1970s and 1980s, eukaryotic cells divide up functions amongst these organelles. These organelles include the nucleus to store and process the genes stored in DNA, mitochondria to perform the Krebs cycle to create ATP from carbohydrates, and chloroplasts in plants to produce energy rich carbohydrates from water, carbon dioxide, and sunlight.
Figure 7 – Plants and animals are composed of eukaryotic cells with much internal structure (click to enlarge)
The introduction of structured programming techniques in the early 1970s caused a mass extinction of unstructured programs, similar to the Permian-Triassic (P-T) mass extinction, or the Great Dying, 250 million years ago, which divides the Paleozoic from the Mesozoic in the stratigraphic column and resulted in the extinction of about 90% of the species on Earth. As programmers began to write new code using the new structured programming paradigm, older code that was too difficult to rewrite in a structured manner remained as legacy “spaghetti code” that slowly fossilized over time in production. Like the Permian-Triassic (P-T) mass extinction, the mass extinction of unstructured code in the 1970s was more like a greenhouse mass extinction than an impactor mass extinction because it spanned nearly an entire decade, but it was also a rather thorough mass extinction that wiped out nearly all unstructured code.
Below is a code snippet from a fossil COBOL program listed in a book published in 1975. Notice the structured programming use of indented code and calls to subroutines with PERFORM statements.
OPEN INPUT FILE-1, FILE-2.
PERFORM MATCH-CHECK UNTIL ACCT-NO OF REC-1 = HIGH-VALUES.
CLOSE FILE-1, FILE-2.

IF ACCT-NO OF REC-1 < ACCT-NO OF REC-2
IF ACCT-NO OF REC-1 > ACCT-NO OF REC-2
    DISPLAY REC-2, 'NO MATCHING ACCT-NO'
    PERFORM READ-FILE-2-RTN UNTIL ACCT-NO OF REC-1
        NOT EQUAL TO ACCT-NO OF REC-2
When I encountered my very first structured FORTRAN program in 1975, I diligently “fixed” the program by removing all the code indentations! You see, in those days we rarely saw the entire program on a line printer listing, because producing a listing required a compile of the program and wasted valuable computer time, which was quite expensive back then. When I provided an estimate for a new system back then, I figured 25% for programming manpower, 25% for overhead charges from other IT groups on the project, and 50% for compiles. So instead of working with a listing of the program, we generally flipped through the card deck of the program to do debugging. Viewing indented code in a card deck can give you a real headache, so I just “fixed” the program by making sure all the code started in column 7 of the punch cards, as it should!
Object-Oriented Period (1992 – Present)
During the Object-Oriented Period, programmers adopted a multicellular organization for software, in which programs consisted of many instances of objects (cells) that were surrounded by membranes studded with exposed methods (membrane receptors).
The following discussion might be a little hard to follow for readers with a biological background but little IT experience, so let me define a few key IT concepts along with their biological equivalents.
Class – Think of a class as a cell type. For example, the class Customer defines the cell type Customer and describes how to store and manipulate the data for a Customer, like firstName, lastName, address, and accountBalance. A program might, for instance, instantiate a Customer object called “steveJohnston”.
Object – Think of an object as a cell. A particular object will be an instance of a class. For example, the object steveJohnston might be an instance of the class Customer, and will contain all the information about my particular account with a corporation. At any given time, there could be many millions of Customer objects bouncing around in the IT infrastructure of a major corporation’s website.
Instance – An instance is a particular object of a class. For example, the steveJohnston object would be a particular instance of the class Customer, just as a particular red blood cell would be a particular instance of the cell type RedBloodCell. Many times programmers will say things like “This instantiates the Customer class”, meaning it creates objects (cells) of the Customer class (cell type).
Method – Think of a method() as a biochemical pathway. It is a series of programming steps, or “lines of code”, that produce a macroscopic change in the state of an object (cell). The class for each type of object defines the data for the class, like firstName, lastName, address, and accountBalance, but it also defines the methods() that operate upon these data elements. Some methods() are public, while others are private. A public method() is like a receptor on the cell membrane of an object (cell). Other objects (cells) can send a message to the public methods() of an object (cell) to cause it to execute a biochemical pathway within the object (cell). For example, steveJohnston.setFirstName(“Steve”) would send a message to the steveJohnston object instance (cell) of the Customer class (cell type) to have it execute its setFirstName() method to change the firstName of the object to “Steve”. The steveJohnston.getAccountBalance() method would return my current account balance with the corporation. Objects also have many internal private methods() that are biochemical pathways not exposed to the outside world. For example, the calculateAccountBalance() method could be an internal method that adds up all of my debits and credits and updates the accountBalance data element within the steveJohnston object, but this method cannot be called by other objects (cells) outside of the steveJohnston object (cell). External objects (cells) have to call steveJohnston.getAccountBalance() to find out my accountBalance.
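These concepts can be pulled together in a short Java sketch of the Customer class (cell type) described above. This is a minimal illustrative sketch, not production code; the names follow the text, and the empty body of calculateAccountBalance() merely stands in for the real debit-and-credit arithmetic.

```java
// A minimal sketch of the Customer class (cell type) described above.
// All names follow the text; the internals are illustrative only.
class Customer {
    // Private data elements - the internal chemistry of the cell,
    // untouchable by other objects (cells)
    private String firstName;
    private String lastName;
    private double accountBalance;

    // Public methods() - receptors exposed on the cell membrane
    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getFirstName() {
        return firstName;
    }

    public double getAccountBalance() {
        calculateAccountBalance();  // run the internal biochemical pathway
        return accountBalance;
    }

    // Private method() - an internal biochemical pathway that outside
    // objects (cells) cannot call directly
    private void calculateAccountBalance() {
        // In a real application this would add up all debits and credits
        // and update accountBalance; here it is left empty.
    }
}
```

A program could then instantiate an object (cell) with new Customer() and send it a message such as setFirstName("Steve").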
Line of Code – This is a single statement in a method() like:
discountedTotalCost = (totalHours * ratePerHour) - costOfNormalOffset;
Remember methods() are the equivalent of biochemical pathways and are composed of many lines of code, so each line of code is like a single step in a biochemical pathway. Similarly, each character in a line of code can be thought of as an atom, and each variable as an organic molecule. Each character can be in one of 256 ASCII quantum states defined by 8 quantized bits, with each bit in one of two quantum states “1” or “0”, which can also be characterized as 8 electrons in a spin up ↑ or spin down ↓ state.
C = 01000011 = ↓ ↑ ↓ ↓ ↓ ↓ ↑ ↑
H = 01001000 = ↓ ↑ ↓ ↓ ↑ ↓ ↓ ↓
N = 01001110 = ↓ ↑ ↓ ↓ ↑ ↑ ↑ ↓
O = 01001111 = ↓ ↑ ↓ ↓ ↑ ↑ ↑ ↑
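The bit patterns above can be reproduced with a few lines of Java. A small sketch (the class name AsciiBits is just an illustrative choice):

```java
// Sketch: decompose ASCII characters into their 8 quantized bits,
// reproducing the patterns shown above ("1" = spin up, "0" = spin down).
class AsciiBits {
    // Return the 8-bit binary string for a single ASCII character
    static String bits(char c) {
        String b = Integer.toBinaryString(c);
        // Left-pad with zeros to a full 8 bits
        return "00000000".substring(b.length()) + b;
    }

    public static void main(String[] args) {
        for (char c : new char[] {'C', 'H', 'N', 'O'}) {
            System.out.println(c + " = " + bits(c));  // C = 01000011, etc.
        }
    }
}
```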
Developers (programmers) have to assemble characters (atoms) into organic molecules (variables) to form the lines of code that define a method() (biochemical pathway). As in carbon-based biology, the slightest error in a method() can cause drastic and usually fatal consequences. Because there is nearly an infinite number of ways of writing code incorrectly and only a very few ways of writing code correctly, there is an equivalent of the second law of thermodynamics at work. This simulated second law of thermodynamics and the very nonlinear macroscopic effects that arise from small coding errors is why software architecture has converged upon Life’s Solution. With these concepts in place, we can now proceed with our comparison of the evolution of software and carbon-based life on Earth.
Object-oriented programming actually started in the 1960s with Simula, the first language to use the concept of merging data and functions into objects defined by classes, but object-oriented programming did not really catch on until nearly 30 years later:
1962 - 1965 Dahl and Nygaard develop the Simula language
1972 - Smalltalk language developed
1983 - 1985 Stroustrup develops C++
1995 - Sun announces Java at SunWorld ’95
Similarly, multicellular organisms first appeared about 900 million years ago, but it took about another 400 million years, until the Cambrian, for multicellularity to catch on as well. Multicellular organisms consist of huge numbers of cells which send messages between cells (objects) by secreting organic molecules that bind to the membrane receptors on other cells and induce those cells to execute exposed methods. For example, your body consists of about 100 trillion independently acting eukaryotic cells, and not a single cell in the collection knows that the other cells even exist. In an object-oriented manner, each cell just responds to the organic molecules that bind to its membrane receptors, and in turn, sends out its own set of chemical messages that bind to the membrane receptors of other cells in your body. When you wake to the sound of breaking glass in the middle of the night, your adrenal glands secrete the hormone adrenaline (epinephrine) into your bloodstream, which binds to the getScared() receptors on many of your cells. In an act of object-oriented polymorphism, your liver cells secrete glucose into your bloodstream, and your heart cells contract harder, when their getScared() methods are called.
Figure 8 – Multicellular organisms consist of a large number of eukaryotic cells, or objects, all working together (click to enlarge)
These object-oriented languages use the concepts of encapsulation, inheritance, and polymorphism, which are very similar to the multicellular architecture of large organisms.
Objects are contiguous locations in memory that are surrounded by a virtual membrane that cannot be penetrated by other code and are similar to an individual cell in a multicellular organism. The internal contents of an object can only be changed via exposed methods (like subroutines), similar to the receptors on the cellular membranes of a multicellular organism. Each object is an instance of an object class, just as individual cells are instances of a cell type. For example, an individual red blood cell is an instance object of the red blood cell class.
Cells inherit methods in a hierarchy of human cell types, just as objects form a class hierarchy of inherited methods in a class library. For example, all cells have the metabolizeSugar() method, but only red blood cells have the makeHemoglobin() method. Below is a tiny portion of the 210 known cell types of the human body arranged in a class hierarchy.
Human Cell Classes
    2. Connective Tissue
        A. Vascular Tissue
            - Red Blood Cells
        B. Proper Connective Tissue
A chemical message sent from one class of cell instances can produce an abstract behavior in other cells. For example, adrenal glands can send the getScared() message to all cell instances in your body, but all of the cell instances getScared() in their own fashion. Liver cells release glucose and heart cells contract faster when their getScared() methods are called. Similarly, when you call the print() method of a report object, you get a report, and when you call the print() method of a map, you get a map.
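The inheritance and polymorphism described above can be sketched as a tiny Java class hierarchy. The class names follow the text, while the return strings are purely illustrative stand-ins for the cells' real behavior.

```java
// Sketch of the cell-type class hierarchy and the polymorphic
// getScared() message described above. Return strings are illustrative.
class Cell {
    // Every cell type inherits this method from the base class
    String metabolizeSugar() { return "ATP"; }
    // Default scared response; subclasses override it in their own fashion
    String getScared() { return "respond"; }
}

class LiverCell extends Cell {
    @Override
    String getScared() { return "release glucose"; }
}

class HeartCell extends Cell {
    @Override
    String getScared() { return "contract faster"; }
}

class Adrenaline {
    // One chemical message, many behaviors - polymorphism in action
    static String[] broadcast(Cell[] cells) {
        String[] responses = new String[cells.length];
        for (int i = 0; i < cells.length; i++)
            responses[i] = cells[i].getScared();  // each cell responds its own way
        return responses;
    }
}
```

Broadcasting one getScared() message to a LiverCell and a HeartCell produces two different responses, just as one squirt of adrenaline does in your body.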
Figure 9 – Objects are like cells in a multicellular organism that exchange messages with each other (click to enlarge)
The object-oriented revolution, enhanced by the introduction of Java in 1995, caused another mass extinction within IT as structured procedural programs began to be replaced by object-oriented C++ and Java programs, like the Cretaceous-Tertiary extinction 65 million years ago that killed off the dinosaurs, presumably caused by a massive asteroid strike upon the Earth.
Below is a code snippet from a fossil C++ program listed in a book published in 1995. Notice the object-oriented programming technique of using a class specifier to define the data and methods() of objects instantiated from the class. Notice also that the PurchasedPart class inherits code from the more generic Part class. In both C++ and Java, variables and methods that are declared private can only be used by a given object instance, while public methods can be called by other objects to cause an object to perform a certain function, so public methods are very similar to the functions that the cells in a multicellular organism perform when organic molecules bind to their membrane receptors. Later in this posting we will describe in detail how multicellular organisms use this object-oriented approach to isolate functions.
class PurchasedPart : public Part
{
    public:
        PurchasedPart(int pNum, char* desc);
        void setPart(int pNum, char* desc);
};

PurchasedPart Nut(1, "Brass");
Like the geological eras, the Object-Oriented Period got a kick-start from an environmental hardware change. In the early 1990s, the Distributed Computing Revolution hit with full force, which spread computing processing over a number of servers and client PCs, rather than relying solely on mainframes to do all the processing. It began in the 1980s with the introduction of PCs into the office to do stand-alone things like word processing and spreadsheets. The PCs were also connected to mainframes as dumb terminals through emulator software as shown in Figure 6 above. In this architectural topology, the mainframes still did all the work and the PCs just displayed CICS green screens like dumb terminals. But this at least eliminated the need to have an IBM 3278 terminal and PC on a person’s desk, which would have left very little room for anything else! But this architecture wasted all the computing power of the rapidly evolving PCs, so the next step was to split the processing load between the PCs and a server. This was known as the 2-tier client/server or “thick client” architecture (Figure 10). In 2-tier client/server, the client PCs ran the software that displayed information in a GUI like Windows 3.0 and connected to a server running RDBMS (Relational Database Management System) software like Oracle or Sybase that stored the common data used by all the client PCs. This worked great so long as the number of PCs remained under about 30. We tried this at Amoco in the early 1990s, and it was like painting the Eiffel Tower. As soon as we got the 30th PC working, we had to go back and fix the first one! It was just too hard to keep the “thick client” software up and running on all those PCs with all the other software running on them that varied from machine to machine.
These problems were further complicated by the rise of computer viruses in the mid-1980s. Prior to the 2-tier client/server architecture, many office PCs were standalone machines, only connected to mainframes as dumb terminals, and thus totally isolated machines safe from computer virus infection. In the PC topology of the 1980s, computer viruses could only spread via floppy disks, which severely limited their infection rates. But once the 2-tier architecture fell into place, office PCs began to be connected together via LANs (Local Area Networks) and WANs (Wide Area Networks) to share data and other resources like printers. This provided a very friendly environment for computer viruses to quickly spread across an entire enterprise, so the other thing that office PCs began to share was computer viruses. Like the rest of you, I now spend about $40 per year with my favorite anti-virus vendor to protect my own home PC, and every Friday, my corporate laptop does its weekly scan, during which I suffer very sluggish response time the entire day. Over the years, I have seen these weekly scans elongate in time, as more and more viruses must be scanned for. My weekly scans have gone from about an hour, ten years ago, to nearly 7 hours today. It seems that even Intel cannot increase processor speeds as fast as new parasitic forms of software emerge! At this rate, in ten years my laptop will take a full week to run its weekly scan! Computer viruses are purely parasitic forms of software, which will be more fully covered in future postings on Self-Replicating Information and Software Symbiogenesis.
The limitations of the 2-tier architecture led to the 3-tier model in the mid to late 1990s with the advent of “middleware” (Figure 10). Middleware is software that runs on servers between the RDBMS servers and the client PCs. In the 3-tier architecture, the client PCs run “thin client” software that primarily displays information via a GUI like Windows. The middleware handles all the business logic and relies on the RDBMS servers to store data.
Figure 10 – The Distributed Computing Revolution aided object-oriented architecture (click to enlarge)
In the late 1990s, the Internet exploded upon the business world and greatly enhanced the 3-tier model (Figure 10). The “thin client” running on PCs now became a web browser like Internet Explorer. Middleware containing business logic was run on Application servers that produced dynamic web pages that were dished up by Web servers like Apache. Data remained back on mainframes or RDBMS servers. Load balancers were also used to create clusters of servers that could scale with load. As your processing load increased, all you had to do was buy more servers for each tier in the architecture to support the added load. This opened an ecological niche for the middleware software that ran on the Appserver tier of the architecture. At the time, people were coming up with all sorts of crazy ways to create dynamic HTML web pages on the fly. Some people were using Perl scripts, while others used C programs, but these all required a new process to be spawned each time a dynamic web page was created, and that was way too much overhead. Then Java came crashing down like a 10 kilometer wide asteroid! Java, Java, Java – that’s all we heard after it hit in 1995. Java was the first object-oriented programming language to take IT by storm. The syntax of Java was very nearly the same as C++, without all the nasty tricky things like pointers that made C++ and C so hard to deal with. C++ had evolved from C in the 1980s, and nearly all computer science majors had cut their programming teeth on C or C++ in school, so Java benefited from a large population of programmers familiar with the syntax. The end result was a mass extinction of non-Java based software on the distributed computing platform and the rapid rise of Java based applications, like an impactor mass extinction. Even Microsoft went Object-Oriented on the Windows server platform with its .NET Framework using its Java-like C# language.
Procedural, non-object-oriented software, like COBOL, sought refuge in the mainframes, where it still hides today.
Figure 11 – A modern multi-tier website topology (click to enlarge)
SOA - Service Oriented Architecture Period (2004 – Present)
Currently we are entering the Service Oriented Architecture (SOA) Period, which is very similar to the Cambrian Explosion. During the Cambrian Explosion, 541 million years ago, complex body plans first evolved, which allowed cells in multicellular organisms to make RMI (Remote Method Invocation) and CORBA (Common Object Request Broker Architecture) calls upon the cells in remote organs to accomplish biological purposes. In the Service Oriented Architecture Period, we are using common EJB components in J2EE appservers to create services that allow for Applications with complex body plans. The J2EE appservers perform the functions of organs like kidneys, lungs and livers. I am discounting the original appearance of CORBA in 1991 here as a failed precursor, because CORBA never became ubiquitous as EJB seems to be heading. In the evolution of any form of self-replicating information, there are frequently many failed precursors leading up to a revolution in technology.
There is a growing body of evidence beginning to support the geological "Snowball Earth" hypothesis that the Earth went through a period of 100 million years of extreme climatic fluctuations just prior to the Cambrian Explosion. During this period, the Earth seesawed between being completely covered with a thick layer of ice and being a hot house with a mean temperature of 140 °F. Snowball Earth (2003) by Gabrielle Walker is an excellent book covering the struggles of Paul Hoffman, Joe Kirschvink, and Dan Schrag to uncover the evidence for this dramatic discovery and to convince the geological community of its validity. It has been suggested that the resulting stress on the Earth's ecosystems sparked the Cambrian Explosion. As we saw above, for the great bulk of geological time, the Earth was dominated by simple single-celled organisms. The nagging question for evolutionary biology has always been why did it take several billion years for complex multicellular life to arise, and why did it arise all at once in such a brief period of geological time? Like our landfill example above, as a field geologist works up from pre-Cambrian to Cambrian strata, suddenly the rocks burst forth with complex fossils where none existed before. In the Cambrian, it seems like beer cans appeared from nothing, with no precursors. For many, the first appearance of complex life just following the climatic upheaval of the Snowball Earth is compelling evidence that these two very unique incidents in the Earth’s history must be related.
Similarly for IT, the nagging question is why did it take until the first decade of the 21st century for the SOA Cambrian Explosion to take place, when the first early precursors can be found as far back as the mid-1960s? After all, software based upon multicellular organization, also known as object-oriented software, goes all the way back to the object-oriented language Simula developed in 1965, and the ability for objects (cells) to communicate between CPUs arose with CORBA in 1991. So all the precursors were in place nearly 20 years ago, yet software based upon a complex multicellular architecture languished until it was jarred into existence by a series of harsh environmental shocks to the IT community. It was the combination of moving off the mainframes to a distributed hardware platform, running on a large number of servers and client PCs, the shock of the Internet upon the business world and IT, and the impact of Sun’s Java programming language, that ultimately spawned the SOA (Service Oriented Architecture) Cambrian Explosion we see in IT today. These shocks all occurred within a few years of each other in the 1990s, and after the dust settled, IT found itself in a new world of complexity.
Today, Service Oriented Architecture is rapidly expanding in the IT community and is beginning to expand beyond the traditional confines of corporate datacenters, as corporations begin to make services available to business partners over the Internet. With the flexibility of Service Oriented Architecture and the Internet, we are beginning to see the evolution of an integrated service oriented ecology form - a web of available services like the web of life in a rain forest.
To see how this works, let’s examine more closely the inner workings of a J2EE Appserver. Figure 12 shows the interior of a J2EE Appserver like WebSphere. The WebSphere middleware is software that runs on a Unix server, which might host 30 or more WebSphere Appserver instances, and there might be many physical Unix servers running these WebSphere Appserver instances in a Cell (Tier). Figure 11 shows a Cell (Tier 2) consisting of two physical Application servers or nodes, but there could easily be 4 or 5 physical Unix servers or nodes in a WebSphere Cell. This allows WebSphere to scale: as your load increases, you just add more physical Unix servers or nodes to the Cell. So each physical Unix server in a WebSphere Cell contains a number of software Appserver instances as shown in Figure 11, and each Appserver contains a number of WebSphere Applications which do things like create dynamic web pages for a web-based application. For example, on the far left of Figure 12 we see a client PC running a web browser like Internet Explorer. The web browser makes HTTP requests to an HTTP webserver like Apache. If the Apache webserver can find the requested HTML page, like a login page, it returns that static HTML page to the browser for the end-user to fill in his ID and PASSWORD. The user’s ID and PASSWORD are then returned to the Apache webserver when the SUBMIT button is pressed, but now the Apache webserver must come up with an HTML page that is specific to the user’s ID and PASSWORD, like a web page with the end-user’s account information. That is accomplished by having Apache forward the request to a WebSphere Application running in one of the WebSphere Appservers. The WebSphere Appserver has two software containers that perform the functions of an organ in a multicellular organism. The Web Container contains instances of servlets and JSPs (Java Server Pages). A servlet is a Java program which contains logic to control the generation of a dynamic web page.
JSPs are HTML pages with tags for embedded programming logic that are compiled into servlets at execution time. The servlets in the Web Container create objects and are run in a thread pool in the Web Container, like the cells in a liver or kidney. Unlike the mainframe processing of the Unstructured Period, in which a program was loaded into memory, run, and then perished, these servlets remain in memory and are continuously reused by the thread pool to service additional requests, until no further requests arrive and the servlet is destroyed to make room for another servlet in the thread pool. The EJB Container performs a similar function by running EJBs (Enterprise Java Beans) in a thread pool. The EJBs provide business logic and connect to databases (DB) and mainframes (EIS – Enterprise Information Systems). By keeping the servlets and EJBs running continuously in memory, with permanent connections to databases and mainframes via connection pools, the overhead of loading and releasing the servlets is eliminated as well as the creation and tear-down of connections to databases and mainframes. So the Web and EJB Containers of a J2EE Appserver are very much like the cells in an organ which continuously provide services for the other cells of a multicellular organism. Look at it this way, unlike a simple single-celled organism that is born, lives, and dies, your body consists of 100 trillion cells and each day about a trillion cells die and are replaced by a trillion new cells, but through it all you keep going. A simple single-celled organism is like a batch program from the Unstructured Period, while your body runs on a SOA architecture of trillions of cells in thread and connection pools that are constantly coming and going and creating millions of objects that are created (instantiated), used, and later destroyed.
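The key idea here, long-lived workers in a thread pool servicing many short-lived requests, can be sketched with Java's standard ExecutorService. This is a generic analogy, not WebSphere's actual implementation; a fixed pool of worker threads stays alive and is reused for every "request", instead of spawning a new process per request as early CGI scripts did.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the thread-pool reuse idea behind a servlet container:
// a small fixed pool of long-lived worker threads services many
// short-lived requests, eliminating per-request startup overhead.
class RequestPool {
    static int serveRequests(int requestCount, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger served = new AtomicInteger();
        for (int i = 0; i < requestCount; i++) {
            // Each submitted task is one "request"; the same few
            // worker threads handle all of them, like servlets
            // being reused in the Web Container
            pool.submit(served::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return served.get();
    }
}
```

Here four long-lived threads can comfortably service a thousand requests, which is the whole point of the container architecture.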
Figure 12 – Middleware running in a J2EE Application Server (click to enlarge)
Design Patterns – the Phyla of IT
Another outgrowth of the object-oriented programming revolution was the adoption of design patterns by IT. Design patterns originated as an architectural concept developed by Christopher Alexander in the 1960s. In Notes on the Synthesis of Form (1964), Alexander noted that all architectural forms are really just implementations of a small set of classic design patterns that have withstood the test of time in the real world of human affairs and that have been blessed by the architectural community throughout history for both beauty and practicality. Basically, given the physical laws of the Universe and the morphology of the human body, there are really only a certain number of ways of doing things from an architectural point of view that work in practice, so by trial and error architects learned to follow a set of well established architectural patterns. In 1987, Kent Beck and Ward Cunningham began experimenting with the idea of applying the concept of design patterns to programming and presented their results at the object-oriented OOPSLA conference that year. Design patterns gained further popularity in computer science after the book Design Patterns: Elements of Reusable Object-Oriented Software was published in 1994 by Erich Gamma, Richard Helm, and Ralph Johnson. Also in 1994, the first Pattern Languages of Programming Conference was held, and in 1995 the Portland Pattern Repository was established to document design patterns for general IT usage.
However, the concept of design patterns goes back much further than this. In biology a design pattern is called a phylum, which is a basic body plan. For example, the phylum Arthropoda consists of all body plans that use an external skeleton such as the insects and crabs, and the Echinodermata have a five-fold radial symmetry like a starfish. Similarly, the phylum Chordata consists of all body plans that have a large dorsal nerve running down a hollow backbone or spinal column. The Cambrian Explosion, 541 million years ago, brought about the first appearance of a large number of phyla or body plans on Earth. In fact, all of the 35 phyla currently found on the Earth today can trace their roots back to the Cambrian, and it even appears that some of the early Cambrian phyla have gone completely extinct, judging by some of the truly bizarre-looking fossils that have been found in the Burgess shale of the highly experimental Cambrian period.
In IT a design pattern describes a certain design motif or way of doing things. A design pattern is a prototypical design architecture that developers can copy and adapt for their particular application to solve the general problem described by the design pattern. This is in recognition of the fact that at any given time there are only a limited number of IT problems that need to be solved at the application level, and it makes sense to apply a general design pattern rather than to reinvent the wheel each time. Developers can use a design pattern by simply adopting the common structure and organization of the design pattern for their particular application, just as living things adopt an overall body plan or phylum to solve the basic problems of existence. In addition, design patterns allow developers to communicate with each other using well-known and well understood names for software interactions, just as biologists can communicate with each other by using the well-known taxonomic system of classification developed by Carl Linnaeus in Systema Naturae published in 1735.
A design pattern that all Internet users should be quite familiar with is the Model-View-Controller (MVC) design pattern used by most web-applications. Suppose you are placing an order with Amazon. The Model is the data that comprises your Amazon account information, such as your credit card number on file and your mailing address, together with all the items in your shopping cart. In Figure 12 above, the Model is stored on a relational database server DB, such as an Oracle server, or back on a mainframe in an EIS (Enterprise Information System) connected to a mainframe DB2 database as a series of relational database tables. The View is the series of webpages presented to your browser as .html pages that convey the Model data to you in a sensible form as you go about your purchase. These View .html pages are generated by JSPs (Java Server Pages) in the web container of the J2EE Appserver. The Controller is a servlet, a java program running in a thread pool in the web container of the J2EE Appserver, that performs the overall control of your interactions with the Amazon application as you go about placing your order. The Controller servlet calls JSPs and instantiates objects (cells) that call EJB objects (cells) in the EJB container of the J2EE Appserver that interact with the relational database tables storing your data.
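A toy version of the MVC design pattern can be sketched in Java. The class names mirror the pattern itself, while the balance arithmetic and HTML string are purely illustrative; a real J2EE application would use database tables for the Model, JSPs for the View, and a servlet for the Controller.

```java
// A minimal sketch of the Model-View-Controller pattern described above.
class Model {
    // The data: a single account balance stands in for the relational
    // database tables of a real application
    private double accountBalance = 100.0;
    double getAccountBalance() { return accountBalance; }
    void debit(double amount) { accountBalance -= amount; }
}

class View {
    // Renders Model data for the end-user, like a JSP producing HTML
    String render(Model model) {
        return "<html>Balance: " + model.getAccountBalance() + "</html>";
    }
}

class Controller {
    // Coordinates the interaction, like a servlet in the web container
    private final Model model = new Model();
    private final View view = new View();

    String placeOrder(double price) {
        model.debit(price);          // update the Model
        return view.render(model);   // return the next View page
    }
}
```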
It has taken the IT community nearly 60 years to develop a Service Oriented Architecture based upon multicellular organization. This was achieved through a slow evolutionary process via innovation and natural selection performed by millions of independently acting programmers. Granted, this occurred much faster than the three billion years nature took to come up with the same architecture, but we could have done this back in the 1960s if we had known better – after all, the object-oriented language Simula was developed in 1965. Softwarephysics proposes that we use concepts from biology to skip to solutions directly.
Now let’s dig a little deeper and examine biological and computer software at a lower biochemical level. Before getting into the biochemistry of living things, let's briefly review the softwarechemistry of computer software. Computer software is composed of program statements, and program statements are composed of characters. Every computer language has a syntax - a set of allowed statement formats, such as if, for, and while, which are used to perform functions. Programmers are faced with the job of assembling the 256 ASCII characters into valid program statements called source code and then sequencing the program statements in the proper order to create a routine or method to perform a given function. Routines are combined to form programs, methods are combined to form objects, and both are combined to form applications or systems. Creating computer software is very difficult because a single erroneous character in a system composed of hundreds of thousands of lines of code can cause grave consequences.
Living things are faced with much the same problem. For this discussion, we will consider proteins to be biological software. Like computer software, proteins are composed of amino acids (program statements) which are in turn composed of atoms (characters). To form an amino acid, one attaches an amino group containing nitrogen N, a carboxyl group COOH, a hydrogen atom, and an arbitrary side group “R” to a central carbon atom to form the amino acid backbone:
  H H O
  | | ||
H-N-C-C-O-H
    |
    R
A very large number of amino acids can be formed in this way, just as a very large number of program statements can be formed from 256 characters. Every programming language has a valid syntax, and nature follows one also. Of all the possible amino acids, all living things use the same 20 amino acids to build proteins. These 20 amino acids are then assembled into a chain of amino acids called a polypeptide chain. The polypeptide chain then folds up into a protein on its own to minimize the free energy of the polypeptide chain.
Biological hardware and computer hardware also share some similarities. Biological hardware is based upon the energy transitions of molecules containing carbon atoms called organic molecules, while computer hardware is currently based upon the energy transitions of electrons in silicon crystals known as integrated circuit chips. Carbon and silicon are very similar atoms. Silicon lies directly beneath carbon in the Periodic Table because both elements have four electrons in their outer shell and are also missing four electrons in their outer shell. The four missing electrons allow carbon to bind to many other atoms to form molecules rich in energy or information, and the four missing electrons of silicon make it a semiconductor, which means that silicon can be switched from a conducting state to a nonconducting state under certain circumstances to form a transistor. Transistors are currently used for the high speed switches required by the logic gates of computers. The binding energy of carbon to other atoms is just right – not too strong and not too weak, just enough to keep organic molecules together, but not too tightly. The binding energy of silicon to other atoms, like oxygen, is much too strong for living things, and that is why silicon is good for making rocks, while carbon is good for making squishy living things.
In SoftwareChemistry, we saw that carbon was a singularly unique atom, in that it could form sp, sp2, and sp3 molecular orbitals to bind with other atoms. This allows carbon to form linear molecules using sp bonds, triangular sheet-like molecules using sp2 bonds, and tetrahedral-shaped molecules using sp3 bonds with two, three, or four other atoms. Similarly, silicon Si can combine with oxygen O to form a tetrahedral-shaped ionic group called silicate SiO4, which has a charge of negative four (-4), and which looks very much like the methane molecule shown in Figure 6 of SoftwareChemistry, only with silicon Si at the center bound to four oxygen O atoms. The negatively charged SiO4 tetrahedrons combine with positively charged cations of iron, magnesium, aluminum, calcium, and potassium to form silicate minerals which form about 90% of the Earth’s crust. Just like carbon, these silicate minerals can form very complicated 3-dimensional structures of repeating tetrahedra that form single chains, double chains, rings and sheets. Some have suggested that it might be possible for alien forms of life to be silicon-based, rather than carbon-based, but because of the high binding energy of silicate minerals, silicon-based chemical reactions are probably much too slow for silicon-based life forms to have evolved. However, in the Software Universe, we do find that silicon-based life has arisen, not based upon the chemical characteristics of silicon, but based upon the semiconductor characteristics of silicon.
Biological hardware uses 4 classes of organic molecules – carbohydrates, lipids, proteins, and nucleic acids.
Carbohydrates - These molecules are rich in energy and provide the energy required to overcome the second law of thermodynamics in biological processes. Living things degrade the high grade, low entropy, chemical energy stored in carbohydrates into high entropy heat energy in the process of building large complex molecules to carry out biological processes. The end result is the dumping of structural entropy into heat entropy. Similarly, computer hardware degrades high grade, low entropy, electrical energy into high entropy heat energy in order to process information.
Lipids - These molecules are also rich in energy, but certain kinds of lipids perform an even more important role. We saw in SoftwareChemistry that the water molecule H2O is a polar molecule, in that the oxygen atom O has a stronger attraction to the shared electrons of the molecule than the hydrogen atoms H; consequently, the O side of the molecule has a net negative charge, while the H side of the molecule has a net positive charge. Again, this all goes back to quantum mechanics and QED with the exchange of virtual photons that carry out the electromagnetic force. Similarly, the phosphate end of a phospholipid is also charged, and consequently is attracted to the charged polar molecules of water, while the fatty tails of a phospholipid that carry no net charge are not attracted to the charged polar molecules of water.
 O    <- Phosphate end of a phospholipid has a net electrical charge
/ \
| |
| |   <- The tails of a phospholipid do not have a net electrical charge
| |
This causes phospholipids to naturally form membranes, which are used to segregate biologically active molecules within cells and to isolate cells from their environment. Lipids perform a function similar to electrical insulation in computer hardware, or the class definition of an object in Java or C++. What happens is that the phospholipids form a bilayer with the electrically neutral tails facing each other, while the electrically charged phosphate ends are attracted to the polar water molecules inside and outside of the membrane. This very thin bilayer, only two molecules thick, is like a self-sealing soap bubble that naturally takes on a cell-like spherical shape. In fact, soap bubbles form from a similar configuration of lipids, but in the case of soap bubbles, the electrically neutral tails face outward towards the air inside and outside of the soap bubble, while the electrically charged phosphate ends face inwards. In both cases these configurations minimize the free energy of the molecules. Remember, free energy is the energy available to do work, and one expression of the second law of thermodynamics is that systems always try to minimize their free energy. That’s why a ball rolls down a hill; it is seeking a lower level of free energy. This is a key insight. Living things have learned through natural selection not to fight the second law of thermodynamics, like IT professionals routinely do. Instead, living things use the second law of thermodynamics to their advantage, by letting it construct complex structures by simply having molecules seek a configuration that minimizes their free energy. Biochemical reactions are like a rollercoaster ride. Instead of constantly driving a rollercoaster car around a hilly track, rollercoasters simply pump up the potential energy of the car by first dragging it up a large hill, and then they just let the second law of thermodynamics take over. The car simply rolls down hill to minimize its free energy. 
Every so often, the rollercoaster might include some booster hills, where the car is again dragged up a hill to pump it up with some additional potential energy. Biochemical reactions do the same thing. Periodically they pump up the reaction with energy from degrading ATP into ADP. The ATP is created in the Krebs cycle displayed in Figure 10 of SoftwareChemistry. Then the biochemical reactions just let nature take its course via the second law of thermodynamics, to let things happen by minimizing the free energy of the organic molecules. Now that is the smart way to do things!
Below is a depiction of a biological membrane constructed of phospholipids that essentially builds itself when you throw a phospholipid into water. For example, if you throw the phospholipid lecithin, the stuff used to make chocolate creamy, into water it will naturally form spherical cell-like liposomes, consisting of a bilayer of lecithin molecules with water inside the liposome and outside as well.
Outside of a cell membrane there are polar water molecules “+” that attract the phosphate ends of phospholipids “O”:

+ + + + + + + + + + + +   <- Polar water molecules outside of the cell
 O O O O O O O O O O O    <- Charged phosphate ends of phospholipids
 | | | | | | | | | | |    <- Electrically neutral tails of phospholipids
 | | | | | | | | | | |    <- Electrically neutral tails of phospholipids
 O O O O O O O O O O O    <- Charged phosphate ends of phospholipids
+ + + + + + + + + + + +   <- Polar water molecules inside of the cell

Inside of a cell membrane there also are polar water molecules “+” that attract the phosphate ends of phospholipids “O”, resulting in a bilayer.
Proteins - These are the real workhorse molecules that perform most of the functions in living things. Proteins are large molecules, which are made by chaining together several hundred smaller molecules called amino acids. An amino acid contains an amine group on the left containing nitrogen and a carboxyl group COOH on the right. Attached to each amino acid is a side chain “R” which determines the properties of the amino acid. The properties of the side chain “R” depend upon the charge distributions of the atoms in the side chain.
  H H O
  | | ||
H-N-C-C-O-H
    |
    R
The amino acids can chain together into very long polypeptide chains by forming peptide bonds between the amine group of one amino acid and the carboxyl group of its neighbor, releasing a water molecule in the process.
There are 20 amino acids used by all living things to make proteins.
There are four types of proteins:
1. Structural Proteins - These are proteins used for building cell structures such as cell membranes or the keratins found in fingernails, hooves, scales, horns, beaks, and feathers. Structural proteins are similar to webserver static content, such as static HTML pages and .jpg and .gif graphics files that provide structure to a website, but do not perform dynamic processing of information. However, we shall see that some structural proteins in membranes do perform some logical operations.
2. Enzymes - These are proteins that do things. Enzymes perform catalytic functions that dramatically speed up the chemical processes of life. Enzymes are like message-driven or entity EJBs that perform information-based processes. Enzymes are formed from long polypeptide chains of amino acids, but when they fold up into a protein, they usually only have a very small active site, where the side chains of a handful of amino acids in the chain do all the work. As we saw in Figure 9 of SoftwareChemistry, the charge distribution of the amino acids in the active site can match up with the charge distribution of other specific organic molecules, forming a lock and key configuration, to either bust the organic molecules up into smaller molecules or paste them together into a larger organic molecule. Thus, you can think of enzymes as a text editor that can do “cut and paste” operations on source code. These “cut and paste” operations can proceed at rates as high as 500,000 transactions per second for each enzyme molecule.
3. Hormones - These are control proteins that perform logical operations. For example, the hormone insulin controls how your cells use glucose, and large doses of testosterone or estrogen dramatically change your physical appearance. Hormones are like session EJBs that provide logical operations.
4. Antibodies - These are security proteins which protect against invading organisms. These proteins can learn to bind to antigens, proteins embedded in the cell membranes of invading organisms, and then to destroy the invaders. Antibodies are like SSL or LDAP security software.
Now let us assemble some of the above components into a cell membrane. Please excuse the line printer graphics. I was brought up on line printer graphics, and it is hard to break old habits. Plus, it saves on bandwidth and decreases webpage load times. First we insert a structural protein “@” into the phospholipid bilayer. This structural protein is a receptor protein that has a special receptor socket that other proteins called ligands can plug into.
Below we see a ligand protein “L” approach the membrane receptor protein, and we also see an inactive enzyme “I” in the cytoplasm inside of the cell membrane:
When the ligand protein “L” plugs into the socket of the membrane protein receptor “@”, it causes the protein receptor “@” to change shape and open a socket on the inside of the membrane wall. The inactive enzyme plugs into the new socket and becomes an active enzyme “A”:
The activated enzyme “A” is then released to perform some task. This is how a message can be sent to a cell object by calling an exposed method() via the membrane protein receptor “@”. Again this is all accomplished via the electromagnetic interaction of the Standard Model with the exchange of virtual photons between all the polar molecules. In all cases, the second law of thermodynamics comes into play, as the ligand, membrane receptor protein, and inactive enzyme go through processes that minimize their free energy. When the ligand plugs into the membrane receptor protein, it causes the receptor protein to change its shape to minimize its free energy, opening a socket on the interior of the cell membrane. It’s like entering an elevator and watching all the people repositioning themselves to minimize their proximity to others.
Another important role of membranes is their ability to selectively allow certain ions to pass into or out of cells, or even to pump certain ions in or out of a cell against an electrical potential gradient. This is very important in establishing the electrical potential across the membrane of a neuron, which allows the neuron to transmit information. For example, neurons use K+ and Na+ pumps to selectively control the amount of K+ or Na+ within. Below is depicted a K+ channel in the closed position, with the embedded gating proteins “@” tightly shut, preventing K+ ions from entering the neuron:

+++++++++++++++++++@@@@@@@++++++++++++++++++++++++++
|||||||||||||||||||||||||||||||||||@@@@@@@||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||@@@@@@@||||||||||||||||||||||||||||||||||||||||||||||||
+++++++++++++++++++@@@@@@@++++++++++++++++++++++++++
When the K+ channel is opened, K+ ions can pass from the outside into the neuron:
+++++++++++++++++++@@ K @@++++++++++++++++++++++++++
|||||||||||||||||||||||||||||||||||@@ K @@||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||@@ K @@||||||||||||||||||||||||||||||||||||||||||||||||
+++++++++++++++++++@@ K @@++++++++++++++++++++++++++
So where does the information required to build these proteins come from?
Nucleic Acid - All genetic information on the planet is currently stored in nucleic acid. Nucleic acid is the Data Base Management System (DBMS) hardware used by all living things. Nucleic acid is used to store the instructions necessary to build proteins and comes in two varieties, DNA and RNA. DNA is the true data store of life, used to persist genetic data, while RNA is usually only used as a temporary I/O (Input/Output) buffer. However, there are some viruses, called retroviruses, that actually use RNA to persist genetic data too.
DNA is a very long molecule, much like a twisted ladder. The vertical sides of the ladder are composed of sugar and phosphate molecules, and bound to the sides are rungs composed of bases. The bases are called Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Because the bases have different shapes, Adenine (A) can only pair with Thymine (T), and Cytosine (C) can only pair with Guanine (G) to form a base pair rung along the length of the DNA molecule. Again, this is accomplished by forming lock and key configurations between the various 3-dimensional charge distributions on the bases, as shown in Figure 9 of SoftwareChemistry. So the beautiful double-helix structure of DNA all goes back to the strange quantum mechanical characteristics of carbon, with its four valence electrons that have p orbitals with angular momentum, allowing for complex 3-dimensional bonds to form between atoms. If the Universe did not contain carbon, or if electrons did not have angular momentum in their p orbitals, or if electrons did not have to have a unique set of quantum numbers and could all just pile up in the 1s orbital of carbon atoms, there would be no life in the Universe, and we would not be here contemplating the wonder of it all. Again, this is an example of the weak Anthropic Principle in action.
DNA is replicated in living things by unzipping the two sides of the ladder and using each side of the ladder as a template to form its mirror image. This is all done with enzyme proteins that dramatically speed up the process to form two strands of DNA from the original copy. In bacteria, DNA can be replicated at about 1000 base pairs per second, while in human cells the replication rate is only about 50 base pairs per second, but replication can proceed at multiple forks simultaneously.
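We can sanity-check these replication rates with a little arithmetic. Assuming a bacterial genome of roughly 4.6 million base pairs (the approximate size of the E. coli genome, an assumed figure not quoted above) and two replication forks moving in opposite directions:

```python
# Rough replication time for a bacterium at the rates quoted above.
GENOME_BP = 4_600_000    # assumed bacterial genome size in base pairs
RATE_PER_FORK = 1000     # base pairs per second, as quoted in the text
FORKS = 2                # two forks replicating in opposite directions

minutes = GENOME_BP / (RATE_PER_FORK * FORKS) / 60
print(round(minutes))    # ~38 minutes
```

That is in good agreement with the 40 minutes or so that a bacterium actually needs to copy its chromosome, which is why replicating at multiple forks matters so much to the slower human replication machinery.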
RNA is very similar to DNA in structure, except that RNA uses a substitute base called Uracil (U) in place of Thymine (T) to bind to Adenine (A). Also, RNA is usually only a half ladder; it does not have a complementary side as does DNA. However, the bases along a strand of RNA can bind to each other to form complex tangled structures, leaving some bases still exposed to bind to other molecules. This allows RNA to form complicated structures with some of the properties of enzymes. Thus RNA can fold into a tangled structure called transfer RNA (tRNA), and ribosomal RNA (rRNA) can combine with small proteins to form ribosomes. Ribosomes and tRNA are used in the process that transcribes the information stored in DNA into a protein, which will be described shortly. So RNA is a much more dynamic molecule than DNA because RNA can both store genetic information and perform limited dynamical operations on organic molecules in a manner similar to enzymes. We shall see shortly that the chief advantage of DNA is that it is a ladder with two sides, which allows data to be persisted more accurately and allows for data recovery when there is a hardware failure within the molecule.
Computer hardware is built with semiconductors using base 2 or binary logic. The voltages on a transistor can be set so that the transistor is either conducting or nonconducting. These two states can be used to store a "1" or a "0", which we define as a bit. To define a unique bit map for all the characters and numbers we want to store, we use 8 bits, which we define as a byte. This yields 256 bit maps which can be used to code for characters. For example, in ASCII:
Character = Bits
A = 01000001
B = 01000010
C = 01000011
ASCII stands for American Standard Code for Information Interchange. Since computers can only understand numbers, an ASCII code is the numerical representation of a character such as 'A' or '@' or an action of some sort. ASCII was developed a long time ago, and now the characters 127 - 255 are rarely used for their original purpose. ASCII was actually designed for use with teletypes, like the one I used on my old DEC PDP 8/e minicomputer, and so the descriptions of the first 32 non-printing characters are now somewhat obscure, except for ACK, SYN, and NAK, which are still used in the TCP/IP protocol. Below are the meanings of some of the first 32 non-printing characters:
SOH – Start of Heading
STX – Start of Text
ETX – End of Text
EOT – End of Transmission
ENQ – Enquiry
ACK – Acknowledge
NAK – Negative Acknowledge
BEL – Sound the bell
BS – Backspace the printer carriage
TAB – tab the printer carriage
LF – Line Feed
CR – Carriage Return (return the printing carriage back to the left)
SYN – Synchronous Idle
ESC – Escape
SPACE – Print a blank space
The ASCII Code
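You can verify these bit maps for yourself. A couple of lines of Python will print the 8-bit byte for any ASCII character:

```python
# Each ASCII character maps to an 8-bit byte; 2**8 = 256 possible bit maps.
def ascii_bits(ch):
    """Return the 8-bit pattern that encodes a single character."""
    return format(ord(ch), "08b")

print(ascii_bits("A"))  # 01000001
print(ascii_bits("B"))  # 01000010
print(ascii_bits("C"))  # 01000011
```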
Biological hardware uses base 4 logic because a DNA base can be in one of four states A,C,G, or T. A unique bit map is required for each of the 20 amino acids, and all living things use this same bit map. A two-bit byte could code for 4 x 4 = 16 amino acids, which is not enough. A three-bit byte can code for 4 x 4 x 4 = 64 possible amino acids. A three-bit byte is, therefore, the smallest possible byte, and this is exactly what nature uses. Three DNA base pairs define a biological byte, which is the code for a particular amino acid. Biologists call these three-bit bytes codons.
SER = TCA
VAL = CAG
GLY = CCT
As with ASCII, there are 3 special “non-printing” bytes, or codons, used for control purposes, called stop codons – ATT, ACT, and ATC. The functions of these stop codons will be explained later.
The Genetic Code
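The byte arithmetic above is easy to verify by brute force. Here is a sketch in Python that enumerates all possible two-bit and three-bit biological bytes:

```python
from itertools import product

bases = "ACGT"  # base 4 logic: each bit (base) can be in one of four states

# All possible two-base and three-base bytes (codons):
two_base_bytes = ["".join(p) for p in product(bases, repeat=2)]
three_base_bytes = ["".join(p) for p in product(bases, repeat=3)]

print(len(two_base_bytes))    # 16 - not enough for 20 amino acids
print(len(three_base_bytes))  # 64 - enough, with plenty of redundancy
```

With 64 possible bytes for only 20 amino acids plus 3 stop codons, 41 bit maps are left over for the redundancy discussed below.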
As you can see, since we only need to code for 20 amino acids, but have 64 possible bit maps, there is quite a bit of redundancy, in that many of the three-bit bytes code for the same amino acid. Because living things have to deal with the second law of thermodynamics and nonlinearity, just as you do as a programmer, the coding map has evolved through natural selection, so that when likely errors do occur, the same amino acid still gets mapped. So if one of the three-bit bytes coding for a protein gets fat-fingered, you still end up with the same amino acid being loaded into the polypeptide chain for the protein under construction, like Word correcting your spelling on the fly when you fat-finger some text! For example, the byte GAA coding for the amino acid leucine LEU, might suffer a mutation to the byte GAT, but the new GAT byte still maps to the amino acid leucine LEU, so no harm is done.
DNA is really a very long sequential file, which is subdivided into genes. A gene is several thousand base pairs or bits long and defines the amino acid sequence of a particular protein. To build a protein, the DNA file is opened (the molecule literally opens along the base pair bonds), and the DNA bits are transcribed to an I/O buffer called messenger RNA (mRNA). The mRNA bases bind to the DNA bases, forming a complementary mirror image, according to the pairing rules A-U and C-G. This proceeds until one of the previously mentioned stop codons is encountered, which ends the transcription process. The mRNA then carries the code of the DNA molecule into the cytoplasm for further processing by spherical clumps of nucleic acid called ribosomes. Ribosomes perform a function similar to the read/write head of a Turing machine. The ribosomes sequentially read the bytes along an mRNA strand and output a polypeptide chain (protein). This is accomplished by sequentially aligning amino-acid-charged transfer RNA (tRNA) molecules according to the sequence on the mRNA strand. Parallel processing is achieved by pipelining multiple ribosomes simultaneously down the same strand of mRNA. The entire process is conducted by a set of specific enzymes at a rate of about 30 amino acids per second. The end result is a polypeptide chain of amino acids.
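The whole transcription and translation pipeline can be sketched in a few lines of Python. The codon table below is only a small, partial slice of the full 64-entry genetic code, using the DNA template-strand codons quoted in this posting, and all function names are hypothetical:

```python
# Transcription: each DNA base pairs with its complementary mRNA base.
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

# A *partial* genetic code keyed by mRNA codon (the full table has 64 entries).
RNA_CODON_TABLE = {
    "AGU": "SER", "GUC": "VAL", "GGA": "GLY",
    "CUU": "LEU", "CUA": "LEU",
    "UAA": "STOP", "UGA": "STOP", "UAG": "STOP",
}

def transcribe(dna_codon):
    """Build the complementary mRNA codon, pairing A-U and C-G."""
    return "".join(DNA_TO_MRNA[base] for base in dna_codon)

def translate(dna_gene):
    """Read three-bit bytes (codons) until a stop codon, like a ribosome."""
    protein = []
    for i in range(0, len(dna_gene), 3):
        amino_acid = RNA_CODON_TABLE[transcribe(dna_gene[i:i + 3])]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

print(translate("TCACAGGAAATT"))  # ['SER', 'VAL', 'LEU'] - ATT is a stop codon
```

Note that translate("GAAATT") and translate("GATATT") both yield ['LEU'] - the redundancy of the genetic code at work, just as described above.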
The polypeptide chain then folds itself up into a protein, as it tries to minimize its free energy in accordance with the second law of thermodynamics. For example, an enzyme protein might fold itself up into a wrench shaped molecule, with an active site exposed for other molecules to plug into. The electrical charge distribution of the “R” side chains of the amino acids in the polypeptide chain at the active site will do all the work. Once again, see how living things do not fight the second law of thermodynamics, but use it to self-assemble parts. It’s like throwing a bunch of iron filings into a bag and having them self-assemble into a wrench, or throwing a bunch of scrabble tiles on the floor and having them self-assemble into perfect source code!
As you might expect, DNA is very sensitive to the second law of thermodynamics. Sometimes a DNA molecule will drop or add a base pair bit within a gene. This is usually fatal to the host cell because the word alignment of the gene bytes gets shifted, and all of the amino acids coded by the rest of the gene are incorrect. A usually less fatal mutation occurs when a bit changes from one state to another. In this case, the coded protein will be produced with only one amino acid in error. The shape of the erroneous protein may be close enough to the correct protein that it can serve the same function. However, such a substitution mutation can also have serious consequences. In each cell in your bladder, there is a gene with a byte containing the coding sequence below. It has been found that the alteration of one bit from G to T in the byte will produce a protein which causes the host cell to become malignant. Genes with this property are known as oncogenes.
NORMAL BYTE OR CODON -> CANCER BYTE OR CODON (one bit flipped from G to T)
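The difference between a dropped bit and a flipped bit is easy to demonstrate. In the sketch below (hypothetical sequences), flipping one bit changes a single byte, while dropping one bit shifts the word alignment of every byte downstream:

```python
def to_codons(gene):
    """Chunk a gene into three-bit bytes (codons), discarding any partial byte."""
    return [gene[i:i + 3] for i in range(0, len(gene) - len(gene) % 3, 3)]

original = "GAATCACAG"   # three bytes: GAA TCA CAG
point    = "GATTCACAG"   # one bit flipped: only the first byte changes
frame    = "GATCACAG"    # one bit dropped: the word alignment shifts

print(to_codons(original))  # ['GAA', 'TCA', 'CAG']
print(to_codons(point))     # ['GAT', 'TCA', 'CAG'] - one byte in error
print(to_codons(frame))     # ['GAT', 'CAC'] - every downstream byte is wrong
```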
Next we need to discuss how the information encoded in DNA genes is transcribed into a protein in greater detail. The DNA bases in a gene, used to code for the sequence of amino acids in a protein, are called the operon of the gene. Just upstream from the operon is a section of DNA which is used to initiate mRNA transcription and to identify the gene. These leading bases are called the promoter of the gene, and their purpose is to signal the onset of the gene. The enzyme which builds the mirror image mRNA strand from the DNA template is called RNA polymerase. The promoter consists of a sequence of bases, which provides RNA polymerase a section of DNA to grip onto and initiate transcription. For example, the promoter for E. coli bacteria consists of two strings, "TTGACA" and "TATAAT". The first string is located about 35 bits upstream of the codon portion of E. coli genes, while the second string is located about 10 bits upstream. Just past the promoter for the gene is a section of DNA called the operator. The operator is several bits long, and its sequence of bases uniquely identifies the gene, like the key field on a record in a sequential file. When E. coli bacteria have enough of a particular protein, a repressor protein will begin to grab onto the operator for the gene of the protein no longer needed. The repressor protein derails the RNA polymerase before transcription can begin and essentially turns the gene off. Trailing the codon portion of genes is a sequence of DNA bases which signals the end of the gene, a stop codon. The DNA bases between the end of gene sequence of one gene and the promoter for the next gene consist of "junk" spacer DNA, which apparently does not encode information.
As we have seen, DNA is very fragile to mutation. One bad bit can be fatal to the host organism. Nature uses several techniques to protect the vital DNA, which are similar to techniques independently developed to protect computer data. To illustrate this, we shall compare the storage of data in DNA with standard magnetic tape. So let us take a nostalgic trip back to 1964 and the IBM System/360 (Figures 4 and 5), which introduced standard 9 track magnetic tape. These tapes were ½ inch wide and 2400 feet long and were stored on a reel. The tape had 9 separate tracks, 8 tracks were used to store data and one track, the parity track, was used as a check. The 8 bits of a byte were stored across the eight data tracks of the tape. Odd parity was used to check the 8 bits of a byte. If the 8 bits added up to an even number of 1s, then the parity bit was set to "1" which was the odd complement of the even number of 1s. If the 8 bits added up to an odd number of 1s, then the parity bit was set to "0". This allowed the computer to determine if one of the bits accidentally changed states from a "1" to "0" or vice versa.
|101101100| <- 8 data tracks followed by one parity track on far right
|011101100| the sum of all the 1s across the 9 tracks is an odd number
|100011101| <- 8 bits = 1 byte
|101001000| <- Multiple Blocked Records
|011001101| <- End of Block
| | <- Inter Record Gap = 0.60 inches
|101101100| <- Start of Next Block (Begins with the Record Key Field like a Social Security Number)
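The odd parity scheme can be sketched in a few lines of Python (hypothetical function names):

```python
def parity_bit(data_bits):
    """Return the odd-parity bit for a string of 8 data bits."""
    ones = data_bits.count("1")
    return "1" if ones % 2 == 0 else "0"

def check(frame):
    """A 9-bit frame is valid only if it holds an odd number of 1s."""
    return frame.count("1") % 2 == 1

# The first row of the tape diagram above: 8 data bits plus their parity bit.
frame = "10110110" + parity_bit("10110110")
print(frame)  # 101101100
```

If any single bit in the frame flips, check() returns False, but with 8 data tracks there is no way to tell which bit flipped, so the error can be detected but not repaired.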
Computer data was normally stored as records, a collection of data bytes coding for characters which had some particular meaning, such as the payroll data about a particular employee. To improve I/O, the records were normally blocked together into blocks of records, which could be read and written with one I/O operation by the tape drive. Between the blocks of records on a tape there was a 0.6 inch inter-record gap, which segregated blocks of records. The tapes could be read forwards or backwards, under program control, to allow the programmer to get to particular groups of records, as can be seen in old science fiction movies of computers with blinking lights and tape drives rapidly spinning back and forth in the background. To identify a particular record, a unique key field was frequently used for the first few bytes of each record. For example, the first 9 bytes of a payroll record might be an employee's social security number. This key field would uniquely identify the data on the record as belonging to a particular employee.
Typically, a programmer could access a tape with the following JCL (Job Control Language) statements.
//SYSUT1 DD DSN=DECEMBER.PAYROLL.MASTER,VOL=SER=111111,
//          UNIT=TAPE,DISP=OLD,
//          DCB=(RECFM=FB,LRECL=80,BLKSIZE=8000)
This JCL explains that the DSN or DataSet Name of the data on the tape is the DECEMBER.PAYROLL.MASTER and that the Volume Serial number that appears on the outside of the tape reel is “111111”, which allows the computer operator to fetch the proper tape. The UNIT=TAPE lets everybody know this is a tape file and not a file on a disk drive. The DISP=OLD means that the data on the tape already exists, and we do not need a fresh blank tape to write on. The RECFM=FB means that this tape is blocked. The LRECL record length of individual records in a block is 80 bytes. The block size BLKSIZE is 8000 bytes, so that means there are 100 records in each block of data on the tape. The story goes that JCL was developed by IBM one weekend as a temporary job control language, until a “real” job control language could be written for the new System/360 operating system. However, the above JCL will still work just fine at any datacenter running the IBM z/OS operating system on a mainframe.
Originally, 9 track tapes had a density of 1600 bytes/inch of tape, with a data transfer rate of 15,000 bytes/second. Later, 6250 bytes/inch tape drives became available, with a maximum data capacity of 170 megabytes for a 2400 ft reel blocked at 32,767 bytes per block. Typically, much smaller block sizes, such as 4K (4,096 bytes) were used, in which case the storage capacity of the tape was reduced by 33% to 113 megabytes. Not too good, considering that you can now buy a PC disk drive that can hold 2,000 times as much data for about $100.
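We can roughly reproduce these capacity figures with some back-of-envelope arithmetic. One assumption in the sketch below: the 6250 bytes/inch drives used a 0.3 inch inter-block gap, rather than the 0.6 inch gap of the earlier 1600 bytes/inch drives.

```python
# Back-of-envelope capacity of a 2400-foot reel at 6250 bytes/inch,
# assuming a 0.3-inch inter-block gap between blocks.
DENSITY = 6250           # bytes per inch
TAPE_INCHES = 2400 * 12  # a 2400-foot reel
GAP = 0.3                # assumed inter-block gap in inches

def capacity_mb(blksize):
    """Megabytes that fit on the reel for a given block size."""
    block_inches = blksize / DENSITY + GAP
    blocks = TAPE_INCHES // block_inches
    return blocks * blksize / 1_000_000

records_per_block = 8000 // 80  # LRECL=80, BLKSIZE=8000 from the JCL above

print(records_per_block)              # 100
print(round(capacity_mb(32767)))      # ~170 megabytes at maximum blocking
```

Smaller block sizes waste a larger fraction of the tape on gaps, which is why the 4K blocking quoted above drops the capacity so sharply.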
Data storage in DNA is very similar to magnetic tape. DNA uses two tracks, one data track and one parity track. Every three biological bits (bases) along the data track forms a biological byte, or codon, which is the code for a particular amino acid. The biological bits are separated by 3.4 Angstroms along the DNA molecule, yielding a data density of about 25 megabytes/inch. Using this technology, human cells store 2 billion bytes of data on 7 feet of DNA. The 100 trillion cells in the human body manage to store and process 200,000 billion gigabytes of data stored on 700 trillion feet of DNA, enough to circle the Earth 5.3 million times.
|-CG-| <- One Data Track on Left, With Corresponding Parity Track on Right
|-GC-| | 3 Bits = 1 Byte
|-TA-| <- Multiple blocked genes
|-TA-| <- End of Gene Block
|-CG-| <- Inter Gene Gap of Junk DNA
|-GC-| <- Start of Next Gene Block
|-AT-| <- Gene Block Key Field (Promoter and Operator)
|-AT-| <- First Gene of Block (Operon)
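The data density figures above follow directly from the 3.4 Angstrom bit spacing. Here is the arithmetic:

```python
# DNA data density from first principles: bits (bases) are spaced
# 3.4 Angstroms apart, and three bits make one biological byte (codon).
ANGSTROMS_PER_INCH = 2.54e8
BASE_SPACING = 3.4                  # Angstroms between bits

bits_per_inch = ANGSTROMS_PER_INCH / BASE_SPACING
bytes_per_inch = bits_per_inch / 3  # three-bit bytes

print(round(bytes_per_inch / 1e6))  # ~25 megabytes/inch

# A human cell stores its data on about 7 feet of DNA:
bytes_per_cell = bytes_per_inch * 7 * 12
print(round(bytes_per_cell / 1e9))  # ~2 billion bytes per cell
```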
The higher forms of life (eukaryotes) store DNA in the nucleus of the cell folded tightly together with protective structural proteins called histones to form a material known as chromatin out of which the structures known as chromosomes form. There are 5 major histone proteins, and DNA is wound around these proteins like magnetic tape was wound around tape spools in the 1960s and 1970s. As you can see, these histone proteins are critical proteins, and all eukaryotic forms of life, from simple yeasts to peas, cows, dogs, and people use essentially the same histone proteins to wind up DNA into chromatin. The simpler forms of life, such as bacteria, are known as prokaryotes. Prokaryotes do not have cell nuclei; their DNA is free to float in the cell cytoplasm unrestrained. Because the simple prokaryotes are basically on their own and have to react quickly to an environment beyond their control, they generally block genes together based upon functional requirements to decrease I/O time. For example, if it requires 5 enzymes to digest a particular form of sugar, bacteria would block all 5 genes together to form an operon which can be transcribed to mRNA with one I/O operation. The genes along the operon gene block are variable in length and are separated by stop codons or end-of-gene bytes. The stop codons cause the separate proteins to peel off as the ribosome read/write head passes over the stop codons on the mRNA.
Eukaryotic cells, on the other hand, generally perform specific functions within a larger organism. They exist in a more controlled environment and are therefore less I/O intensive than the simpler prokaryotes. For this reason, eukaryotes generally use unblocked genes. For both eukaryotes and prokaryotes, each gene or block of genes is separated by an inter-gene gap composed of nonsense DNA, similar to the inter-record gap on a 9-track tape. Whether the operon block consists of one gene or several genes, it is usually identified by a unique key field called the operator (recall that the operator lies just past the promoter for the gene), just as a record on a 9-track tape uses a key field, like a social security number, to uniquely identify the record on the tape. The operator is several bytes long, and its sequence of bases uniquely identifies the gene. To access genes, cells attach or detach a repressor molecule to the operator key field to turn the gene off or on.
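The operator-as-key-field analogy can be made concrete with a minimal sketch: find a gene block on a long stretch of DNA by searching for its operator key, and treat a bound repressor as the off switch. All of the sequences below are invented placeholders, not real operator sequences:

```python
# Sketch of the operator key field: locate a gene block by its unique
# operator sequence, much as a keyed record is located on tape, and let a
# bound repressor molecule turn the block off.
def find_gene_block(dna: str, operator: str) -> int:
    """Return the offset of the gene block whose operator key matches,
    or -1 if the key is not present."""
    return dna.find(operator)

def is_transcribed(operator: str, bound_repressors: set[str]) -> bool:
    """A gene block is read only when no repressor sits on its operator."""
    return operator not in bound_repressors

dna = "GGCGC" + "TTGACA" + "ATGAAACCC"   # junk-DNA gap + operator key + gene body
print(find_gene_block(dna, "TTGACA"))    # 5
print(is_transcribed("TTGACA", bound_repressors={"TTGACA"}))   # False: repressed
```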
Nature also uses odd parity as a data check. If a bit on the data track is an A, then the parity bit will be a T, the complement of A. If the data bit is a G, then the parity bit will be a C. Using only two tracks decreases the data density of DNA, but it also improves the detection and correction of parity errors. In computers, when a parity error is detected, the data cannot be restored, because one cannot tell which of the 8 data bits was mutated. Also, if two bits should change, a parity error will not even be detected. By using only one data track and an accompanying parity track, parity errors are more reliably detected, and the data track can even be repaired based upon the information in the parity track. For example, during DNA replication, three enzymes are used to check, double check, and triple check for parity errors. Detected parity errors are corrected at the time of replication by these special enzymes. Similarly, if the data track of a DNA molecule should get damaged by radiation from a cosmic ray, the DNA molecule can be repaired by fixing the data track based on the information contained on the parity track. There are enzyme proteins that constantly run up and down the DNA molecule checking for parity errors. When an error is detected, the error is repaired using the information from the undamaged side of the molecule. This works well as long as the error rates remain small. When massive doses of radiation are administered, the DNA cannot be repaired fast enough, and the organism dies.
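The key difference from byte parity can be seen in a short sketch: because every data base has its own parity base, a damaged base can actually be corrected, not merely detected. The sequences below are invented, with '?' standing in for a cosmic-ray-damaged base:

```python
# Sketch of the two-track parity idea: each data base pairs with a
# complementary parity base, so a damaged data base can be repaired from
# the parity track (ordinary byte parity only detects errors).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def check_and_repair(data: str, parity: str) -> str:
    """Repair damaged ('?') or mismatched data bases from the parity track."""
    repaired = []
    for d, p in zip(data, parity):
        if d in COMPLEMENT and COMPLEMENT[d] == p:
            repaired.append(d)               # tracks agree: no parity error
        else:
            repaired.append(COMPLEMENT[p])   # restore the data base from parity
    return "".join(repaired)

# A cosmic ray damages one base on the data track; the parity track fixes it:
print(check_and_repair("AC?T", "TGCA"))   # ACGT
```

Note that this repair trick only works as long as at most one track is damaged at any given spot, which is why the text's caveat about massive radiation doses applies.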
Molecular biologists are currently troubled by the complexity of DNA in eukaryotic cells. As we have seen, prokaryotic cells usually block their genes and have a unique key field (operator) for each gene block (operon). Eukaryotic DNA data organization is apparently much more complicated. Eukaryotes do not usually block genes. They generally locate each gene on a separate block, separated by inter-gene gaps. The internal structure of the eukaryotic gene is also peculiar. Eukaryotic genes are composed of executable DNA base sequences (exons), used to code for amino acid sequences, which are interrupted by non-executable base sequences of "nonsense DNA" called introns. Introns range from 65 bits to 100,000 bits in length, and a single gene may contain as many as 50 introns. In fact, many genes have been found to contain more intron bases than exon bases. This explains why eukaryotic genes are so much longer than prokaryotic genes. Much of the DNA in eukaryotic genes is actually intron DNA, which does not encode information for protein synthesis.
When a eukaryotic gene is transcribed, both the exons and introns of the gene are copied to a strand of pre-mRNA. The beginning of an embedded intron is signaled by a 17 bit string of bases, while the end of the intron is signaled by a 15 bit string. These sections of intron mRNA must be edited out of the pre-mRNA before the mRNA can be used to synthesize a protein. The editing is accomplished in the nucleus by a structure called a spliceosome. The spliceosome is composed of snRNP's (pronounced "SNURPS"), whose function is to edit out the introns and splice the exons together to form a strand of mRNA suitable for protein synthesis. Once the editing process is complete, the spliceosome releases the mRNA, which slips through a nuclear pore into the cell cytoplasm where the ribosomes are located.
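The editing step can be sketched as a find-and-delete pass over the pre-mRNA. As a simplification of the longer signal strings described above, the sketch uses the two-base GU...AG splice-site markers that flank real introns; the pre-mRNA sequence itself is invented:

```python
import re

# Toy spliceosome: edit the introns out of a pre-mRNA string and splice
# the remaining exons together. Real splice sites involve longer signal
# sequences; the short GU...AG markers here are a simplification.
def splice(pre_mrna: str) -> str:
    """Remove every intron (shortest GU...AG span) and join the exons."""
    return re.sub(r"GU.*?AG", "", pre_mrna)

#           exon 1     intron     exon 2
pre_mrna = "AUGGCC" + "GUAAAG" + "CCAUAA"
print(splice(pre_mrna))   # AUGGCCCCAUAA
```

A real spliceosome, of course, does far better than a non-greedy regular expression; this only illustrates the edit-out-and-splice data flow.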
pre-mRNA with Introns + snRNP's in Spliceosome ----> mRNA with Introns Removed
The following section is from an internal paper I wrote at Amoco in March of 1987 and is unchanged from the original paper. Some of these ideas have proven true in the opinion of some molecular biologists, but the role of introns still remains a mystery.
Molecular biologists are puzzled as to why perhaps 50% of the DNA in a eukaryotic gene is composed of intron "nonsense DNA". I find it hard to believe that such an inefficient data storage method would be so prevalent. Perhaps some clues from classical data processing would be of assistance. Programmers frequently include non-executable statements in programs to convey information about the programs. These statements are called comment statements and are usually bracketed by some defining characters such as /* */. These characters tell the compiler or interpreter to edit out, or skip over, the comment statements since they are non-executable. The purpose of the comment statements is to convey information to programmers about the program, while normal program statements are used to instruct the performance of the computer.
Another example of non-executable information is the control information found within relational tables. When one looks at the dump of a relational table, one can easily see the data stored by the table in the form of names, locations, quantities, etc. in the dump. One can also see a great deal of "nonsense data" mixed in with the real data. The "nonsense data" is control information which is used by the relational DBMS to locate and manage the information within the relational table. Even though the control information does not contain information about mailing lists or checking accounts, it is not "nonsense data". It only appears to be "nonsense data" to programmers who do not understand its purpose.
When molecular biologists examine a "dump" of the bases in a eukaryotic gene, they also can easily identify the bases used to code for the sequence of amino acids in a protein. Perhaps the "nonsense" bases they find within the introns in the "dump" might really represent biological control information for the gene. For example, the introns might store information about how to replicate the gene or how to fold it up to become part of a chromosome.
Programmers also use comment statements to hide old code from the compiler or interpreter. For example, if a programmer changes a program, but does not want to throw away the old code, he will frequently leave the code in the program, but surround it with comment characters so that it will be skipped over by the compiler or interpreter. Perhaps introns are nature's way of preserving hard fought for genetic information, which is not currently needed, but might prove handy in the future. Or perhaps introns represent code which may be needed at a later stage of development and are involved with the aging process. The concept that introns represent non-executable control information, which is not used for protein synthesis, but is used for something we are currently unaware of seems to make more sense from the standpoint of natural selection, than does the concept of useless "nonsense DNA".
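The practice described above can be sketched in Python, where '#' plays the role of /* */: the old code is invisible to the interpreter but preserved in the file in case it is needed again. The function and all the rates in it are invented for illustration:

```python
# Hiding old code from the interpreter while preserving it in the source,
# in the spirit of the intron conjecture above. Rates are made up.
def postage_cents(ounces: int) -> int:
    """Postage for a letter, in cents."""
    # Old rate table, commented out but kept in case rates revert:
    # if ounces <= 1:
    #     return 49
    # return 49 + 21 * (ounces - 1)
    if ounces <= 1:
        return 73
    return 73 + 28 * (ounces - 1)

print(postage_cents(2))   # 101
```

The commented-out block contributes nothing to execution, yet it survives every copy of the file, just as intron DNA survives every replication of the gene.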
------------------------/*Keep this code in case we need it later*/------------------------------------------
One Final Very Disturbing Thought
Let me conclude our IT analysis of softwarebiology with one troubling problem. There is something definitely wrong with the data organization of complex living things, like you and me, that are composed of eukaryotic cells. Prokaryotic bacterial cells are the epitome of efficiency. As we have seen, prokaryotic bacteria generally block their genes, so that the genes can be read with a single I/O operation, while eukaryotic cells do not. Also, eukaryotic cells have genes containing large stretches of intron DNA, which is sometimes used for regulation purposes, but is not fully understood at this point. Eukaryotic cells also contain a great deal of junk DNA between the protein-coding genes, which is not used to code for proteins at all. Each human cell contains about 3 billion base pairs that code for about 20,000 – 25,000 proteins, but fully 98% of these 3 billion base pairs apparently code for nothing at all. It is just junk DNA in introns and in very long gaps between the protein-coding genes. Frequently, you will find the same seemingly mindless repeating sequences of DNA base pairs over and over in the junk DNA, going on and on for hundreds of thousands of base pairs. Now it takes a lot of time and energy to replicate all that junk DNA when a human cell divides; in fact, replication takes many hours and is the limiting factor in determining how quickly human cells can divide.
Prokaryotic bacteria, on the other hand, are just the opposite. Every base pair of DNA in bacteria has a purpose, and bacteria are champs at replicating. For example, E. coli bacteria come in several strains that have about 5 million base pairs of DNA, containing about 5,000 genes coding for proteins. As with all bacteria, the DNA in E. coli is just one big loop of DNA, and at top speed it takes about 30 minutes to replicate the 5 million base pairs of DNA. But E. coli can replicate in 20 minutes. How can they possibly do that?
Well, when an E. coli begins to replicate its loop of DNA into two daughter loops, each of the two daughter loops also begins to replicate itself before the mother E. coli even has a chance to finish dividing! That is how they compress a 30-minute process into a 20-minute window. Now try that trick during your next tight IT maintenance window!
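The trick is simple pipeline arithmetic, sketched below using the round 30- and 20-minute figures from the text: each round of replication must be initiated 10 minutes before the previous division even completes, so at any instant more than one round is in flight:

```python
# Sketch of E. coli's overlapping replication rounds: divisions every 20
# minutes even though each full copy of the chromosome takes 30 minutes.
REPLICATION_MIN = 30   # time to copy the whole loop of DNA
DIVISION_MIN = 20      # time between successive cell divisions

# A copy that finishes at division time t must have started 30 minutes
# earlier, i.e. 10 minutes before the *previous* division -- inside the
# mother (or grandmother) cell:
head_start = REPLICATION_MIN - DIVISION_MIN
print(f"each replication round starts {head_start} min before the prior division")

for division in range(1, 4):
    t_divide = division * DIVISION_MIN
    t_start = t_divide - REPLICATION_MIN
    print(f"division at t={t_divide} min used a DNA copy begun at t={t_start} min")
```

No single replication round ever takes less than 30 minutes; the cell simply keeps the copying pipeline full, like overlapping batch jobs in a tight maintenance window.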
So here is the dilemma. Simple prokaryotic bacteria are the epitome of good IT database design, while the eukaryotic cells used by the “higher” forms of life, like you and me, are an absolute disaster from a database design perspective! They certainly would never pass a structured database design review. The question would constantly come up, “Why in the world would you possibly want to do that???”. But as Ayn Rand cautioned, when things do not seem to make sense, be sure to “check your premises”. The problem is that you are looking at this entirely from an anthropocentric point of view. In school you were taught that your body consists of 100 trillion cells, and that these cells use DNA to create the proteins that you need to replicate and operate your cells. But as Richard Dawkins explains in The Selfish Gene (1976), this is totally backwards. We do not use genes to protect and replicate our bodies; genes use our bodies to protect and replicate genes! We are DNA survival machines! Darwin taught us that natural selection was driven by survival of the fittest. But survival of the fittest what? Is it survival of the fittest species, species variety, or possibly the fittest individuals within a species? Dawkins notes that none of these things actually replicate, not even individuals. All individuals are genetically unique, so they never truly replicate. What does replicate are genes, so for Dawkins, natural selection operates at the level of the gene. These genes have evolved over time to team up with other genes to form bodies, or DNA survival machines, that protect and replicate DNA, and that is why the higher forms of life are so “inefficient” when it comes to DNA. The DNA in higher forms of life is not trying to be “efficient”; it is trying to protect and replicate as much DNA as possible. Prokaryotic bacteria are small DNA survival machines that cannot afford the luxury of taking on any “passenger” junk DNA.
Only large multicellular cruise ships like us can afford that extravagance. If you have ever been a “guest” on a small sailing boat, you know that there is no such thing as a “passenger” on a small sailboat; it's always "all hands on deck" - and that includes the "guests"! Individual genes have been selected for one overriding trait, the ability to replicate, and they will do just about anything required to do so, like seeking out other DNA survival machines to mate with and rear new DNA survival machines. In Blowin’ in the Wind Bob Dylan asked the question, “How many years can a mountain exist; Before it's washed to the sea?”. Well, the answer is a few hundred million years. But some of the genes in your body are billions of years old, and as they skip down through the generations largely unscathed by time, they spend about half their time in female bodies and the other half in male bodies. If you think about it, all your physical needs and desires are geared to ensuring that your DNA survives and gets passed on, with little regard for you as a disposable DNA survival machine - truly one of those crescent Moon epiphanies! I strongly recommend that all IT professionals read The Selfish Gene, for me the most significant book of the 20th century, because it explains so much. For a book written in 1976, it makes many references to computers and data processing that you will find extremely interesting. Dawkins has written about a dozen fascinating books, and I have read them all, many of them several times over. He definitely goes on the same shelf as Copernicus, Galileo, Newton, and Darwin for me.
Because of all the similarities we have seen between biological and computer software, resulting from their common problems with the second law of thermodynamics and nonlinearity and their similar convergent historical evolutionary paths to solve those problems with the same techniques, we need to digress a bit before proceeding and ask the question - is there something more profound afoot? Why are biological and computer software so similar? Could they both belong to a higher classification of entities that face a commonality of problems with the second law of thermodynamics and nonlinearity? That will be the subject of the next posting, which will deal with the concept of self-replicating information. Self-replicating information is information that persists through time by making copies of itself or by enlisting the support of other things to ensure that copies of itself are made. We will find that the DNA in living things, Richard Dawkins’ memes, and computer software are all examples of self-replicating information, and that the fundamental problem of software, as outlined in the three laws of software mayhem, might just be the fundamental problem of everything.
Comments are welcome at firstname.lastname@example.org
To see all posts on softwarephysics in reverse order go to: