Tuesday, December 06, 2011

How Software Evolves

I just finished Evolutionary Analysis (2004) by Scott Freeman and Jon Herron, which is a college textbook on evolutionary biology. As I mentioned in How to Think Like a Softwarephysicist, I like to periodically read college textbooks cover-to-cover and skip all of the problem sets, quizzes, papers, and tests that college students are subject to, and instead, just concentrate on the essence of the material. Since my intention is to really understand the material, and not simply to pass tests, I think that I get much more long-term benefit out of these textbooks now than I did in my long-gone student days. I am now 60 years old, and when I read these very thick college textbooks, I now marvel at the huge amounts of time that the authors must have invested in creating them, something I certainly took for granted in my unappreciative college years. Anyway, after reading Evolutionary Analysis, I realized that in all of my previous posts on softwarephysics, which dealt with the evolution of software over the past 2.2 billion seconds, ever since Konrad Zuse cranked up his Z3 computer in May of 1941, I had described the evolutionary history of software, but I had not really explained the how and why of software evolution. So the subject of this posting will be a discussion of the evolutionary mechanisms involved that caused software to evolve.

In the SoftwarePaleontology section of SoftwareBiology, I provided a brief evolutionary history of the evolution of software on Earth. In that posting, I also explained that because both living things and software had to struggle with the second law of thermodynamics in a nonlinear Universe, both had converged upon very similar design solutions over time. So much can be learned about the evolution of software by examining the evolution of life on Earth over the past 4.0 billion years, and in order to do that, let us briefly explore some of the concepts of modern evolutionary theory. Modern evolutionary theory is based upon the Modern Synthesis of the 1930s and 1940s, which brought together concepts from Mendelian genetics, Darwinian natural selection, population genetics, ecology and paleontology into a single theory that explained both microevolutionary and macroevolutionary observations. The Modern Synthesis is based upon four component processes – genetic mutation, genetic migration, genetic drift and natural selection, and how the interactions of these component processes change the frequencies of genes within a population. The first two processes, genetic mutation and genetic migration, introduce new genes into a population, while the last two processes, genetic drift and natural selection, tend to reduce the number of genes within a population. The net effect of these processes is to change the statistical frequencies of genes within a population, and that is really what evolution is all about, because the genes within the population of a species determine what kinds of proteins a species can generate, and the kinds of proteins that are generated determine how the members of a species look and behave.

Genetic Mutation - A genetic mutation results when genetic DNA is copied and errors occur in the copying process, or a gene gets zapped by something like a cosmic ray and gets repaired incorrectly. Recall that genes are segments of DNA that are many thousands of nucleotides long, containing the A, C, G, and T nucleotides that encode the sequence of amino acids necessary to create a protein molecule. Protein molecules form the structural components within a living cell, or more frequently, protein molecules act as catalytic agents to speed up biochemical reactions in a biochemical pathway. For example, we Homo sapiens have about 25,000 genes that encode for the proteins necessary to build and operate a body, but only a few percent of the 3 billion DNA nucleotides found in our chromosomes are actually used to encode proteins. The vast majority of our DNA nucleotides are just along for the ride, and get replicated along with the necessary protein-encoding segments of DNA. This supports Richard Dawkins’ contention that living things are really DNA survival machines, which evolved to protect and replicate DNA using temporary and disposable bodies that persist for a very brief time, as they store and then pass on copies of DNA that are hundreds of millions or even billions of years old. Now thanks to the second law of thermodynamics, most mutations are deleterious because they usually produce a protein that is no longer functional (see The Demon of Software for details on the second law). It’s like having a poker hand with a full house of K-K-K-9-9, and trying to draw another K by discarding the two 9s. The odds are that you will destroy your beautiful full house instead of coming up with a four of a kind like K-K-K-K-2. However, very infrequently, a mutation can lead to a protein that does an even better job than the original protein, or it can even lead to a protein that does something completely different that just happens to be quite useful. 
In keeping with our poker analogy, a beneficial mutation would be like drawing a K-K-K-K-2 by discarding your 9s in a full house of K-K-K-9-9.
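The odds in the poker analogy can be worked out directly. The following is a minimal sketch, assuming a standard 52-card deck and that the two replacement cards come from the 47 cards the player has not seen; neither assumption is spelled out in the analogy itself.

```python
from fractions import Fraction
from math import comb

# Assumption: a standard 52-card deck, with the two replacement cards drawn
# from the 47 cards the player has not seen (52 minus the 5 cards dealt).
UNSEEN_CARDS = 52 - 5
ALL_DRAWS = comb(UNSEEN_CARDS, 2)

# Four of a kind: at least one of the two new cards is the last king.
draws_missing_king = comb(UNSEEN_CARDS - 1, 2)
p_four_kings = 1 - Fraction(draws_missing_king, ALL_DRAWS)

# Still a full house: the two new cards form a pair. Two 9s remain unseen
# (1 possible pair), and each of the 11 other non-king ranks offers 6 pairs.
pair_draws = comb(2, 2) + 11 * comb(4, 2)
p_full_house = Fraction(pair_draws, ALL_DRAWS)

# Everything else leaves only three kings, a worse hand than the original.
p_worse = 1 - p_four_kings - p_full_house

print("improved :", p_four_kings, float(p_four_kings))
print("unchanged:", p_full_house, float(p_full_house))
print("damaged  :", p_worse, float(p_worse))
```

Under these assumptions the beneficial outcome has a probability of 2/47, or a little over 4%, while roughly nine draws out of ten leave the hand worse off than before.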

Genetic Migration - Genetic migration occurs when a new version of a gene moves into a population from an outside location. For example, the population of a species on an island is relatively isolated from the population of the same species back on the mainland. But every so often, members of the mainland population might raft themselves to the island on board uprooted trees that were blown down in a hurricane. If these foreign, mainland members of the species also happen to be bearing mutated versions of genes, these mainland mutations can then migrate to the island population.

Genetic Drift - The more complex forms of life have at least two copies of each gene. For example, one copy of each of your genes came from the chromosome that you inherited from your father, while the other copy came from the chromosome that you inherited from your mother. If you have at least one functional version of a gene, you generally are okay, but if you happen to have drawn two non-functional versions of a gene, then you generally are in a great deal of trouble because you cannot make one of the proteins that is necessary for good health, or perhaps, even necessary for life itself. In fact, most genetic diseases result from having two malformed versions of a gene. However, because the odds of having two bad copies of a given gene are quite small, since there are not that many bad copies floating around in a population, we all generally can make the proteins that are necessary for life because we have at least one good copy. However, in large populations there will always be a small number of fathers and mothers running around with one bad copy of a gene. The odds of a male with one bad copy hooking up with a female who also has one bad copy of the same gene will be quite small, and even if they do manage to have offspring, only about ¼ of their offspring will have the bad luck of ending up with two bad copies of the gene and suffer ill effects. So bad copies of a gene can persist in a large population because the bad copy of the gene can easily hide in bodies that also have a good copy of the gene. So in large populations, deleterious genes tend to persist. However, in small populations mutant genes tend to be weeded out by sheer chance. Because there are just a few of the mutant genes floating about in a small population, there is a good chance that none of them will survive to the next generation because of sheer bad luck. 
On the other hand, if a mutant gene happens to produce a protein that works nearly as well as the original version of the gene, or perhaps even slightly better, there is also the chance that the original version of the gene might go extinct by sheer bad luck as well. Thus, in small isolated populations, the frequency of various versions of genes can slowly drift away from the original frequency that the population had, as certain versions of the genes go extinct, and this is called genetic drift. Genetic drift is very important for the evolution of new species. If a small population gets isolated on an island, the frequencies of its genes can slowly drift away from the frequencies found back on the mainland, allowing a new species to arise on the island that can no longer mate with the original species back on the mainland and produce fertile offspring with it.
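The effect of population size on drift can be seen in a small simulation. This is only a sketch: the random resampling scheme (a Wright-Fisher-style model), the population sizes, the starting frequency and the function name simulate_drift are all assumptions chosen for illustration, not anything taken from Evolutionary Analysis.

```python
import random

def simulate_drift(pop_size, start_freq, generations, rng):
    """Track the frequency of one gene version under pure chance.

    Each generation, every one of the 2 * pop_size gene copies is drawn at
    random from the previous generation's gene pool. No version has any
    survival advantage, so all change is due to luck alone.
    """
    copies = 2 * pop_size              # two copies of the gene per individual
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        carriers = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = carriers / copies
        history.append(freq)
    return history

rng = random.Random(2011)              # fixed seed so that runs are repeatable
island = simulate_drift(pop_size=10, start_freq=0.5, generations=100, rng=rng)
mainland = simulate_drift(pop_size=2000, start_freq=0.5, generations=100, rng=rng)

print("island   final frequency:", island[-1])
print("mainland final frequency:", mainland[-1])
```

Once a frequency reaches 0.0 or 1.0 in this model it stays there, which corresponds to one version of the gene going extinct. Running the sketch with different seeds gives a feel for how much more the small island population wanders than the large mainland population does.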

Natural Selection - Natural selection is the famous “survival of the fittest”. When a favorable genetic mutation occurs, or when a favorable genetic mutation migrates into a population, the statistical frequency of the favorable mutation tends to increase within the population because members of the population that have the favorable mutation tend to have a better chance of surviving and passing the favorable mutation on to their offspring. Natural selection is also very important for the evolution of new species, especially in small isolated populations under environmental stress, because natural selection can then strongly select for beneficial mutations, and these beneficial mutations do not get diluted by a large population.
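How a small survival advantage raises the frequency of a favorable mutation can be sketched with a very simple model. The model below assumes a single gene copy per individual and a fixed reproductive advantage; the 1% starting frequency and 5% advantage are hypothetical numbers picked only to show the shape of the curve.

```python
def select(freq, advantage, generations):
    """Frequency of a favorable gene version over time (simple haploid model).

    Carriers leave (1 + advantage) offspring for every 1 offspring left by
    non-carriers, so the new frequency is the carriers' share of all offspring.
    """
    history = [freq]
    for _ in range(generations):
        freq = freq * (1 + advantage) / (1 + freq * advantage)
        history.append(freq)
    return history

trajectory = select(freq=0.01, advantage=0.05, generations=300)

for generation in (0, 50, 100, 150, 200, 300):
    print(generation, round(trajectory[generation], 4))
```

Even a modest advantage compounds from one generation to the next, so the favorable version moves from rare to nearly universal in this sketch.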

The Evolutionary Processes at Work in IT
We see these same processes at work in IT when software is developed and maintained, and that is why software evolves over time. Since modern evolutionary biology is based upon the changing statistics of genes within a given population, we first need to determine what the software equivalent of genetic material is. For living things of course it is the genes composed of stretches of DNA, but for software it is source code. In order to do something useful, the information in a gene, or stretch of DNA, has to be first transcribed into messenger RNA and then translated into a protein. This process is accomplished by a number of enzymes, proteins that have a catalytic ability to speed up biochemical reactions. The sequence of operations aided by enzymes goes like this:

DNA → mRNA → tRNA → Amino Acid chain → Protein

Like DNA, the source code for a program has to be first compiled into an executable file, containing the primitive machine instructions for a computer to execute, before it can be run by a computer to do useful things. When you double-click on an icon on your desktop, like Microsoft Word, you are loading the Microsoft Word WINWORD.exe executable file into the memory of your computer where it begins to execute under a PID (process ID). After you double-click the Microsoft Word icon on your desktop, you can use CTRL-ALT-DEL to launch the Windows Task Manager, and then click on the Processes tab to find the WINWORD.exe running. This compilation process is very similar to the transcription process used to form proteins by stringing together amino acids in the proper sequence that is shown above. The output of the DNA transcription process is an executable protein that can begin processing organic molecules the moment it folds up into its usable form, and is similar to the executable file that results from compiling the source code of a program. Program source code is indeed much like DNA, but there is one subtle difference. Transcribed proteins do not have any built-in logic of their own, while executable files do. When a living cell produces a large number of protein molecules within its confines, and combines them with a large number of smaller molecules called monomers that are the building blocks of living things, wondrous things begin to happen. Biological pathways form, all on their own, from the interactions between the monomers and the enzyme proteins to form step-by-step programs to build up and process the very large molecules required by living things, and which are the basis for the large-scale biological structures found within a living cell. It’s like mixing a bunch of LEGO blocks together with many inquisitive toddlers, and allowing the toddlers to form complex LEGO structures all on their own. 
On the other hand, some biological pathways do just the opposite. They take complex organic molecules, like houses formed from large numbers of LEGO blocks, and break them down into their constituent LEGO block parts or monomers. So the logic in protein enzymes is an emergent quality that arises when enzymes and other organic molecules are mixed together in a cell. This emergent logic is not self-evident when just looking at the enzyme proteins on their own, but when you look at their resulting interactions with other organic molecules, one finds that the logic is indeed there, hiding in the structures of the individual enzyme proteins. The same is not true of software source code. The source code of a program has its logic built-in and in plain sight, and we can easily see the logic at work. I like to consider each variable and symbol in a line of source code to effectively be an enzyme protein or monomer molecule in a softwarechemical pathway reaction. In Quantum Software and SoftwareChemistry, I explained that just as protein molecules are composed of many kinds of atoms, in differing quantum states, all bound together into molecules, we can think of lines of source code in a similar manner.
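The step from source code to something a computer can run can be shown in miniature. This is a loose analogy only: the sketch uses Python, which compiles source text into bytecode for an interpreter rather than into a machine-code executable like WINWORD.exe, and the variable names and input values are hypothetical.

```python
source = "discounted_total = (total_hours * rate_per_hour) - offset"

# "Compile" the source text into a code object holding bytecode. This plays
# the role of the executable file in the description above.
code = compile(source, "<example>", "exec")
print(type(code.co_code))          # the raw bytecode is a bytes object

# Run the compiled code against some hypothetical input values.
values = {"total_hours": 40, "rate_per_hour": 25, "offset": 100}
exec(code, values)
print(values["discounted_total"])  # (40 * 25) - 100
```

The source string is readable and its logic is in plain sight, while the compiled bytes are not meant for human eyes, which is the contrast drawn above between source code and the executable.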

For example consider the line of code:

discountedTotalCost = (totalHours * ratePerHour) - costOfNormalOffset;

We can consider each character in the line of code to be in one of 256 quantum ASCII states defined by 8 quantized bits, with each bit in one of two quantum states, “1” or “0”. These two states can also be characterized as ↑ or ↓, and the 8 bits can be thought of as 8 electrons in 8 electron shells of an atom, with each electron in a spin-up ↑ or spin-down ↓ state:

C = 01000011 = ↓ ↑ ↓ ↓ ↓ ↓ ↑ ↑
H = 01001000 = ↓ ↑ ↓ ↓ ↑ ↓ ↓ ↓
N = 01001110 = ↓ ↑ ↓ ↓ ↑ ↑ ↑ ↓
O = 01001111 = ↓ ↑ ↓ ↓ ↑ ↑ ↑ ↑

Figure 1 – The electron configuration of a carbon atom is similar to the ASCII code for the letter C in the source code of a program

Thus each variable in a line of code can be considered to be a complex molecule interacting with other complex molecules.
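The mapping from characters to bits to spin arrows shown above is mechanical, and can be sketched in a few lines. The function name spin_states is made up for this example, and it assumes plain ASCII characters that fit in 8 bits.

```python
def spin_states(char):
    """Return the 8-bit ASCII pattern of a character and its arrow form."""
    bits = format(ord(char), "08b")     # e.g. "C" -> "01000011"
    arrows = " ".join("↑" if bit == "1" else "↓" for bit in bits)
    return bits, arrows

for symbol in "CHNO":
    bits, arrows = spin_states(symbol)
    print(f"{symbol} = {bits} = {arrows}")
```

Applying the same function to every character of the discountedTotalCost line would give the full "electron configuration" of that line of code.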

Now let’s look at some source code. Below are three examples of the source code to compute the average of some numbers that you enter from your keyboard when prompted. The programs are written in the C, C++, and Java programming languages. Please note that modern applications now consist of many thousands to many millions of lines of code. The simple examples below are just for the benefit of our non-IT readers to give them a sense of what is being discussed when I describe the software development life cycle below from the perspective of the Modern Synthesis.

Figure 2 – Source code for a C program that calculates an average of several numbers entered at the keyboard.

Figure 3 – Source code for a C++ program that calculates an average of several numbers entered at the keyboard.

Figure 4 – Source code for a Java program that calculates an average of several numbers entered at the keyboard.
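For readers who want a feel for what such a program looks like, here is a rough stand-in written in Python. It is not one of the three programs in the figures, and the function name average_from_lines is an assumption made for this sketch.

```python
def average_from_lines(lines):
    """Average the numbers found in an iterable of text lines.

    Blank lines are skipped. Each remaining line must hold one number.
    """
    total = 0.0
    count = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue
        total += float(text)
        count += 1
    if count == 0:
        raise ValueError("no numbers were entered")
    return total / count

# Example with canned input standing in for the keyboard:
print(average_from_lines(["10", "20", "", "60"]))   # prints 30.0
```

To average numbers typed at the keyboard instead, one could import sys and call average_from_lines(sys.stdin), ending the input with the end-of-file key for the operating system in use.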

In the SoftwarePaleontology section of SoftwareBiology, we saw that software has indeed evolved over time mainly in response to the environmental changes in the hardware environment in which it exists, in a similar fashion to the way that living things evolved on Earth in response to the environmental changes of a changing planet. Over the past 70 years there has been an explosive positive feedback loop going on between the evolution of hardware and software. As more complex software evolved and demanded more memory and faster processing speeds, hardware had to keep up by providing ever-increasing amounts of memory and processing speed, which allowed software to demand even more. The result has been that, over the past 70 years, the amount of memory and processing speed available has exploded. It is now possible to buy a $500 PC that has a billion times the memory and runs a billion times faster than Konrad Zuse’s original Z3 computer, and today’s fastest supercomputers run about 10^16 times faster than a Z3, or 10 million billion times faster. In Is Self-Replicating Information Inherently Self-Destructive?, we also saw that there have been some positive feedback loops going on between living things and the surface of the Earth as well over the past 4.0 billion years. The arrival of oxygen-producing cyanobacteria about 2.8 billion years ago allowed large amounts of oxygen to eventually appear in the Earth’s atmosphere, which later proved necessary to sustain the complex life forms that arose during the Cambrian Explosion 541 million years ago. Similarly, over the past 500 million years, these complex life forms were able to remove large amounts of the greenhouse gas carbon dioxide from the Earth’s atmosphere. The complex life forms found after the Cambrian did so by producing carbonate deposits formed from shells and reefs that were later subducted into the Earth by plate tectonics, as the Sun increased in brightness by about 5%. 
The removal of this vast quantity of the carbon dioxide greenhouse gas prevented the Earth’s temperature from rising to a level that could no longer support complex life forms. Note that in both cases, these feedback loops allowing for more complex software and more complex life forms were not planned. They both just happened on their own in a fortuitous manner, as software and hardware and living things and the Earth interacted with each other.

Now that we know that program source code is the equivalent of genes in an IT setting, we need to see how program source code changes over time in response to the evolutionary processes of genetic mutation, genetic migration, genetic drift and natural selection, and how these processes allow software to adapt to its changing hardware environment. In order to understand this, we need to explore how software is currently developed and maintained in an IT department. Programmers are now called developers so I will use that terminology going forward. Developers are broken up in IT departments into tribes of 5 – 30 developers working under a single chief, or IT application development manager. Each application development tribe of 5 – 30 developers is a semi-isolated population of developers, dedicated to supporting a number of applications, or possibly, a small segment of a very large application like Microsoft Word. In softwarephysics we consider developers to essentially be the equivalent of software enzymes, like the enzymes that copy DNA and fix DNA errors.

Mutation of Source Code
Just as changing the DNA sequence in a gene will likely produce a mutant version of a protein that is non-functional, changing just one character in the source code of a program will likely produce a mutant version of the program that is non-functional. Most likely, the mutant source code for the program will not even successfully compile, but if it should compile, the resulting executable file will have a bug in it that will make the program do strange and undesirable things. In both cases, this is the result of the second law of thermodynamics at work in a nonlinear Universe. The second law of thermodynamics simply states that the number of ways of coding up a buggy program or gene is much larger than the number of ways of coding up a useful program or gene, so whenever you change a program or gene, the odds are that you are going to make it buggy or non-functional. See Entropy - the Bane of Programmers and The Demon of Software for more on the second law of thermodynamics. Nonlinear systems are systems that are very sensitive to small changes. A very small change to a nonlinear system, like changing just one character in a gene or the source code for a program, can lead to very dramatic effects. See Software Chaos for more on nonlinear systems. This makes it very difficult to write source code that works, or to produce genes that yield useful proteins. The trick when writing source code is to make only small changes that bring the resulting executable file closer to doing what the program is intended to do, without introducing bugs at the same time that make the executable file do strange and undesirable things that are not intended.
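The claim that most single-character mutations break a program can be checked on a tiny scale. The sketch below takes a shortened version of the earlier line of code, tries every possible single-character substitution, and tallies what happens. The shortened variable names, the input values and the choice of replacement characters are all assumptions made for the experiment, and since it is run in Python, "compile" here means compiling to bytecode.

```python
import string
import warnings

SOURCE = "discounted = (hours * rate) - offset"
INPUTS = {"hours": 40, "rate": 25, "offset": 100}    # hypothetical test values
EXPECTED = 900                                       # (40 * 25) - 100
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "

def classify(mutant):
    """Return what happens when one mutated version of SOURCE is run."""
    try:
        code = compile(mutant, "<mutant>", "exec")
    except SyntaxError:
        return "does not compile"
    namespace = dict(INPUTS)
    try:
        exec(code, namespace)
    except Exception:
        return "crashes"
    if namespace.get("discounted") == EXPECTED:
        return "still works"
    return "wrong answer"

def survey():
    """Try every single-character substitution and tally the outcomes."""
    tally = {"does not compile": 0, "crashes": 0,
             "wrong answer": 0, "still works": 0}
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")      # silence compiler warnings
        for position, original in enumerate(SOURCE):
            for replacement in ALPHABET:
                if replacement != original:
                    mutant = SOURCE[:position] + replacement + SOURCE[position + 1:]
                    tally[classify(mutant)] += 1
    return tally

tally = survey()
print(tally)
```

The tally shows how the mutants divide between failing to compile, crashing, silently computing the wrong answer, and still working. Any mutants that land in the last group are the software equivalent of neutral mutations.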

Genetic Migration of Source Code
Over time, every developer acquires his or her own coding style and coding techniques. These accumulate over time, as the developer learns through trial and error, what works well and what does not. However, every so often a developer will get stumped by a particular problem and will frequently turn to other members within the development tribe of 5 – 30 developers for advice. Frequently, another member within the tribe will have some source code that can be modified slightly to solve the problem at hand, so the developer will paste this borrowed code into the source code for the program under development. Consequently, lots of old source code gets exchanged within a development tribe, just as lots of DNA tends to get exchanged within a tribe of people living on a tropical island. And like on a tropical island, every so often a new member to the development tribe will wash up onshore, bearing a new coding style and coding techniques, and lots of new source code that can also be borrowed. So just as genes tend to migrate between populations of living things, the source code for programs can migrate between development tribes. The advent of the Internet has greatly increased this migration of program source code. Thanks to Google, it is now possible to find lots of program source code on the Internet that can be modified to solve any given problem.

Genetic Drift of Source Code
As we saw with genes, many mutations to source code have no effect upon how the resulting executable files behave when they run in a computer. Indeed, it is possible to code up any given program in a nearly infinite number of ways, even using many different programming languages. For example, the three programs above, written in C, C++, and Java, all behave exactly the same when run. So over time, source code coding styles and coding techniques, and even the choice of programming languages, tend to drift for a developer and a development tribe. For example, the unstructured code of the 1950s and 1960s was replaced by the structured code of the 1970s and 1980s, which was later replaced by the object-oriented code of the 1990s. However, all of these coding techniques and their associated programming languages could still be used today to produce an executable file that performs the desired functions of a program. See the SoftwarePaleontology section of SoftwareBiology for details on the evolution of coding techniques.
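As a small illustration of such neutral variation, here are three ways of coding the same averaging calculation, loosely imitating the unstructured, structured and object-oriented styles. This is only a sketch in a single modern language: Python has no goto statement, so the first version merely approximates the flavor of early unstructured code, and all of the names are made up for the example.

```python
def average_unstructured(numbers):
    """Index-and-counter style, the closest Python gets to early code."""
    total = 0.0
    i = 0
    while i < len(numbers):
        total = total + numbers[i]
        i = i + 1
    return total / len(numbers)

def average_structured(numbers):
    """Structured style built from a loop over the values themselves."""
    total = 0.0
    for number in numbers:
        total += number
    return total / len(numbers)

class Averager:
    """Object-oriented style: the running state lives inside an object."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, number):
        self.total += number
        self.count += 1

    def result(self):
        return self.total / self.count

data = [10, 20, 60]
averager = Averager()
for number in data:
    averager.add(number)

print(average_unstructured(data), average_structured(data), averager.result())
```

All three produce the same answer for the same input, so nothing about the program's behavior favors one style over another. Which style a tribe ends up using is left to habit and fashion.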

Natural Selection of Source Code
Thanks to the second law of thermodynamics, most random changes to the sequence of nucleotides in a DNA gene will not generate a functional protein. In a similar fashion, most changes to the source code file for a currently functional program will not generate an executable file that still performs the desired functions, and because the Universe is nonlinear, such small coding errors in either DNA or source code files will likely produce disastrous results. For both living things and software there seems to be only one way around these two major obstacles of the second law of thermodynamics and nonlinearity. This mechanism was first brought to light by Charles Darwin and Alfred Russel Wallace in 1858. In the Modern Synthesis it goes like this. Within any given population of a species, there will always be genetic variation amongst its members caused by genetic mutations, genetic migration and genetic drift of its genes. Thanks to Mendelian genetics, the genes responsible for these genetic variations are also inheritable, and can be passed down to subsequent generations. Most of these variations are neutral or detrimental in nature when it comes to the survival of the individuals possessing them, but once in a great while, a genetic variation will be found to be beneficial, and give individuals carrying such genes a better chance at surviving and passing these new beneficial genes on to their offspring. Darwin called this process natural selection, because it reminded him of the artificial selection process used by breeders of domesticated animals. Darwin noted that by allowing domesticated animals with certain desirable traits to only breed with other domesticated animals with similar desirable traits, breeders were able to produce domesticated animals with far superior traits compared to their wild ancestors. 
For example, by only allowing turkeys and pigs with desirable traits to breed with other turkeys and pigs with desirable traits, breeders over the centuries managed to produce the modern turkeys and pigs of today, which are capable of producing far more meat than their distant ancestors. In a similar manner, Nature automatically selects for members of a species that are better at surviving, and allows them to pass on their desirable genetic traits to their offspring. As these small changes to traits accumulate within a population, eventually a new species will arise, especially in small isolated populations. This is the famous “survival of the fittest” concept of Darwin’s natural selection, a phrase first coined by the British philosopher Herbert Spencer.

The Software Development Life Cycle From the Perspective of the Modern Synthesis
With all of this background information at hand, let us now see how a developer goes about producing a new piece of software, in a manner similar to how Nature goes about producing a new species. Developers never code up the source code for new software from scratch. Instead, developers take old existing code from previous applications, or from the applications of others in their development tribe, or perhaps even from the Internet itself as a starting point and then use the Darwinian processes of innovation and natural selection to evolve the software into the final product. So most new applications inherit lots of old code from ancestral applications that were successful and survived the development process to ultimately end up in production. The source code for applications that died before reaching production usually ends up getting deleted, so the source code for new applications generally comes from the surviving “winners” of a population. The developer then begins a tedious life cycle process consisting of evolving the new software over many thousands of generations:

borrow some old code → modify code → test → fix code → test → borrow some more old code → modify code → test → fix code → test ....
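The cycle above can be imitated with a toy program that starts from a piece of borrowed code, mutates one character per generation, tests the result, and keeps only those changes that do not make things worse. It is a deliberately crude sketch: the "test suite" here simply counts characters that match a known finished line, which real testing never has available, and the borrowed starting line, the alphabet and the function names are all hypothetical.

```python
import random

TARGET = "discountedTotalCost = (totalHours * ratePerHour) - costOfNormalOffset;"
BORROWED = "totalCost = (totalHours * ratePerHour);"    # hypothetical old code
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ =()*-;"

def passing_tests(candidate):
    """Toy test suite: one 'test' passes per character that matches TARGET."""
    return sum(1 for a, b in zip(candidate, TARGET) if a == b)

def develop(start, rng, max_generations=100_000):
    """Mutate one character per generation and keep changes that do not regress."""
    current = list(start.ljust(len(TARGET))[:len(TARGET)])
    score = passing_tests(current)
    generations = 0
    while score < len(TARGET) and generations < max_generations:
        generations += 1
        position = rng.randrange(len(TARGET))
        mutant = current.copy()
        mutant[position] = rng.choice(ALPHABET)      # modify code
        mutant_score = passing_tests(mutant)         # test
        if mutant_score >= score:                    # selection
            current, score = mutant, mutant_score
    return "".join(current), generations

rng = random.Random(1941)                            # fixed seed, repeatable run
result, generations = develop(BORROWED, rng)
print(generations, "generations:", result)
```

Blind mutation alone would essentially never find the finished line, but mutation combined with selection at every test gets there within the generation limit. A real developer shortens the process further by choosing sensible mutations rather than random ones.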

During this very long evolutionary development process, frequently more old code from other existing applications is also introduced, as the new software slowly progresses towards completion. In this development process we see all of the elements of the Modern Synthesis. The source code for new software inherits source code from the successful software of the past, which might come from the native stock found within a development tribe, or from source code that has recently migrated into the development tribe from outside. As the source code is developed by mutating the code in small incremental steps, natural selection determines which version of the source code ultimately passes on to the next step or generation in the development process at each point that the new source code is tested. Source code is even subject to genetic drift within a development tribe, as coding styles arbitrarily change with time and new computer languages are adopted by the development group. Once an application is in production, the development life cycle continues on as the application enters into a maintenance mode. Bugs constantly need to be corrected and additional features need to be added, and this is accomplished using the same process outlined above of introducing and modifying existing code from other applications, mutating the original source code of the application with small changes, and constantly using natural selection to test how closely the application has come to correcting a bug or adding a new feature every time the code is changed. Thus, the software continues to evolve through small incremental changes over many thousands of testing generations, until the desired result is achieved. 
Because all of the software throughout the world is currently being worked upon by millions of developers, all at the same time, and the time interval between generational tests during the development and maintenance cycles might be only a matter of a few minutes or seconds, software has tended to evolve over the past 70 years about 60 million times faster than life on Earth.
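The round numbers used in this posting can be checked with a little arithmetic. In the sketch below, the exact day in May of 1941 is an assumption, and dividing the age of life on Earth by the age of software is only one plausible way of arriving at a figure near 60 million.

```python
from datetime import datetime

Z3_START = datetime(1941, 5, 12)       # assumed day in May of 1941
POST_DATE = datetime(2011, 12, 6)      # date of this posting

seconds_of_software = (POST_DATE - Z3_START).total_seconds()
print(f"{seconds_of_software:.2e} seconds of software evolution")

YEARS_OF_LIFE = 4.0e9                  # age of life on Earth used in the text
YEARS_OF_SOFTWARE = 70                 # rounded age of software used in the text

ratio = YEARS_OF_LIFE / YEARS_OF_SOFTWARE
print(f"life has had about {ratio / 1e6:.0f} million times longer to evolve")
```

The first figure comes out at a little over 2.2 billion seconds, and the second at roughly 57 million, which rounds to the 60 million quoted above.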

In addition to application code, the code for the software infrastructure has also evolved over time in a similar manner. Infrastructure developers work on the source code for operating systems like Windows and Unix, compilers for computer languages like C, C++ and Java, J2EE appservers like WebSphere, webservers like Apache, database management systems like Oracle and DB2, transaction monitors like CICS, and security software like LDAP. And these software infrastructure elements also evolve over time because of the same evolutionary processes outlined above. Notice that all three programs in the figures above that compute the average of a series of numbers entered via a keyboard are very similar. That is because the Java programming language (1995) evolved from the C++ programming language (1983), which evolved from the C programming language (1973).

Thus, because all of the same evolutionary processes are at work for both living things and software, there should be no surprise that both have evolved in a similar manner, and that both have converged upon very similar solutions to overcome both the second law of thermodynamics and nonlinearity.

Comments are welcome at scj333@sbcglobal.net

To see all posts on softwarephysics in reverse order go to:

Steve Johnston
