As we have seen in many previous softwarephysics postings, commercial software has been evolving about 100 million times faster than biological software over the past 70 years, or 2.2 billion seconds, ever since Konrad Zuse cranked up his Z3 computer in May of 1941, and the architectural history of commercial software has essentially recapitulated the evolutionary history of life on Earth over this period of time through a process of convergence. Over the years, the architecture of commercial software has passed through a very lengthy period of prokaryotic architecture (1941 – 1972), followed by a period of single-celled eukaryotic architecture (1972 – 1992). Multicellular organization took off next with the Object Oriented revolution of the early 1990s, especially with the arrival of Java in 1995. And about 10 years ago, commercial software entered into a Cambrian explosion of its own with the advent of SOA (Service Oriented Architecture) in which large-scale multicellular applications first appeared, chiefly in the form of high-volume corporate websites. For more on this see the SoftwarePaleontology section of:
This convergence of commercial software and biological software upon the very same architectural solutions to overcome the effects of the second law of thermodynamics in a nonlinear Universe provides IT professionals with a unique opportunity to observe the evolution of commercial software in real time, and over the course of an entire career, personally experience the macroevolution of software in action. For example, when I first started programming in 1972, I was taught how to code unstructured prokaryotic code. A few years later I ran across my very first structured eukaryotic program written by a coworker who was just learning to write structured programs at the University of Houston. At the time, I was shocked by the structured eukaryotic code and did not know exactly what to make of it. Twenty years after that shock, object-oriented multicellular code came along, and once again I was shocked. Currently, IT is about 10 years into the Cambrian explosion of SOA (Service Oriented Architecture), and once again IT is struggling.
The Evolution of Data Storage
Along with this evolution of software architecture over time, there has also been a dramatic evolution in how commercial software stores and accesses data, one that I believe also has some significant relevance to biology. In IT we call the way data is stored and accessed an “access method”, and we have primarily evolved two ways of storing and accessing data, either sequentially or via indexes. With a sequential access method data is stored in records, with one record following another, like a deck of index cards with the names and addresses of your friends and associates on them. In fact, the very first sequential files were indeed large decks of punch cards. Later, the large “tub files” of huge decks of punch cards were replaced by storing records sequentially on magnetic tape.
Figure 1 - Each card was composed of an 80 byte record that could hold 80 characters of data with the names and addresses of customers.
Figure 2 – The very first sequential files were large decks of punch cards composed of one record after another.
An indexed access method works like the index at the end of a book and essentially tells you where to find a particular record in a deck of index or punch cards. In the discussion below we will see how moving from sequential access methods, running on magnetic tape, to indexed access methods, running on disk drives, prompted a dramatic revolution in commercial software architecture. It allowed commercial software to move from the batch processing of the 1950s and 1960s to the interactive processing of the 1970s and beyond, which is the most prominent form of processing today. However, even today commercial software still uses both sequential and indexed access methods because in some cases the batch processing of sequential data makes more sense than the interactive processing of indexed data. Later in this posting we will see that in biology there was also a similar revolution in data storage technology when the simple DNA loops found within single-celled prokaryotic bacteria and archaea evolved into the complex DNA organization of the eukaryotes using chromatin and chromosomes to store DNA.
Sequential Access Methods
One of the simplest and oldest sequential access methods is called QSAM - Queued Sequential Access Method:
Queued Sequential Access Method
I did a lot of magnetic tape processing in the 1970s and early 1980s using QSAM. At the time we used 9 track tapes that were 1/2 inch wide and 2400 feet long on a reel with a 10.5 inch diameter. The tape had 8 data tracks and one parity track across the 1/2 inch tape width. That way we could store one byte across the eight 1-bit data tracks in a frame, and we used the parity track to check for errors. We used odd parity: if the 8 bits on the 8 data tracks in a frame added up to an even number of 1s, we put a 1 in the parity track to make the total number of 1s an odd number. If the 8 bits added up to an odd number of 1s, we put a 0 in the parity track to keep the total number of 1s an odd number. Originally, 9 track tapes had a density of 1600 bytes/inch of tape, with a data transfer rate of 15,000 bytes/second. Remember, a byte is 8 bits and can store one character, like the letter “A” which we encode in the ASCII code set as A = “01000001”.
Figure 3 – A 1/2 inch wide 9 track magnetic tape on a 2400 foot reel with a diameter of 10.5 inches
Figure 4 – 9 track magnetic tape had 8 data tracks and one parity track using odd parity which allowed for the detection of bad bytes with parity errors on the tape.
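The odd parity rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the logic of any real tape controller, and the function names are made up for the example.

```python
def odd_parity_bit(byte: int) -> int:
    """Return the parity-track bit that gives the 9-bit frame an odd number of 1s."""
    ones = bin(byte & 0xFF).count("1")
    return 1 if ones % 2 == 0 else 0


def frame_is_valid(byte: int, parity: int) -> bool:
    """A frame is good when the data bits plus the parity bit hold an odd number of 1s."""
    return (bin(byte & 0xFF).count("1") + parity) % 2 == 1


letter_a = ord("A")                  # 0b01000001, which has two 1 bits
parity = odd_parity_bit(letter_a)    # an even count of 1s, so the parity bit is 1
corrupted = letter_a ^ 0b00000100    # flip one data bit to simulate a bad spot on the tape

print(format(letter_a, "08b"), parity)       # 01000001 1
print(frame_is_valid(letter_a, parity))      # True
print(frame_is_valid(corrupted, parity))     # False
```

Note that a single parity bit can only detect an odd number of flipped bits in a frame; it cannot say which bit went bad.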
Later, 6250 bytes/inch tape drives became available, and I will use that density for the calculations that follow. Now suppose you had 50 million customers and the current account balance for each customer was stored on an 80 byte customer record. A record was like a row in a spreadsheet. The first field of the record was usually a CustomerID field that contained a unique customer ID like a social security number, and was essentially the equivalent of a promoter region on the front end of a gene in DNA. The remainder of the 80 byte customer record contained fields for the customer’s name and billing address, along with the customer’s current account information. Between each block of data on the tape there was a 0.5 inch gap of “junk” tape. This “junk” tape allowed for the acceleration and deceleration of the tape reel as it spun past the read/write head of a tape drive and perhaps occasionally reversed direction. Since an 80 byte record only came to 80/6250 = 0.0128 inches of tape, which is quite short compared to the overhead of the 0.5 inch gap of “junk” tape between records, it made sense to block many records together into a single block of data that could be read by the tape drive in a single I/O operation. For example, blocking 100 80 byte records increased the block size to 8000/6250 = 1.28 inches and between each 1.28 inch block of data on the tape there was the 0.5 inch gap of “junk” tape. This greatly reduced the amount of wasted “junk” tape on a 2400 foot reel of tape. So each 100 record block of data took up a total of 1.78 inches of tape and we could get 16,180 blocks on a 2400 foot tape or the data for 1,618,000 customers per tape. The advantage of QSAM, over an earlier sequential access method known as BSAM, was that you could read and write an entire block of records at a time via an I/O buffer. In our example, a program could read one record at a time from an I/O buffer which contained the 100 records from a single block of data on the tape. 
When the I/O buffer was depleted of records, the next 100 records were read in from the next block of records on the tape. Similarly, programs could write one record at a time to the I/O buffer, and when the I/O buffer was filled with 100 records, the entire I/O buffer with 100 records in it was written as the next block of data on an output tape.
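The buffering idea can be sketched as follows. This is a toy model in Python, with a tape represented as a list of blocks and hypothetical function names; it is not how QSAM itself was coded, only an illustration of a program seeing one record at a time while the device moves whole blocks.

```python
from typing import Iterable, Iterator, List

BLOCKING_FACTOR = 100  # records per block, as in the example above


def read_records(blocks: Iterable[List[str]]) -> Iterator[str]:
    """Hand the program one logical record at a time."""
    for buffer in blocks:        # one physical I/O operation refills the buffer
        for record in buffer:    # the program drains the buffer record by record
            yield record


def write_records(records: Iterable[str], blocking_factor: int = BLOCKING_FACTOR) -> List[List[str]]:
    """Collect records in a buffer and write a whole block each time the buffer fills."""
    tape: List[List[str]] = []
    buffer: List[str] = []
    for record in records:
        buffer.append(record)
        if len(buffer) == blocking_factor:
            tape.append(buffer)  # one physical I/O operation writes the full block
            buffer = []
    if buffer:
        tape.append(buffer)      # a final short block
    return tape


records = [f"{i:09d}".ljust(80) for i in range(250)]   # 250 made-up 80-byte records
tape = write_records(records)
print([len(block) for block in tape])                   # [100, 100, 50]
print(list(read_records(tape)) == records)              # True
```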
The use of a blocked I/O buffer provided a significant distinction between the way data was physically stored on tape and the way programs logically processed the data. The difference between the way things are physically implemented and the way things are logically viewed by software is a really big deal in IT. The history of IT over the past 70 years has really been a history of logically abstracting physical things through the increasing use of layers of abstraction, to the point where today, IT professionals rarely think of physical things at all. Everything just resides in a logical “Cloud”. I think that taking more of a logical view of things, rather than taking a physical view of things, would greatly help biologists at this point in the history of biology. Biologists should not get so hung up about where the information for biological software is physically located. Rather, biologists should take a cue from IT professionals, and start thinking more of biological software in logical terms, rather than physical terms.
Figure 5 – Between each record, or block of records, on a magnetic tape there was a 0.5 inch gap of “junk” tape. The “junk” tape allowed for the acceleration and deceleration of the tape reel as it spun past the read/write head on a tape drive. Since an 80 byte record only came to 80/6250 = 0.0128 inches, it made sense to block many records together into a single block that could be read by the tape drive in a single I/O operation. For example blocking 100 80 byte records increased the block size to 8000/6250 = 1.28 inches, and between each 1.28 inch block of data on the tape there was a 0.5 inch gap of “junk” tape for a total of 1.78 inches per block.
Figure 6 – Blocking records on tape allowed data to be stored more efficiently.
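The blocking arithmetic above is easy to check with a short Python sketch. The constants come from the text (6250 bytes/inch, 80 byte records, a 0.5 inch gap, a 2400 foot reel); the function name is just for illustration.

```python
import math

DENSITY = 6250            # bytes per inch of tape
RECORD_BYTES = 80
GAP_INCHES = 0.5          # gap of "junk" tape between blocks
REEL_INCHES = 2400 * 12   # a 2400 foot reel
CUSTOMERS = 50_000_000


def customers_per_tape(records_per_block: int) -> int:
    """Whole blocks that fit on one reel, times the records in each block."""
    block_inches = records_per_block * RECORD_BYTES / DENSITY + GAP_INCHES
    return math.floor(REEL_INCHES / block_inches) * records_per_block


for factor in (1, 100):
    per_tape = customers_per_tape(factor)
    print(factor, per_tape, math.ceil(CUSTOMERS / per_tape))
```

With a blocking factor of 100 this gives 16,179 whole blocks per reel, which the text rounds to 16,180, and either figure works out to 31 tapes for 50 million customers. With unblocked records the gaps dominate and the same data would need many hundreds of tapes.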
So it took 31 tapes to just store the rudimentary account data for 50 million customers. The problem was that each tape could only store 123 MB of data. Not too good, considering that today you can buy a 1 TB PC disk drive that can hold 8525 times as much data for about $100! Today, you could also store about 67 times as much data on a $7.00 8 GB thumb drive. So how could you find the data for a particular customer on 74,000 feet (14 miles) of tape? Well, you really could not do that reading one block of data at a time with the read/write head of a tape drive, so we processed data with batch jobs using lots of input and output tapes. Generally, we had a Master Customer File on 31 tapes and a large number of Transaction tapes with insert, update, and delete records for customers. All the tapes were sorted by the CustomerID field, and our programs would read a Master tape and a Transaction tape at the same time and apply the inserts, updates and deletes on the Transaction tape to a new Master tape. So your batch job would read a Master and Transaction input tape at the same time and would then write to a single new Master output tape. These batch jobs would run for many hours, with lots of mounting and unmounting of dozens of tapes.
Figure 7 – Batch processing of 50 million customers took a lot of tapes and tape drives.
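A minimal sketch of such a batch update is shown below. It assumes both files are sorted by CustomerID and that there is at most one transaction per customer; the record layouts and the action codes "I", "U" and "D" are invented for the example.

```python
from typing import Iterable, Iterator, Tuple

Master = Tuple[int, str]            # (CustomerID, account data)
Transaction = Tuple[int, str, str]  # (CustomerID, action, new data)


def apply_transactions(master: Iterable[Master],
                       transactions: Iterable[Transaction]) -> Iterator[Master]:
    """Merge a sorted Master file with a sorted Transaction file into a new Master file.

    Both inputs are read strictly front to back, as tape requires.
    """
    trans_iter = iter(transactions)
    trans = next(trans_iter, None)
    for cust_id, data in master:
        # Inserts that sort before the current master record are written first.
        while trans is not None and trans[0] < cust_id:
            if trans[1] == "I":
                yield (trans[0], trans[2])
            trans = next(trans_iter, None)
        if trans is not None and trans[0] == cust_id:
            action, new_data = trans[1], trans[2]
            trans = next(trans_iter, None)
            if action == "D":
                continue             # deleted: not copied to the new Master
            if action == "U":
                data = new_data
        yield (cust_id, data)
    # Anything left after the old Master is exhausted can only be an insert.
    while trans is not None:
        if trans[1] == "I":
            yield (trans[0], trans[2])
        trans = next(trans_iter, None)


old_master = [(1, "bal=10"), (3, "bal=30"), (5, "bal=50")]
transactions = [(2, "I", "bal=20"), (3, "U", "bal=35"), (5, "D", ""), (7, "I", "bal=70")]
new_master = list(apply_transactions(old_master, transactions))
print(new_master)   # [(1, 'bal=10'), (2, 'bal=20'), (3, 'bal=35'), (7, 'bal=70')]
```

Because neither input is ever rewound, the job touches every Master record even when only a handful of customers changed, which is why these runs took hours.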
Clearly, this technology would not work for a customer calling in and wanting to know his current account status at this very moment. The solution was to use multiple transcription sites along the 14 miles of tape. This was accomplished by moving the customer data to disk drives. A disk drive is like a stack of old phonograph records on a rapidly rotating spindle. Each platter has its own access arm with a read/write head, like the tone arm on an old turntable. To quickly get to the data on a disk drive IT invented new access methods that used indexes, like ISAM and VSAM. These hierarchical indexes work like this. Suppose you want to find one customer out of 50 million via their CustomerID. You first look up the CustomerID in a book that only contains an index of other books. The index entry for the particular CustomerID tells you which book to look in next. The next book also just consists of an index of other books. Finally, after maybe 4 or 5 reads, you get to a book that has an index of books with “leaf” pages. This index tells you what book to get next and on what “leaf page” you can find the customer record for the CustomerID that you are interested in. So instead of spending many hours reading through perhaps 14 miles of tape on 31 tapes, you can find the customer record in a few milliseconds and put it on a webpage. For example, suppose you have 200 customers instead of 50 million and you would like to find the information on customer 190. If the customer data were stored as a sequential file on magnetic tape, you would have to read through the first 189 customer records before you finally got to customer 190. However, if the customer data were stored on a disk drive, using an indexed access method like ISAM or VSAM, you could get to the customer after 3 reads that get you to the leaf page containing records 176 – 200, and you would only have to read 14 records on the leaf page before you got to record 190.
For more on these indexed access methods see:
ISAM Indexed Sequential Access Method
VSAM Virtual Storage Access Method
Figure 8 – Disk drives allowed for indexed access methods like ISAM and VSAM to quickly access an individual record.
Figure 9 – To find customer 190 out of 200 on a magnetic tape would require sequentially reading 189 customer records. Using the above hierarchical index would only require 3 reads to get to the leaf page containing records 176 – 200. Then an additional 14 reads would get you to customer record 190.
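The 200 customer example can be sketched in Python as a small tree of index pages over leaf pages. The page classes, the fan-out of 2 and the leaf size of 25 records are assumptions chosen so that the numbers match the example; real ISAM and VSAM index structures are laid out differently.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class LeafPage:
    records: List[int]   # CustomerIDs stored on this page, in order


@dataclass
class IndexPage:
    # Each entry is (highest CustomerID reachable through the child, child page).
    entries: List[Tuple[int, Union["IndexPage", LeafPage]]]


def build_index(pages: list, highs: List[int], fanout: int) -> IndexPage:
    """Stack index levels on top of the leaf pages until a single root page remains."""
    while True:
        groups = [list(zip(highs[i:i + fanout], pages[i:i + fanout]))
                  for i in range(0, len(pages), fanout)]
        pages = [IndexPage(group) for group in groups]
        highs = [group[-1][0] for group in groups]
        if len(pages) == 1:
            return pages[0]


def find(root: IndexPage, customer_id: int) -> Tuple[int, int]:
    """Return (index pages read, records passed over on the leaf page before the target)."""
    page = root
    index_reads = 0
    while isinstance(page, IndexPage):
        index_reads += 1
        # Follow the first entry whose high key covers the CustomerID.
        page = next(child for high, child in page.entries if customer_id <= high)
    return index_reads, page.records.index(customer_id)


leaves = [LeafPage(list(range(start, start + 25))) for start in range(1, 201, 25)]
root = build_index(leaves, [leaf.records[-1] for leaf in leaves], fanout=2)
print(find(root, 190))   # (3, 14)
```

Three index reads land on the leaf page holding records 176 – 200, and 14 records are passed over before record 190, compared with 189 records on a sequential tape.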
The key advance that came with the ISAM and VSAM access methods over QSAM was that they allowed commercial software to move from batch processing to interactive processing in the 1970s and 1980s. That was a major revolution in IT.
Today we store most commercial data on relational databases, like IBM’s DB2 or Oracle’s database software, but these relational databases still use hierarchical indexing like VSAM under the hood. Relational databases logically store data on tables. A table is much like a spreadsheet and contains many rows of data that are formatted into a number of well-defined columns. A large number of indexes are then formed using combinations of data columns to get to a particular row in the table. Tables can also be logically joined together into composite tables with logical rows of data that contain all of the data on several tables merged together, and indexes can be created on the joined tables to allow programs to quickly access the data. For large-scale commercial software these relational databases can become quite huge and incredibly complicated, with huge numbers of tables and indexes, forming a very complicated nonlinear network of components, and the design of these huge networks of tables and indexes is crucial to processing speed and throughput. A large-scale relational database may contain several thousand tables and indexes, and a poorly designed relational database can be just as harmful to the performance of a high-volume corporate website as buggy software. A single corrupted index can easily bring a high-volume corporate website crashing down, resulting in the loss of thousands of dollars for each second of down time.
Figure 10 – Modern relational databases store data on a large number of tables and use many indexes to quickly access the data in a particular row of a table or a row in a combination of joined tables. Large-scale commercial applications frequently have databases with several thousand tables and several thousand indexes.
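As a small illustration of tables, indexes and joins, here is a sketch using the SQLite engine that ships with Python, standing in for a large commercial database. The table names, column names and data are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
conn.execute("CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
# An index on the join column lets the engine go straight to a customer's invoices.
conn.execute("CREATE INDEX idx_invoice_customer ON invoice (customer_id)")

conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(i, f"Customer {i}", 0.0) for i in range(1, 201)])
conn.executemany("INSERT INTO invoice (customer_id, amount) VALUES (?, ?)",
                 [(i, 10.0 * i) for i in range(1, 201)])

# Logically join the two tables and pull out one row by CustomerID.
row = conn.execute(
    "SELECT c.name, i.amount FROM customer c "
    "JOIN invoice i ON i.customer_id = c.customer_id "
    "WHERE c.customer_id = ?",
    (190,),
).fetchone()
print(row)   # ('Customer 190', 1900.0)
```

The program only states what it wants logically; which indexes get walked to find the row is decided by the database engine.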
But remember, under the hood, these relational databases are all based upon indexed access methods like VSAM, and VSAM itself is just a logical view of what is actually going on in the software controlling the disk drives themselves, so essentially we have a lengthy series of logical indexes of logical indexes, of logical indexes, of logical indexes…. The point is that in modern commercial software there is a great deal of information stored in the network of components that is used to determine how information is read and written. If you dig down deep into the files running a relational database, you can actually see things like the names and addresses of customers, but you will also find huge amounts of control information that lets programs get to those names and addresses efficiently, and if any of that control information gets messed up your website comes crashing down.
Biological Access Methods
Nearly all biological functions are performed by proteins. A protein is formed by combining 20 different amino acids into different sequences, and on average it takes about 400 amino acids strung together to form a functional protein. The information to do that is encoded in base pairs running along a strand of DNA. Each base can be in one of four states – A, C, G, or T, and an A will always be found to pair with a T, while a C will always pair with a G. So DNA is really a 2 track tape with one data track and one parity track. For example, if there is an A on the DNA data track, you will find a T on the DNA parity track. This allows not only for the detection of parity errors, but also for the correction of parity errors in DNA by enzymes that run up and down the DNA tape looking for parity errors and correcting them.
Figure 11 – DNA is a two track tape, with one data track and one parity track. This allows not only for the detection of parity errors, but also for the correction of parity errors in DNA by enzymes that run up and down the DNA tape looking for parity errors and correcting them.
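The parity-track analogy can be sketched as below. This toy model pairs the two tracks position by position and treats the data track as the trusted copy when repairing, which is a simplifying assumption: real repair enzymes have to work out which strand holds the error, and the two strands actually run in opposite directions.

```python
from typing import List

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def parity_errors(data_track: str, parity_track: str) -> List[int]:
    """Return the positions where the two tracks are not complementary."""
    return [i for i, (d, p) in enumerate(zip(data_track, parity_track))
            if COMPLEMENT[d] != p]


def repair(data_track: str, parity_track: str) -> str:
    """Rewrite the parity track from the data track at each mismatched position."""
    bad = set(parity_errors(data_track, parity_track))
    return "".join(COMPLEMENT[d] if i in bad else p
                   for i, (d, p) in enumerate(zip(data_track, parity_track)))


data = "ATGGCC"
parity = "TACAGG"                    # position 3 holds A where a C should pair with G
print(parity_errors(data, parity))   # [3]
print(repair(data, parity))          # TACCGG
```

Unlike the single parity bit on a 9 track tape, the second DNA track carries enough information to rebuild the first, which is what makes correction possible and not just detection.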
Now a single base pair can code for 4 different amino acids because a single base pair can be in one of 4 states. Two base pairs can code for 4 x 4 = 16 different amino acids, which is not enough. Three base pairs can code for 4 x 4 x 4 = 64 amino acids which is more than enough to code for 20 different amino acids. So it takes a minimum of three bases to fully encode the 20 different amino acids, leaving 44 combinations for redundancy. Biologists call these three base pair combinations a “codon”, but a codon really is just a biological byte composed of three biological bits or base pairs that code for an amino acid. Actually three of the base pair combinations, or codons, are used as STOP codons – TAA, TAG and TGA which are essentially end of file markers designating the end of a gene along the sequential file of DNA. As with magnetic tape, there is a section of “junk” DNA between genes along the DNA 2 track tape. According to Shannon’s equation, a DNA base contains 2 bits of information, so a codon can store 6 bits. For more on this see Some More Information About Information.
Figure 12 – Three bases combine to form a codon, or a biological byte, composed of three biological bits, and encodes the information for one amino acid along the chain of amino acids that form a protein.
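The counting argument for codons can be checked with a few lines of Python. The function names are only for illustration, and the bit count assumes all four bases are equally likely, which is the case where Shannon's formula gives log2(4) = 2 bits per base.

```python
import math
from itertools import product

BASES = "ACGT"
AMINO_ACIDS_NEEDED = 20


def codon_count(length: int) -> int:
    """Number of distinct base sequences of the given length."""
    return len(list(product(BASES, repeat=length)))


def bits_per_codon(length: int) -> float:
    """Shannon information of a codon, assuming equally likely bases."""
    return length * math.log2(len(BASES))


for length in (1, 2, 3):
    print(length, codon_count(length),
          codon_count(length) >= AMINO_ACIDS_NEEDED, bits_per_codon(length))
# 1 4 False 2.0
# 2 16 False 4.0
# 3 64 True 6.0
```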
The beginning of a gene is denoted by a section of promoter DNA that identifies the beginning of the gene, like the CustomerID field on a record, and the gene is terminated by a STOP codon of TAA, TAG or TGA. Just as there was a 0.50 inch gap of “junk” tape between blocks of records on a magnetic computer tape, there is a section of “junk” DNA between each gene along the 6 feet of DNA tape found within human cells.
Figure 13 - On average, each gene is about 400 codons long and ends in a STOP codon TAA, TAG or TGA which are essentially end of file markers designating the end of a gene along the sequential file of DNA. As with magnetic tape, there is a section of “junk” DNA between genes which is shown in grey above.
In order to build a protein, genes are first transcribed to an I/O buffer called mRNA. The 2 track DNA file for a gene is first opened near the promoter of a gene and an enzyme called RNA polymerase then begins to copy the codons or biological bytes along the data track of the DNA tape to an mRNA I/O buffer. The mRNA I/O buffer is then read by a ribosome read/write head as it travels along the mRNA I/O buffer. The ribosome read/write head reads each codon or biological byte of data along the mRNA I/O buffer, and writes out a chain of amino acids as tRNA brings in one amino acid after another in the sequence specified by the mRNA I/O buffer.
Figure 14 - In order to build a protein, genes are first transcribed to an I/O buffer called mRNA. The 2 track DNA file for a gene is first opened near the promoter of a gene and an enzyme called RNA polymerase then begins to copy the codons or biological bytes along the data track of the DNA tape to the mRNA I/O buffer. The mRNA I/O buffer is then read by a ribosome read/write head as it travels along the mRNA I/O buffer. The ribosome read/write head reads each codon or biological byte of data along the mRNA I/O buffer and writes out a chain of amino acids as tRNA brings in one amino acid after another in the sequence specified by the mRNA I/O buffer.
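The transcribe-then-translate flow can be sketched as follows. The codon table is the standard genetic code written in the usual compact TCAG ordering with one-letter amino acid codes; the short gene is made up, and the sketch ignores promoters, start-codon scanning, introns and the tRNA machinery entirely.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every DNA codon to a one-letter amino acid code; "*" marks a STOP codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(gene: str) -> str:
    """Copy the gene's codons to an mRNA buffer (RNA uses U where DNA uses T)."""
    return gene.replace("T", "U")


def translate(mrna: str) -> str:
    """Read the buffer one codon at a time and stop at the first STOP codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino = CODON_TABLE[codon]
        if amino == "*":
            break                    # end of file marker
        protein.append(amino)
    return "".join(protein)


gene = "ATGGCCAAATGGTAA"             # a made-up five-codon gene ending in TAA
mrna = transcribe(gene)
print(mrna)                          # AUGGCCAAAUGGUAA
print(translate(mrna))               # MAKW
```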
The Bizarre Nature of Eukaryotic Access Methods
From an IT perspective, I have always marveled at the dramatic change in the data storage technology used by the eukaryotes, compared to the simple DNA loops of the prokaryotes. For me it has always seemed very reminiscent of the dramatic change in the data storage architecture of commercial software that took place when commercial software shifted from the batch processing of sequential files on magnetic tape to the interactive processing of data on disk drives using indexed access methods like ISAM, VSAM and relational databases.
The prokaryotes store their genes in one large loop of DNA and in a number of smaller loops of DNA called plasmids. The plasmids are easier to share with other bacteria than the whole main loop of bacterial DNA, and provide for a rudimentary form of sexual reproduction amongst prokaryotes when they exchange plasmids.
Figure 15 – Prokaryotes, like bacteria and archaea, store their DNA in simple loops like magnetic computer tape.
In contrast to the prokaryotes, in eukaryotic cells there are three general levels of chromatin organization:
1. DNA first wraps around histone proteins, like magnetic computer tape around little reels, forming nucleosomes.
2. Multiple nucleosomes wrap up into a 30 nm fiber consisting of compact nucleosome arrays.
3. When cells are dividing the 30 nm fibers fold up into the familiar chromosomes that can be seen in optical microscopes.
In eukaryotic cells the overall structure of chromatin depends upon the stage of its life cycle that a cell is in. Between cell divisions, known as the interphase, the chromatin is more loosely structured to allow RNA and DNA polymerases to easily transcribe and replicate the DNA. During the interphase, the DNA of genes that are “turned on” is more loosely packaged than is the DNA of the genes that have been “turned off”. This allows RNA polymerase enzymes to access the “turned on” DNA more easily and then transcribe it to an mRNA I/O buffer that can escape from the nucleus and be translated into a number of identical protein molecules by several ribosome read/write heads.
This very complex structure of DNA in eukaryotic cells, composed of chromosomes and chromatin, has always puzzled me. I can easily see how the organelles found within eukaryotic cells, like the mitochondria and chloroplasts, could have become incorporated into eukaryotic cells based upon the Endosymbiosis theory of Lynn Margulis, which holds that the organelles of eukaryotic cells are the remainders of invading parasitic prokaryotic cells that took up residence within proto-eukaryotic cells, and entered into a parasitic/symbiotic relationship with them, but where did the eukaryotic nucleus, chromosomes, and the complex structure of chromatin come from? Bacteria do not have histone proteins, but certain archaea do have histone proteins, so these complex structures in eukaryotic cells might have arisen from bacteria invading the cells of certain archaea. The archaea are known for their abilities to live under extreme conditions of heat and salinity, so the origin of histone proteins and chromatin might go back to a need to stabilize DNA under the extreme conditions that archaea are fond of.
For more on the origin and functions of histone proteins see: http://en.wikipedia.org/wiki/Histone
Figure 16 – Eukaryotic DNA is wrapped around histone proteins like magnetic computer tape wrapped around little reels, forming nucleosomes, and then is packed into chromatin fibers that are then wound up into chromosomes.
Figure 17 – Chromatin performs the functions of the tape racks of old and allows DNA to be highly compacted for storage and also allows for the controlled expression of genes by means of epigenetic factors in play. Each tape in a rack had an external label known as a volume serial number which identified the tape.
As of yet nobody really knows why the eukaryotes have such a complicated way of storing DNA, but my suspicion is that eukaryotic cells may have also essentially come up with a hierarchical indexing of DNA by storing genes in chromatin on chromosomes and by attaching proteins and other molecules to the compacted DNA, and by using other epigenetic techniques, to enhance or suppress transcription of genes into proteins. IT professionals always think of software in terms of a very large and complex network of logical operations, and not simply as a parts list of software components. In commercial software it is the network of logical operations that really counts and not the individual parts. Now that biologists have sequenced the genomes of many species and found that large numbers of them essentially contain the same genes, they are also beginning to realize that biological software cannot simply be understood in terms of a list of genes. After all, what really counts in a multicellular organism is the kinds of proteins each cell generates and the amounts of those proteins that it generates after it has differentiated into a unique cell type. And biologists are discovering that biological software uses many tricks to control which proteins are generated and in what amounts, so the old model of “one gene produces one protein, which generates one function” has been replaced by the understanding that there are many tricky positive and negative feedback loops in operation that come into play as the information encoded in genes eventually becomes fully-formed 3-D protein molecules. For example, genes in eukaryotic cells are composed of sections of DNA known as exons that code for sequences of amino acids. The exons are separated by sections of “junk” DNA known as introns. The exons can be spliced together in alternative patterns, allowing a single gene to code for more than one protein. 
Also, eukaryotic cells have much more “junk” DNA between genes than do the prokaryotes, and this “junk” DNA also gets transcribed to microRNA or miRNA. An average miRNA strand is about 22 bases long and it can base-pair with a complementary section of mRNA that has already been transcribed from a DNA gene. Once that happens the mRNA can no longer be read by a ribosome read/write head and the strand of mRNA becomes silenced and eventually degrades without ever producing a protein. The human genome contains more than 1,000 miRNAs that can target about 60% of human genes, so miRNAs play a significant role in regulating gene expression.
Figure 18 – Genes in eukaryotic cells are composed of sections of DNA known as exons that code for sequences of amino acids. The exons are separated by sections of “junk” DNA known as introns. The exons can be spliced together in alternative patterns, allowing a single gene to code for more than one protein.
The key point is that the DNA access methods used by eukaryotic cells are probably just as important to eukaryotic cells as the content of the individual genes themselves, or perhaps even more important, because the network of genes probably contains as much information as the individual genes themselves.
The Role of Biological Access Methods in the Development of Cancer
Using the above concepts about the important role of access methods in commercial software, I would now like to present an IT perspective on the origin of cancer that is largely based upon the work of Professor Henry Heng of Wayne State University. This IT model for the origin of cancer proposes that cancer results from corrupted biological access methods acting upon the DNA of eukaryotic cells, as Professor Heng has convincingly demonstrated in a biological context.
In recent years, some cancer researchers, like Professor Heng, have forsaken the conventional gene-centric theory for the development of cancer for a more complicated theory that looks to the disruption of the entire genome of a cell as the ultimate culprit. This new theory of cancer essentially looks to the mutation of entire chromosomes, rather than the mutation of individual genes, as the root cause of cancer. The conventional theory for the development of cancer goes back about 40 years to President Nixon’s original “war on cancer”, and is very gene-centric. It arose out of the simple “one gene produces one protein, which produces one biological function” model that was so prevalent 40 years ago. For the origin of cancer this model was extended to the “mutation of one oncogene produces one deformed protein, which produces one malignant functional step along the path for a cell to become a full-blown cancerous tumor”. In this model, the slow progression of a cell from a normal state to a cancerous state is seen as a progressive series of mutations of oncogenes within the cell, until the cell is finally so burdened with deformed proteins that it begins to grow explosively and in an uncontrolled manner, and eventually metastasizes and spreads to other organs throughout the body. Over many decades, this model has had very limited success in finding treatments for cancer, and certainly has not led to a cure for cancer. A good discussion of this alternative model, based upon the corruption of the entire genome of a cell by means of large-scale alterations to the chromosomes within the cell, and the subsequent disruption of the entire network of feedback loops that regulate the expression of genes within the genome of a cell can be found in the Scientific American article at:
Chromosomal Chaos and Cancer
All IT professionals know that one of the best ways to fully understand the complex network of logical operations found within commercial software is to troubleshoot software when it gets sick and in trouble. And it seems that cancer researchers are also beginning to learn this same lesson by observing that cancer seems to arise out of failures in the complex networks of logical operations that are used by biological systems to turn the information in DNA into fully-formed functional 3D protein molecules. With this new theory of cancer, the old reductionist model of the “mutation of one oncogene produces one deformed protein, which produces one malignant functional step along the path for a cell to become a full-blown cancerous tumor” must be rethought in terms of the chaotic behavior of a very nonlinear network of interacting genes within the entire genome itself.
The Origin of Cancer in Commercial Software
Before proceeding further, we need to first examine how cancer arises within commercial software. Recall that cancer is the uncontrolled growth of deformed cells that form tumors that eventually spread throughout the entire body of a multicellular organism and eventually kill it. Multicellular organization in commercial software is based upon the use of object-oriented programming languages. Object-oriented programming actually began in 1962, but it did not catch on at first. In the late 1980s, the use of the very first significant object-oriented programming language, known as C++, started to appear in corporate IT, but object-oriented programming really did not become significant in IT until 1995 when both Java and the Internet Revolution arrived at the same time. The key idea in object-oriented programming is naturally the concept of an object. An object is simply a cell. Object-oriented languages use the concept of a Class, which is a set of instructions to build an object (cell) of a particular cell type in the memory of a computer. Depending upon whom you cite, there are several hundred cell types in the human body, but in IT we generally use many thousands of cell types or Classes in commercial software. For a brief overview of these concepts go to the webpage below and follow all of the links by clicking on them.
Lesson: Object-Oriented Programming Concepts
A Class defines the data that an object stores in memory and also the methods that operate upon the object data. Remember, an object is simply a cell. Methods are like biochemical pathways that consist of many lines of code. A public method is a biochemical pathway that can be invoked by sending a message to a particular object, like using a ligand molecule secreted from one object to bind to the membrane receptors on another object. This binding of a ligand to a public method of an object can then trigger a cascade of private methods within an object or cell.
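As a minimal sketch of the analogy above, the Java code below defines one hypothetical cell type. The class name `Cell`, its fields, and its method names are all assumptions made up for illustration and do not come from any real application. The public method `receive` plays the role of a membrane receptor, and the private methods are the internal cascade that it triggers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "cell type": every name here is an illustrative assumption.
class Cell {
    private final String cellType;                       // object data held in memory
    private final List<String> log = new ArrayList<>();  // record of the pathways that ran

    Cell(String cellType) {
        this.cellType = cellType;
    }

    // Public method: the "membrane receptor" that other objects can send a message to.
    public List<String> receive(String ligand) {
        log.add(cellType + " bound " + ligand);
        transduceSignal(ligand);          // start the internal cascade
        return new ArrayList<>(log);      // hand back a copy of the record
    }

    // Private methods: internal pathways that outside objects cannot invoke directly.
    private void transduceSignal(String ligand) {
        log.add("transduced " + ligand);
        expressProtein();
    }

    private void expressProtein() {
        log.add("expressed protein");
    }
}

class CellDemo {
    public static void main(String[] args) {
        Cell liverCell = new Cell("hepatocyte");  // build one object (cell) from the Class
        for (String step : liverCell.receive("insulin")) {
            System.out.println(step);
        }
    }
}
```

An outside object can only call `receive`; the two private methods run as a consequence of that single message, which is the cascade described above.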
Now when a high-volume corporate website, consisting of many millions of lines of code running on hundreds of servers, starts up and begins taking traffic, billions of objects (cells) are instantiated in the memory of the servers. These objects then begin exchanging messages with each other to invoke methods (internal biochemical pathways) that are executed within the objects. This creates a very complex nonlinear network of interacting objects that can only be understood as a complex nonlinear dynamical system, with many strange emergent behaviors that cannot be understood by simply looking at the millions of lines of code in isolation. When troubleshooting a sick corporate website, we have to analyze the interactions of the billions of objects as a whole, and not focus on any particular line of code at first. Only later, if we do finally determine the root cause of a problem, and about 50% of the time we never do find the root cause of a problem, can we make code changes to fix the problem. Fixing a single problem might require changing lots of code for lots of objects. This is IT’s greatest challenge with the large-scale multicellular applications that have come with the SOA (Service Oriented Architecture) revolution. And as of yet, we are not very good at maintaining homeostasis of these complex networks of objects. For more about SOA please see:
Commercial software suffers from two major forms of disease. The first type of software disease can usually be understood in reductionist terms, like the conventional theory of cancer, while the second type of software disease cannot. The first type occurs when a new release of software is installed. Like many corporations, my current employer has two major Datacenters that are identical twins, essentially running the same hardware and software. Each Datacenter is sized so that it can handle our peak load which normally occurs around 10:00 AM CT each day. So we do installs of new software at night when our processing load is much lower than during the day, and perhaps a team of 10 people will be involved in a major install which might run for 6 – 10 hours. We first move all of the traffic to one of the twin Datacenters, and then we install the new software into the Datacenter that is not taking any traffic. A team of validators will then run the new software through its paces to ensure that it was installed properly and that no last-minute bugs pop up. We then move all of the traffic to the Datacenter running the new software and watch for problems for about an hour. This is when the first type of software disease can arise. Under the load from many thousands of concurrent users, bugs can pop up in the new software. If that happens, we just move all of the traffic back to the Datacenter running the old software and back out all of the new software in the sick Datacenter. However, if no problems arise after an hour or so, we reverse the whole process and install the new software into the other Datacenter too. When an install does fail, we all do an autopsy on the Datacenter that was running the new software, and by using the data from monitoring tools and log files we can usually find the bugs in the thousands of lines of code that were changed with the install. 
Reductionism works very well with this type of software disease, and IT usually can find a “root cause” for the problem. Cancer researchers using the conventional model for cancer have been trying to use a similar reductionist approach to finding the root cause for cancer for about 40 years. For them, it is just a matter of finding a few DNA bugs in the genes of the afflicted.
The second type of software disease is much more difficult to handle and occurs spontaneously between major installs of new software. We will be running the same software on the same hardware in both Datacenters for several days with each Datacenter taking about 50% of the load. Then all of a sudden, a large number of software components on our monitoring console for one of the Datacenters will go from Green to Yellow to Red, signaling a website outage in one of the Datacenters, and hundreds of problem tickets will come flooding into the MidOps problem queue. Our Command Center will also see the Red lights on the affected Datacenter and will promptly move all traffic to the other Datacenter because a major corporation can lose many thousands or perhaps even many hundreds of thousands of dollars every second that commercial software under heavy load goes down. A website outage conference call will then be convened, and the programmers in Applications Development and Operations people in MidOps, UnixOps, Oracle DBA, NetOps and SANOps will join to try to figure out why one of the two twin Datacenters got sick. This is very perplexing because both Datacenters are supposed to be running with the very same software and hardware and also 50% of the load. IT management wants to know the “root cause” of the problem, so everybody starts to dig through the log files and the historical displays of our monitoring tools looking for the “root cause”. About 50% of the time we do find a “root cause” which usually is the result of a “little” install from the previous night. But about 50% of the time we do not find a “root cause”. Instead, we simply restart all of the affected software components. The restart kills all of the sick software objects and restores the Datacenter to good health, so that traffic can be once again directed to the Datacenter.
One of the most common forms of this second type of software disease is called an OOM (Out Of Memory) condition. As I mentioned previously, there are several billion objects (cells) running at any given time within the middleware (mesoderm) tissues of a major corporation. These objects are created and destroyed as users log in and then later leave a corporate website. These objects reside in JVMs (Java Virtual Machines) and the JVMs run a “garbage collection” task every few minutes to release the memory used by the “dead” objects so that new “live” objects can be created. Many times, for seemingly unknown reasons, objects in the JVMs refuse to die, and begin to proliferate in a neoplastic and uncontrolled manner, similar to the cells in a cancerous tumor, until the JVM finally runs out of memory and can no longer create new objects. The JVM essentially dies at this point and generates a heap dump file. MidOps has a tool that allows us to look at the heap dump of the JVM that died. The tool is much like the microscope that my wife used to look at the frozen and permanent sections of a biopsy sample when she was a practicing pathologist. The heap dump will show us information about the tens of millions of objects that were in the JVM at the time of its death. Object counts in the heap dump will show us which objects metastasized, but will not tell us why they did so. So after a lot of analysis by a lot of people, nobody can really figure out why the OOM event happened and that does not make IT management happy. IT management always wants to know what the “root cause” of the problem was so that we can remove it. I keep trying to tell them that it is like trying to find the “root cause” of a thunderstorm!
Yes, we can track the motions of large bodies of warm and cold air intermingling over the Midwest, but we cannot find the “root cause” of a particular thunderstorm over a particular suburb of Chicago because the thunderstorm is an emergent behavior of a complex nonlinear system, just as an OOM event is an emergent behavior of a complex nonlinear network of software objects. See Software Chaos for more details.
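To make the idea of objects that "refuse to die" concrete, here is a small Java sketch of one well-known way it can happen: a long-lived collection keeps a reference to every session object, so the garbage collector can never reclaim them. This is only an assumed illustration of one possible mechanism, not a diagnosis of the outages described above, and the names `SessionRegistry`, `LeakDemo`, and `simulate` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry of user sessions; all names are illustrative assumptions.
class SessionRegistry {
    private final Map<Integer, byte[]> sessions = new HashMap<>();
    private final boolean removeOnLogout;

    SessionRegistry(boolean removeOnLogout) {
        this.removeOnLogout = removeOnLogout;
    }

    void login(int userId) {
        sessions.put(userId, new byte[1024]);  // 1 KB of session data per user
    }

    void logout(int userId) {
        if (removeOnLogout) {
            sessions.remove(userId);  // the session becomes unreachable and can be collected
        }
        // Otherwise the map still references the session, so it stays "alive"
        // even though the user has left the website.
    }

    int retainedSessions() {
        return sessions.size();
    }
}

class LeakDemo {
    // Logs the given number of users in and out, then reports how many sessions remain.
    static int simulate(boolean removeOnLogout, int users) {
        SessionRegistry registry = new SessionRegistry(removeOnLogout);
        for (int id = 0; id < users; id++) {
            registry.login(id);
            registry.logout(id);
        }
        return registry.retainedSessions();
    }

    public static void main(String[] args) {
        System.out.println("leaky registry retains " + simulate(false, 10_000) + " sessions");
        System.out.println("fixed registry retains " + simulate(true, 10_000) + " sessions");
    }
}
```

In a heap dump, the leaky case would show up as a very large count of retained session objects, which tells you which objects proliferated but not necessarily why the code path that should have released them was skipped. On HotSpot JVMs, a heap dump at the moment of an OOM is typically enabled with the `-XX:+HeapDumpOnOutOfMemoryError` option.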
There are many potential causes for an OOM event, but one possible cause is “data corruption”, which can also lead to a great many other very serious IT diseases. I would estimate that about 10% of website outages result from data corruption, and unlike other website problems which might just lead to “slowness”, data corruption problems are generally catastrophic in nature, creating a Code Red hard down situation. The funny thing about data corruption is that the data is usually still someplace on the disk drives, but you can no longer get to the data properly, or instead of getting to the data that you need, you get the wrong or garbled data because the mechanisms controlling the access methods used to get to the data are corrupted. Similarly, with chromosomal chaos, it appears that the genes or “data” are still someplace on the chromosomes of a cell, but the cell can no longer get to the data properly, and that leads to malignancy. When we have a data corruption problem with one of our databases, our DBAs (Database Administrators) go through a data recovery process to restore the database. Essentially, they put the genes back onto the chromosomes in their proper locations based upon database checkpoints and transaction logs. However, even small businesses and individual PC users can experience the joys of data corruption, like losing a Ph.D. thesis on a corrupted disk drive. In that case, you can mail your disk drive to a company that will try to recover the data from the corrupted disk drive. I found this website that describes such a company and the services that it provides:
Secure Data Recovery
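The kind of data corruption described above, where the data is still on the disk but the access path to it is damaged, can be sketched in a few lines of Java. This is a toy model under stated assumptions: the class `TinyStore`, its in-memory "disk", and its `checkpoint` and `restore` methods are hypothetical, and a real database recovery using checkpoints and transaction logs is far more involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy store: raw bytes on a "disk" plus an index saying where each record lives.
// All names are illustrative assumptions; bounds checking is omitted for brevity.
class TinyStore {
    private final byte[] disk = new byte[64];
    private final Map<String, int[]> index = new HashMap<>();  // key -> {offset, length}
    private int nextFree = 0;

    void put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(bytes, 0, disk, nextFree, bytes.length);
        index.put(key, new int[] {nextFree, bytes.length});
        nextFree += bytes.length;
    }

    String get(String key) {
        int[] location = index.get(key);
        return new String(disk, location[0], location[1], StandardCharsets.UTF_8);
    }

    // Saves a deep copy of the index, standing in for a database checkpoint.
    Map<String, int[]> checkpoint() {
        Map<String, int[]> copy = new HashMap<>();
        for (Map.Entry<String, int[]> entry : index.entrySet()) {
            copy.put(entry.getKey(), entry.getValue().clone());
        }
        return copy;
    }

    // Puts every record location back where the checkpoint says it belongs.
    void restore(Map<String, int[]> saved) {
        index.clear();
        for (Map.Entry<String, int[]> entry : saved.entrySet()) {
            index.put(entry.getKey(), entry.getValue().clone());
        }
    }

    // Simulates corruption of the access path only; the disk bytes are untouched.
    void corruptIndex(String key, int wrongOffset) {
        index.get(key)[0] = wrongOffset;
    }
}

class CorruptionDemo {
    public static void main(String[] args) {
        TinyStore store = new TinyStore();
        store.put("alice", "ACGT");
        store.put("bob", "TTAG");
        Map<String, int[]> saved = store.checkpoint();

        store.corruptIndex("bob", 2);
        System.out.println("after corruption: " + store.get("bob"));

        store.restore(saved);
        System.out.println("after recovery:   " + store.get("bob"));
    }
}
```

After the index entry is damaged, `get("bob")` returns bytes that straddle two records, even though the bytes on the "disk" never changed. Restoring the index from the saved copy makes the original record readable again.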
Professor Heng’s 4D-Genomics
I recently attended the 2014 winter session of Professor Dick Gordon’s Embryo Physics course which met every Wednesday afternoon at 4:00 PM CT in a Second Life virtual world session. For more on this very interesting ongoing course please see:
One of the very interesting presentations was given by Professor Henry Heng:
4D-Genomics: The genome dynamics and constraint in biology
Professor Heng is a cancer researcher at Wayne State University and runs the Heng Laboratory there:
Professor Heng’s 4D-Genomics model looks to the human genome as much more than simply a collection of 23,000 genes residing on some small percentage of the 3 billion base pairs of DNA strung out along the 6 feet of two-track DNA tape found within each human cell. Instead, he views the human genome as a complex 3D configuration of genes that operates as a complex nonlinear network over the 4th dimension of time. As with commercial data on a disk drive, in the 4D-Genomics model the location of a gene on a chromosome in the genome is very important. Professor Heng points out that most mammals, like mice and men, basically have the same set of genes, but that the location of those genes on chromosomes varies greatly from species to species. As we have seen, the location of a gene on a chromosome can play a significant role in determining how much of a particular protein, if any, a cell produces, due to the way the DNA is packed together. For a multicellular organism that is key, because it is the generated proteins that finally determine the cell type of each cell.
In the 4D-Genomics model, cancer originates in two stages. In the first stage, the chromosomes of eukaryotic cells under stress from carcinogens or other factors fragment into many little pieces and then recombine into new chromosomes, essentially creating many new cells of many new species. Because of the second law of thermodynamics, the odds of these new species of cells at the cellular level actually surviving are quite low, so most of them promptly die. However, natural selection does come into play at this point, as some of these new cell types do actually survive and replicate. Professor Heng calls this the Punctuated Phase, after Stephen Jay Gould’s punctuated equilibrium model of evolution. That is because the dramatic fragmentation and recombination of chromosomes can lead to very significant changes to the functional capabilities of the affected cells in a very short period of time, as is seen in the punctuated equilibrium model of evolution that contends that the slow evolution of species is punctuated by very dramatic changes to species over very short periods of time when a species is subjected to extreme environmental stress. Between these punctuated events there are very long periods of species stasis, during which species evolve very slowly because they are not under environmental stress. Professor Heng has assembled some very sophisticated experimental equipment that microscopically demonstrates the fragmentation and recombination of chromosomes in the laboratory. This is an example of the Chromosome Chaos referred to in the Scientific American article above. The next step in the 4D-Genomics model of cancer occurs after a crisis, like the administration of chemotherapy to the patient. The chemotherapy becomes a significant selection factor that wipes out most of the cancerous cells. 
However, a small subpopulation of the cancerous cells does survive the chemotherapy and then moves on to stage two, the Stepwise Phase of classical Darwinian evolution through small incremental changes. These surviving cancerous cells then slowly evolve through small incremental mutations to the genes of the massively reorganized chromosomes, and ultimately metastasize and spread throughout the body.
Because the 4D-Genomics model intimately ties the origin of cancer to evolutionary biology, it also explains much more than just the origin of cancer. The model can also be used to explain Stephen Jay Gould’s punctuated equilibrium model for macroevolution. The 4D-Genomics model proposes that speciation occurs when massive changes are made to the genome of an existing species at the chromosomal level. Once a new species does appear upon the scene with a new radically modified genome at the chromosomal level, the new species then slowly evolves by means of natural selection operating on small incremental changes to the genome through the mutation of individual genes. We certainly have seen the effects of the punctuated equilibrium model in the architectural evolution of commercial software. See An IT Perspective of the Cambrian Explosion for some examples.
The 4D-Genomics model also explains the long-running mystery of sexual reproduction. Sexual reproduction has always come with a very high level of inconvenience and cost, relative to simple asexual reproduction, so the biological mystery has always been why bother with it? Just tally up all of the inconvenience and costs associated with setting up a typical Saturday night date! Most textbooks state that sexual reproduction arose because it allowed for greater genetic diversity within a population because of chromosomal crossovers at meiosis. But Professor Heng points out that asexual reproduction is the dominant form of reproduction found in harsh environments where genetic diversity is of greater value and that asexual organisms actually display much more genetic diversity than sexual organisms. Based upon these observations, the 4D-Genomics model contends that the purpose of sexual reproduction really is to limit genetic diversity and not to enhance it. In this view, sexual reproduction is a filter that preserves the structure of the entire genome and comes into play at meiosis, fertilization, and the early development of embryos. Basically, in the early stages of life, chromosomal abnormalities do not result in viable embryos. They either never form or they spontaneously abort. Unfortunately, the cells of the body do not go through this same sexual filter and cancerous cells with major chromosomal defects can survive with the support of the surrounding healthy cells and eventually progress to a cancerous state. In this view, cancer is just an unfortunate byproduct of being a complex multicellular organism that uses sexual reproduction, which lends credence to the old joke that life is a sexually transmitted terminal disease!
The key finding of the 4D-Genomics model, for both biology and commercial software, is that the network of interacting parts contains as much information as, or perhaps even more information than, the individual parts themselves. For biology this is applicable to the complex network of genes in a genome, and for commercial software it applies to the complex network of service and consumer objects in SOA-based applications and the complex underlying database models that they run upon. I believe there is great potential for both biologists and IT professionals to explore the behaviors of these complex networks together.
Comments are welcome at firstname.lastname@example.org
To see all posts on softwarephysics in reverse order go to: