Tuesday, November 12, 2019

WGD - Whole Genome Duplication
How Carbon-Based Life Installs a New Major Release into Production

Writing and maintaining software is very difficult because so much can go wrong. As we saw in The Fundamental Problem of Software, this is largely due to the second law of thermodynamics introducing small bugs into software whenever it is changed, and to the nonlinear nature of software that allows small bugs to frequently produce catastrophic effects. That is why, in Facilitated Variation and the Utilization of Reusable Code by Carbon-Based Life, we saw that most new computer or biological software is not written from scratch. Instead, most new software is simply a form of reusable code that has been slightly "tweaked" to produce new functionality. In her Royal Institution presentation:

Copy number variation and the secret of life
https://www.youtube.com/watch?v=BJm5jHhJNBI&t=1s

Professor Aoife McLysaght explains how carbon-based life uses this same technique to produce new biological functionality by duplicating genes. The website for Professor McLysaght's lab is located at:

Aoife McLysaght Molecular Evolution Lab
http://www.gen.tcd.ie/molevol/

Duplicating a gene allows one of the two copies to continue producing the protein encoded by the gene at normal levels, while the other copy is free to slightly mutate into a new form that might produce an enhanced protein or an additional protein with a new biological function. It is the golden rule of wing-walking in action - don't let go of something until you have hold of something else. In other words, if a single-copy gene mutates in isolation, it will most likely produce a protein that no longer works, which can be detrimental, or possibly even fatal, for the organism.

Figure 1 – Above we see a gene with four functions. Once the gene has been duplicated, it is possible for the copy of the gene to evolve by divergence. In the first case, we see Subfunctionalization where some of the gene's code disappears from each chromosome of descendants. In the second case, we see Neofunctionalization where the gene on the copied chromosome is free to mutate by changing some genetic code and dropping other genetic code. In the last case, we see the total loss of the copied gene.
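
For readers who like to see an idea as code, below is a toy Python simulation of the duplicate-and-diverge process sketched in Figure 1. It is a deliberately crude sketch, not a population-genetics model; the four functions, the mutation rates and the viability rule (the pair of copies must between them still cover every ancestral function) are all invented for illustration.

import random

random.seed(1)

# Toy model: a gene is just the set of functions it performs (Figure 1 shows four).
ANCESTRAL = frozenset({"F1", "F2", "F3", "F4"})

def mutate(gene, loss_rate=0.2, gain_rate=0.05):
    """One generation of drift: each function may be lost and a new one may appear."""
    kept = {f for f in gene if random.random() > loss_rate}
    if random.random() < gain_rate:
        kept.add("F_new")
    return kept

def classify(copy_a, copy_b):
    """Label the outcome of divergence in the spirit of Figure 1."""
    if "F_new" in (copy_a | copy_b):
        return "neofunctionalization"
    if not copy_a or not copy_b:
        return "loss of one copy"
    if copy_a != ANCESTRAL or copy_b != ANCESTRAL:
        return "subfunctionalization"
    return "both copies unchanged"

def simulate(generations=100):
    copy_a, copy_b = set(ANCESTRAL), set(ANCESTRAL)
    for _ in range(generations):
        a, b = mutate(copy_a), mutate(copy_b)
        # The wing-walking rule: between them, the two copies must still cover
        # every ancestral function, or the mutation is not tolerated.
        if ANCESTRAL <= (a | b):
            copy_a, copy_b = a, b
    return copy_a, copy_b, classify(copy_a, copy_b)

print(simulate())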

All computer users know the importance of keeping backup copies of files around before messing with them in case a drastic mistake is made. Nowadays, most people keep backup copies on the Cloud with Microsoft or Google.

Professor McLysaght then explains that gene duplication can be classified into two broad categories:

SSD - Small Scale Duplication
WGD - Whole Genome Duplication

In SSD, one gene or a small group of genes is accidentally duplicated elsewhere on the same chromosome or on a different chromosome when DNA is copied. On the other hand, with WGD the entire genome is accidentally duplicated by essentially doubling the number of chromosomes in a cell. The trouble with SSD is that the duplicated gene or genes will at first most likely produce more of the encoded proteins than usual. In fact, all things being equal, nearly twice as much of each encoded protein will be produced at first. This is called the "dosage" problem. You see, doubling the production level of a given protein can cause problems. The processing logic carried out by proteins is quite complex. Some proteins are used to build physical structures, like the keratin in our hair, fingernails and skin, while other proteins, like the hemoglobin in our blood, transport oxygen through the body. Other proteins take on a control function by catalyzing biochemical reactions or even by amplifying or inhibiting the expression of other genes. So changing the relative dosage levels of a protein or a group of proteins by means of SSD can be quite dangerous. However, this problem is averted if the entire genome of an organism is duplicated by means of WGD. With WGD the number of all the genes is doubled, and so the relative dosage levels of all the generated proteins should remain the same. Now, with one complete set of genes carrying the load of protein production, the other set of genes is free to mutate or even disappear. The significance of WGD gene duplications in the evolutionary history of vertebrates was first proposed by Susumu Ohno in 1970 in his book Evolution by Gene Duplication.
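
The dosage argument is really just arithmetic, and a few lines of Python make it explicit. This is a minimal sketch that assumes protein output is directly proportional to gene copy number, which is of course a huge simplification of real gene regulation; the gene names are placeholders.

# A toy illustration of the "dosage" argument, assuming protein output is simply
# proportional to gene copy number - a deliberate oversimplification.
baseline = {"geneA": 1, "geneB": 1, "geneC": 1}            # copy numbers before duplication
ssd      = {"geneA": 2, "geneB": 1, "geneC": 1}            # SSD: only geneA is duplicated
wgd      = {gene: 2 * n for gene, n in baseline.items()}   # WGD: every gene is duplicated

def relative_dosage(copies):
    """Express each gene's output as a fraction of the total protein output."""
    total = sum(copies.values())
    return {gene: round(n / total, 3) for gene, n in copies.items()}

print("baseline:", relative_dosage(baseline))   # geneA gets 1/3 of the total
print("SSD     :", relative_dosage(ssd))        # geneA's share jumps to 1/2
print("WGD     :", relative_dosage(wgd))        # identical to the baseline ratios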

Figure 2 – Whole Genome Duplication (WGD) was first proposed by Susumu Ohno in 1970.

Figure 3 – Here we see the difference between SSD and WGD gene duplication.

Since then, bioinformatics has overwhelmingly confirmed the key role of gene duplication in molecular evolution by comparing the genomes of many species at the level of their DNA sequences. In fact, the term "ohnolog" has been coined, in honor of Ohno, to describe gene duplicates that have survived since a WGD event.

Another good resource for exploring the impact of WGD events in the evolutionary history of carbon-based life is Dr. Hervé Isambert's lab at:

The Isambert Lab
Reconstruction, Analysis and Evolution of Biological Networks
Institut Curie, Paris
http://kinefold.curie.fr/isambertlab/

Among many other resources, the Isambert Lab has been working on the OHNOLOGS database, which currently allows users to explore the genes retained from WGD events in 27 vertebrate genomes. It is available at:

OHNOLOGS - A Repository of Genes Retained from Whole Genome Duplications in the Vertebrate Genomes
http://ohnologs.curie.fr/

Figure 4 – Above is a figure from the Isambert Lab that displays a multitude of WGD events in the evolutionary history of carbon-based life.

Figure 5 – Above is a figure that displays a multitude of WGD events specifically in the evolutionary history of carbon-based plant life.

Further Confirmation of WGD From the Evolution of Computer Software
Softwarephysics maintains that both carbon-based life and computer software have converged upon many of the same solutions to shared data processing problems as they both learned to deal with the second law of thermodynamics in a nonlinear Universe. This should come as no surprise since both carbon-based life and computer software are simply forms of self-replicating information facing the common problems of survival. For more on that please see A Brief History of Self-Replicating Information. For more details on the evolutionary history of software see the SoftwarePaleontology section of SoftwareBiology. So it should not be surprising that those doing the development and maintenance of computer software have also discovered the advantages of taking a WGD approach. All IT professionals should be quite familiar with the steps used to move new code into Production, but for those non-IT readers, let me briefly explain the process. Hopefully, you will be able to see WGD techniques being used in a number of places.

Software Change Management Procedures
Software Change Management arose in the IT departments of major corporations in the 1980s. Prior to the arrival of Change Management processes, corporate IT programmers simply wrote and tested their own software changes in private libraries on the same hardware that ran Production software. When it was time to install the changed software into Production, we simply filled out a ticket to have Data Management move the updated software files from our personal libraries to the Production libraries. Once that was done, we could validate the software in the Production libraries with a test batch run before the next scheduled Production run of the batch job. This worked just fine until Production software evolved from batch jobs to online processing by corporate end-users in the early 1980s, and especially when external end-users began to interactively use Production software in the 1990s. For example, when I retired in December of 2016, I was in the Middleware Operations group for a major credit card company. All installs were done late at night and in the very early morning hours of our daily Change Window. For an example of a complex software infrastructure supporting a high-volume corporate website please see Software Embryogenesis. Usually, we did about 20 installs each night to cover bug fixes and minor software enhancements. Every change was done under an approved Change Ticket that had an attached install plan listing all of the items to be installed and the step-by-step timing of the install. Each install plan also had steps to validate the install and to back out the install if problems occurred.
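
For non-IT readers, a small Python sketch may help make a Change Ticket concrete. Everything here - the ticket number, the fields and the sample steps - is invented for illustration; real Change Management systems record far more detail, but the essential structure is an install plan paired with validation and backout plans.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    time: str      # scheduled time within the Change Window, e.g. "01:15"
    action: str    # what is to be done at that time

@dataclass
class ChangeTicket:
    ticket_id: str
    description: str
    install_plan: List[Step] = field(default_factory=list)
    validation_plan: List[Step] = field(default_factory=list)
    backout_plan: List[Step] = field(default_factory=list)
    approved: bool = False

# A hypothetical ticket for one of the nightly installs described above.
ticket = ChangeTicket(
    ticket_id="CHG0001234",
    description="Bug fix for the payment-status page",
    install_plan=[Step("01:15", "Move application traffic to the second datacenter"),
                  Step("01:30", "Deploy the new code to the first datacenter and restart")],
    validation_plan=[Step("02:00", "Validators run smoke tests against the first datacenter")],
    backout_plan=[Step("02:30", "Redeploy the previous code and restart if validation fails")],
    approved=True,
)
print(ticket.ticket_id, "-", len(ticket.install_plan), "install steps")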

We ran all the Production software in two separate datacenters that were several hundred miles apart. Each datacenter had several hundred Unix servers and ran the exact same Production software. The hardware and software in each datacenter were sized so that either one could handle our peak processing load during the middle of the day. Usually, both datacenters would be in Active Mode, each taking about half of the total Production processing load. If something horrible happened in one datacenter, the Command Center could shift our entire Production processing load to the other datacenter. So during the Change Window for a particular Change Ticket, the Command Center would first move all traffic for the application being changed to the second datacenter. We would then install the new software into the first datacenter and crank it up. Professional validators would then run the new software through a set of validation tests to make sure the software was behaving properly. Then, the Command Center would shut down traffic to the application in the second datacenter to force all traffic to the first datacenter that was running the new software. We would then let the new software in the first datacenter "burn in" for about 30 minutes of live traffic from real end-users on the Internet. If anything went wrong, the Command Center would move all of the application traffic back to the second datacenter that was still running the old software. We would then back out the new software in the first datacenter and replace it with the old software following the backout plan for the Change Ticket. But if the "burn-in" went well, we would reverse the whole process of traffic flips between the two datacenters to install the new software in the second datacenter.

However, if something went wrong the next day with the new software under peak load and an outage resulted, the Command Center would convene a conference call, and perhaps 10 people from Applications Development, Middleware Operations, Unix Operations, Network Operations and Database Operations would be paged out and would join the call. The members of the outage call would then troubleshoot the problem in their own areas of expertise to figure out what went wrong. The installation of new code was naturally always our first suspicion. If doing things like restarting the new software did not help, and all other possibilities were eliminated as much as possible, the members of the outage call would come to the decision that the new software was the likely problem, and the new software would be backed out using the Change Ticket backout plan. I hope that you can see how running Production in two separate datacenters that are hundreds of miles apart takes full advantage of the same WGD technique that carbon-based life uses to keep its own "Production" up and running at all times during routine maintenance. However, biologists have also discovered that in the evolutionary history of carbon-based life, WGD played a critical role in the rise of new species, so let us take a look at that from an IT perspective.
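
The nightly two-datacenter flip can be sketched in a few lines of Python. This is only a schematic of the procedure described above; the helper functions are hypothetical stubs standing in for the Command Center's traffic-management tools, the install scripts, the validators' tests and the monitoring used during the burn-in.

import time

# Stubs that just print what would happen; all of them are assumptions for this sketch.
def route_all_traffic_to(dc):
    print(f"Command Center: routing 100% of live traffic to {dc}")

def deploy(change, dc):
    print(f"Installing {change} in {dc}")

def validate(change, dc):
    print(f"Validators testing {change} in {dc}")
    return True

def healthy(dc):
    print(f"Monitoring {dc} during the burn-in")
    return True

def back_out(change, dc):
    print(f"Backing out {change} from {dc} per the Change Ticket backout plan")

def flip_and_burn_in(change, dc_new, dc_old, burn_in_minutes=30):
    """Install 'change' in dc_new while dc_old carries the load, then burn it in."""
    route_all_traffic_to(dc_old)            # take dc_new out of service
    deploy(change, dc_new)
    if not validate(change, dc_new):
        back_out(change, dc_new)
        return False
    route_all_traffic_to(dc_new)            # force live end-user traffic onto the new code
    time.sleep(burn_in_minutes * 60)        # about 30 minutes of real traffic
    if not healthy(dc_new):
        route_all_traffic_to(dc_old)        # the old software is still running there
        back_out(change, dc_new)
        return False
    return True

# Do the first datacenter; only if its burn-in succeeds, repeat for the second.
# burn_in_minutes=0 lets the sketch finish instantly; use 30 for a real Change Window.
if flip_and_burn_in("CHG0001234", "DC1", "DC2", burn_in_minutes=0):
    flip_and_burn_in("CHG0001234", "DC2", "DC1", burn_in_minutes=0)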

The Role of WGD Technology in the Implementation of New Species and Major Code Releases
In the above discussion, I explained how the standard Change Management processes were used for the routine changes that Middleware Operations made on a daily basis. However, every few months we conducted a major code release. This was very much like implementing a new species in biology. For a major code release, all normal daily Change Tickets were suspended so that full attention could be focused on the release. Applications Development appointed a Release Coordinator for the software release, and perhaps 30 - 60 Change Tickets would be generated for it. Each Change Ticket in the major code release had its own detailed installation plan, but the Release Coordinator would also provide a detailed installation and backout plan covering all of the Change Tickets associated with the release. From an IT perspective, a major code release is like creating a new species. It is like moving from Windows 8.0 to Windows 10.0. The problem with a large major code release is that it cannot all be done in a single standard Change Window during the night and early morning hours. Instead, an extended Change Window must be approved by IT Management that extends into the next day and might complete around 3:00 or 4:00 PM. The basic idea was to totally complete the new major code release in the first datacenter during the early morning hours of the standard Change Window. Once that was done, live traffic was slowly transferred to the first datacenter. For example, initially, only 10% of the live traffic was transferred to the first datacenter running the new major code release. After about an hour of "burning in" the new code, the traffic level in the first datacenter was raised to 30% for about 30 minutes. If all went well, the load level on the first datacenter was raised to 80% for 30 minutes. Finally, 100% of the traffic was transferred to the first datacenter for about 30 minutes for a final "burn-in". After that, the whole install team shifted work to the second datacenter.

The most significant danger was that even though the first datacenter had run 100% of the traffic for about 30 minutes, it did so during an early part of the day when the total processing load was rather low. The worst thing that could happen would be for the first datacenter, running 100% of the Production load on the new major code release, to get into trouble when the peak load hit around 10:00 AM. Should that happen, we would be in the horrible situation where the second datacenter was unusable because it was halfway through the major code release, and the first datacenter was experiencing problems due to the major code release. Such a situation could bring an entire corporate website down into a "hard down" condition. A "hard down" condition can cost thousands to millions of dollars per second depending on the business being conducted by the software. Such a state of affairs needs to be avoided at all costs, and to do that, IT relies heavily on the WGD technique.
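
As a Python sketch, the gradual traffic ramp used for a major code release looks like this. The percentages and hold times come from the schedule described above; the traffic-split and health-check functions are hypothetical stubs for the Command Center's real tooling.

import time

# The ramp schedule from the release procedure described above:
# (percentage of live traffic sent to the upgraded datacenter, minutes to hold).
RAMP_SCHEDULE = [(10, 60), (30, 30), (80, 30), (100, 30)]

def set_traffic_split(dc, percent):
    """Stub standing in for the Command Center's traffic-management tooling."""
    print(f"Routing {percent}% of live traffic to {dc}")

def datacenter_healthy(dc):
    """Stub health check; a real one would watch error rates and response times."""
    return True

def ramp_traffic(dc_new, dc_old, minutes_scale=1.0):
    """Gradually shift load onto dc_new, falling back to dc_old at the first sign of trouble."""
    for percent, minutes in RAMP_SCHEDULE:
        set_traffic_split(dc_new, percent)
        time.sleep(minutes * 60 * minutes_scale)
        if not datacenter_healthy(dc_new):
            set_traffic_split(dc_new, 0)          # send everything back to the old release
            print(f"Ramp aborted; all traffic returned to {dc_old}")
            return False
    return True

# minutes_scale=0 lets the sketch run instantly; use 1.0 for the real schedule.
if ramp_traffic("DC1", "DC2", minutes_scale=0):
    print("DC1 is carrying 100% of the load on the new release; begin upgrading DC2")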

First, there are three separate software environments running the current software genome:

Production
Production is the sacred software environment in which no changes are allowed to be made without a Production Change Ticket that has been approved by all layers of IT Management. Production software is sacred because Production software runs the business and is the software that all internal and external users interact with. If Production software fails, it can cost a business or governmental agency thousands or millions of dollars each second! That is why all IT professionals are deathly afraid of messing up Production and, therefore, follow all of the necessary Change Management processes to avoid doing so. I personally know of very talented IT professionals who were summarily fired for making unauthorized changes to Production.

Production Assurance
Production Assurance is the environment set up by IT Management to mimic the Production environment as closely as possible. It is usually a scaled-down version of Production that does not take Production load. Production Assurance is like a wind tunnel that allows new software to experience the trials and tribulations of Production using a scaled-down model instead of the "real thing". Production Assurance is where all of the heavy-duty software testing takes place, performed by professional Production Assurance testers. The IT professionals in Applications Development who write the new code do not do testing in Production Assurance. Once software has been exhaustively tested in Production Assurance, it is ready to move to Production with a scheduled Change Ticket in a scheduled Change Window.

Development
The Development environment is where IT professionals in Applications Development program new code and perform unit and integration testing on it. Again, most new code is reusable code that has been "tweaked". Once all unit and integration testing has been completed on some new code, a Production Assurance Change Ticket is opened for Middleware Operations, Unix Operations, Database Operations and Network Operations to move the new software to Production Assurance for final system-wide testing.
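
Putting the three environments together, the promotion path can be sketched as a tiny Python pipeline. The gate functions below are hypothetical stand-ins for the real testing and approval steps; the point is simply that code only moves forward one gated environment at a time.

# A minimal sketch of the promotion path through the three environments described
# above. The gate functions are hypothetical stand-ins for unit and integration
# testing, Production Assurance testing and Change Ticket approval.

ENVIRONMENTS = ["Development", "Production Assurance", "Production"]

def unit_and_integration_tests_pass(release):
    return True     # performed by Applications Development in Development

def production_assurance_tests_pass(release):
    return True     # performed by professional Production Assurance testers

def change_ticket_approved(release):
    return True     # approved by all layers of IT Management

# The gate that must be passed before a release may enter each environment.
GATES = {
    "Production Assurance": unit_and_integration_tests_pass,
    "Production": lambda r: production_assurance_tests_pass(r) and change_ticket_approved(r),
}

def promote(release):
    """Move a release through the three environments, one gate at a time."""
    current = ENVIRONMENTS[0]
    for target in ENVIRONMENTS[1:]:
        if not GATES[target](release):
            print(f"{release} stays in {current}; the gate to {target} was not passed")
            return current
        print(f"{release} promoted from {current} to {target}")
        current = target
    return current

promote("Release 2019.11")   # hypothetical release name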

Conclusion
As you can see, IT has also discovered the benefits of the WGD techniques that carbon-based life developed to introduce new genes and new species into the biosphere. Not only do corporate IT departments generally run Production software in two separate Production datacenters, but they also use multiple WGD-like environments, such as Production, Production Assurance and Development, to produce new software functionality and, most importantly, to move new major releases into Production as a new species of software. Thus, as I suggested in How to Study the Origin of Life on the Earth and Elsewhere in the Universe Right Here at Home, I highly recommend that all researchers investigating the roles that WGD and SSD gene duplication played in the evolutionary history of carbon-based life spend a few months doing some fieldwork in the IT department of a major corporation or governmental agency.

Comments are welcome at scj333@sbcglobal.net

To see all posts on softwarephysics in reverse order go to:
https://softwarephysics.blogspot.com/

Regards,
Steve Johnston
