Wednesday, March 18, 2020

A Structured Code Review of the COVID-19 Virus

Many times an IT Application Development group will conduct a structured code review of the source code produced by one of the team members in order to ensure that coding standards are being followed and to offer suggestions that might make the source code more efficient or easier to support in the future by other developers. In this posting, I would like to do the same for the COVID-19 virus. COVID-19 is a parasitic form of RNA that demonstrates the vast powers of self-replicating information to quickly alter an entire planet. In many ways, COVID-19 has behaved like the numerous computer viruses that have taken down world-wide networks over the years. Perhaps those working to subdue COVID-19 might learn some valuable lessons from the IT community's struggle with controlling computer viruses. So a structured code review of COVID-19 is certainly in order.

At the heart of COVID-19 is an RNA molecule with about 29,903 bases that code for 5 genes. Recall that humans have about 3 billion base pairs of DNA that code for about 30,000 genes, so it is rather remarkable that such a small RNA program can take down a human program that is about 100,000 times larger. COVID-19 is also known as SARS-CoV-2 because it is very closely related to the SARS-CoV virus that caused the SARS epidemic that started in November of 2002 in the Chinese province of Guangdong. The SARS-CoV virus then spread to 29 other countries in 2003. It is thought that the COVID-19 and SARS-CoV viruses both originated in bats of the genus Rhinolophus and that those bats constitute a permanent reservoir for the viruses. It seems that the COVID-19 virus jumped from bats to humans in November of 2019.

IT Professionals use an IDE (Integrated Development Environment), like Eclipse, to navigate through, and work on, the thousands or millions of lines of source code needed to produce complex software. Thus, during a structured code review, it only makes sense that the Application Development team will use an IDE to review and step through the source code being examined.

Figure 1 – Above is a screenshot of the Eclipse IDE used by IT Professionals to work with software source code.

To perform a proper structured code review of the COVID-19 virus we will also need a similar type of IDE. Fortunately, the federal government of the United States provides an IDE for researchers called the NCBI (National Center for Biotechnology Information) Sequence Viewer. So let us use the NCBI Sequence Viewer to review the RNA source code of COVID-19. The Home page of the NCBI is at:

https://www.ncbi.nlm.nih.gov/

Figure 2 – Above is a screenshot for the Home Page of the NCBI (National Center for Biotechnology Information).

But before that, let's take a look at the COVID-19 executable itself. The COVID-19 executable is a spherical virus particle about 50 - 200 nanometers in diameter. The outer layer of COVID-19 is a viral envelope that is composed of a lipid bilayer stolen from the human host cell that it recently budded from. The reason that soap and water destroy COVID-19 is that they dissolve the lipid bilayer of the COVID-19 viral envelope. Embedded in the viral envelope are three structural proteins known as the S (spike), E (envelope) and M (membrane) proteins. COVID-19 has a fourth structural protein called the N (nucleocapsid) protein that holds the RNA molecule in place. The S spike protein is the protein that allows COVID-19 to attach to the membrane of a human host cell, fuse with it, and enter the host cell.

Figure 3 – Above is the COVID-19 executable.

Now pull up the NCBI Sequence Viewer for the COVID-19 virus at:

https://www.ncbi.nlm.nih.gov/nuccore/MN988668.1?report=graph

You can also pull up the documentation for the NCBI Sequence Viewer at:

https://www.ncbi.nlm.nih.gov/tools/sviewer/
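
If you would rather pull the same source code into a script than browse it in the viewer, the sketch below shows one way that might be done. It assumes the NCBI E-utilities efetch service and asks it for the MN988668.1 record as FASTA text. The function names (build_efetch_url, parse_fasta, fetch_sequence) are my own illustrative choices and are not part of any NCBI toolkit, and the download step needs network access.

```python
from typing import Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str) -> str:
    """Build an efetch URL that requests a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # the nucleotide database used by the viewer URL
        "id": accession,      # e.g. "MN988668.1"
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def parse_fasta(text: str) -> Tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("text does not look like a FASTA record")
    header = lines[0][1:]
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def fetch_sequence(accession: str) -> Tuple[str, str]:
    """Download one record from NCBI and parse it (requires network access)."""
    with urlopen(build_efetch_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling fetch_sequence("MN988668.1") should return the record header and one long string of A, C, G and T characters, which is the same dump that the Sequence Viewer displays graphically.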

At the very top of the NCBI Sequence Viewer is a ruler for the full 29,903 bases measured off in K-bases. Under the ruler in green, you will see the layout of the 5 genes of the COVID-19 virus - the orf1ab, S, E, M and N genes. The S, E, M and N genes are genes for structural proteins. The orf1ab gene codes for the proteins that help replicate the COVID-19 RNA molecule and package it into new virus particles. Below the gene layout is a toolbar for tools to manipulate the RNA display. The first drop-down on the left side of the toolbar has a "Full View" option that will let you expand the display to full-screen mode. Below the toolbar is a zoom window that lets you zoom into sections of the full RNA display. The range of the zoom window is defined by the range of the two blue arrows located on the top ruler. Below the green gene bars, you will find matching red bars that represent the protein generated by the gene.

Figure 4 – Above is a screenshot of the NCBI Sequence Viewer over a broad range of bases 18,000 - 29,880 for the COVID-19 virus, also known as 2019-nCoV.

If you click on the ATG button on the toolbar, the range on the zoom window will collapse to a very narrow range as shown in Figure 5. It will also show the detailed RNA bits (A, T, C and G) as well as the three-bit RNA bytes that code for individual amino acids. I am not sure why the software displays a DNA T base, instead of the RNA U base, to pair with A bases. Perhaps it is because the NCBI Sequence Viewer is mainly used to view DNA sequences and not RNA sequences. Just think of a U base whenever you see a T base on the Sequence Viewer.
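
The T-for-U substitution and the grouping of bases into three-base "bytes" can be sketched in a few lines of Python. This is only a minimal illustration. The helper names dna_to_rna and codons are hypothetical, and the short demo string is made up rather than copied from the COVID-19 genome.

```python
from typing import List


def dna_to_rna(sequence: str) -> str:
    """Swap every T for a U, as the text suggests doing mentally."""
    return sequence.upper().replace("T", "U")


def codons(sequence: str, start: int = 0) -> List[str]:
    """Cut a sequence into complete three-base groups, beginning at `start`.

    Any one or two leftover bases at the end are dropped.
    """
    usable_end = len(sequence) - (len(sequence) - start) % 3
    return [sequence[i:i + 3] for i in range(start, usable_end, 3)]


# Made-up demo string, not taken from the viral genome.
demo = dna_to_rna("ATGTTTGT")
print(demo)             # AUGUUUGU
print(codons(demo))     # ['AUG', 'UUU']
print(codons(demo, 1))  # ['UGU', 'UUG']
```

Notice that shifting the starting position by a single base changes every three-base group that follows, which is why the reading frame matters so much.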

Figure 5 – Above is a screenshot of the COVID-19 virus over a very narrow range (bases 22,710 - 22,760).

Like a modern IDE for source code, the NCBI Sequence Viewer has a huge number of features. As usual, the best way to learn what those features can do is to follow along with a tutorial and then try to use the features on your own. So have some fun playing with it! For example, Figure 6 displays a dump of the first few thousand RNA bases of COVID-19.

Figure 6 – The NCBI Sequence Viewer allows users to display a full dump of all 29,903 bases.

A core dump can be generated when a running program crashes. The core dump displays the contents of the computer memory for the program at the moment it crashed. Core dumps are no longer used as extensively for debugging as they were in the 1960s and 1970s. Back then, one frequently found core dump printouts that were several inches thick in the wastebaskets of datacenters. In fact, that is how I first got into software. I was taking a quantum mechanics course in my junior year at the University of Illinois at Urbana and having a great deal of difficulty with the very lengthy calculations. The calculations required many pages of standard notebook paper. Then, a buddy of mine in the class showed me that you could use the back of the very thick fan-folded core dump printouts for lengthy computations. So I then headed over to the DCL (Digital Computer Lab) to fetch some.

Figure 7 – A dump of all 29,903 RNA bases looks very much like a core dump of computer memory when a running program crashes.

The reason they are called core dumps is that, long ago, computer memory was built from little magnetic cores. Now computer memory is built from solid-state memory chips.

Figure 8 – Magnetic core memory arrived in 1955 and used a little ring of magnetic material, known as a core, to store a bit. Each little core had to be threaded by hand with 4 wires to store a single bit.

Figure 9 – Magnetic core memory was hugely expensive and took up a great deal of space within a computer. That is why a million-dollar mainframe in 1970 had about 1 MB of memory, about 64,000 times less memory than a 64 GB smartphone.

Once the RNA from COVID-19 enters a human host cell, it begins to act like a Turing Machine that replicates the COVID-19 RNA and the proteins that it produces. It then stuffs the COVID-19 RNA into new virus packages formed from the four structural proteins of COVID-19. Recall that a Turing Machine is composed of a read/write head and an infinitely long paper tape. A sequential series of 1s and 0s is stored on the paper tape, and the read/write head can move back and forth along the tape in a motion based on the 1s and 0s that it reads. The read/write head can also write 1s and 0s to the paper tape. In Turing’s original paper on the topic, he mathematically proved that such an arrangement could be used to encode any mathematical algorithm, like multiplying two very large numbers together and storing the result on the paper tape. In many ways, a ribosome in an invaded human cell is much like a Turing Machine: it reads the COVID-19 RNA and writes out the amino acids of a polypeptide chain that eventually folds up into an operational protein for the COVID-19 virus.
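
The read/write head and tape described above can be sketched as a tiny simulator. This is a minimal sketch under my own assumptions, not Turing's original formulation. The rule format, the state names and the run_turing_machine function are all hypothetical, and the example program adds 1 to a binary number rather than multiplying two large numbers, simply because the rule table is much shorter.

```python
def run_turing_machine(rules, tape, state="start", blank="_", max_steps=1000):
    """Run a one-tape Turing Machine until it reaches the 'halt' state.

    rules maps (state, symbol_read) -> (symbol_to_write, move, next_state),
    where move is "L" or "R". The tape is a dict, so it can grow without
    limit in either direction.
    """
    cells = dict(enumerate(tape))
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("machine did not halt within max_steps")
        symbol = cells.get(head, blank)
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
        steps += 1
    if not cells:
        return ""
    low, high = min(cells), max(cells)
    return "".join(cells.get(i, blank) for i in range(low, high + 1)).strip(blank)


# Hypothetical rule table: walk right to the end of the number,
# then add 1 while carrying back toward the left.
INCREMENT = {
    ("start", "0"): ("0", "R", "start"),
    ("start", "1"): ("1", "R", "start"),
    ("start", "_"): ("_", "L", "carry"),
    ("carry", "1"): ("0", "L", "carry"),
    ("carry", "0"): ("1", "L", "halt"),
    ("carry", "_"): ("1", "L", "halt"),
}

print(run_turing_machine(INCREMENT, "1011"))  # 1100
print(run_turing_machine(INCREMENT, "111"))   # 1000
```

The machine itself knows nothing about arithmetic. All of the behavior lives in the rule table, in much the same way that all of the behavior of COVID-19 lives in its RNA.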

Figure 10 – A Turing Machine has a read/write head and an infinitely long paper tape. The read/write head can read instructions that are encoded on the tape as a sequence of 1s and 0s, and it can write the results of following those instructions back to the tape as a sequence of 1s and 0s.

Figure 11 – A ribosome read/write head behaves much like the read/write head of a Turing Machine. The ribosome reads the RNA tape from COVID-19. The ribosome read/write head then reads the A, C, G, and U nucleobases that code for amino acids three at a time. As each 3-bit byte is read on the COVID-19 RNA tape, the ribosome writes out an amino acid to a growing polypeptide chain, as tRNA units from the host human cell bring in one amino acid at a time. The polypeptide chain then goes on to fold up into a 3-D COVID-19 protein molecule.
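
The ribosome's read loop can be sketched in the same spirit. The code below builds the standard genetic code table and then reads an RNA tape three bases at a time, writing out one amino acid letter per read until it hits a stop codon. It is a simplified sketch: the translate function is a hypothetical helper, the demo strings are made up, and a real ribosome does a great deal more, such as locating where to start reading.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
# Each letter is a one-letter amino acid code and "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str, start: int = 0) -> str:
    """Read the tape three bases at a time and write out amino acids.

    T is treated as U so that sequences copied from the Sequence Viewer
    work as-is. Reading stops at the first stop codon.
    """
    rna = rna.upper().replace("T", "U")
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up demo tapes, not taken from the viral genome.
print(translate("AUGUUUGUUUAA"))  # MFV
print(translate("ATGGCCTGA"))     # MA
```

Here the codon table plays the role of the tRNA units, since it is the lookup that pairs each three-base read with the amino acid to be written.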

Once the new COVID-19 virus particles are complete, they bud out of the human host cell or cause the human host cell to burst. In nearly all cases, the human host cell then dies.

The Vast Power of Self-Replicating Information
In Life in Postwar America After Our Stunning Defeat in the Great Cyberwar of 2016 and Cyber Civil Defense, I expressed my amazement at how easily some self-replicating information in the form of parasitic political memes, spread by our very own software, was able to defeat the United States of America in the very first recorded defeat of a world power by cyberwarfare. This brilliant Russian FSB cyberwarfare operation defeated the United States of America for pennies by merely exploiting the already existing deep divisions within the American political landscape. Again, for the most part, we defeated ourselves in 2016. The end result of the Great Cyberwar of 2016 was that the United States of America was left with a feckless leader, friendly to Russian interests, but incapable of much else beyond spending the last three years dismantling the government of the United States of America. So it should come as no surprise that, when all is said and done, it appears that the United States of America will most likely sustain the greatest amount of damage from the current world-wide COVID-19 pandemic. For example, the United States of America and South Korea both became aware of the threat of COVID-19 at the same time but took dramatically different actions to address the threat. Science matters.

For those of you new to softwarephysics, let me repeat the fundamental characteristics of self-replicating information.

Self-Replicating Information – Information that persists through time by making copies of itself or by enlisting the support of other things to ensure that copies of itself are made.

Over the past 4.56 billion years we have seen five waves of self-replicating information sweep across the surface of the Earth and totally rework the planet, as each new wave came to dominate the Earth:

1. Self-replicating autocatalytic metabolic pathways of organic molecules
2. RNA
3. DNA
4. Memes
5. Software

Software is currently the most recent wave of self-replicating information to arrive upon the scene and is rapidly becoming the dominant form of self-replicating information on the planet. For more on the above see A Brief History of Self-Replicating Information.

The Characteristics of Self-Replicating Information
All forms of self-replicating information have some common characteristics:

1. All self-replicating information evolves over time through the Darwinian processes of inheritance, innovation and natural selection, which endows self-replicating information with one telling characteristic – the ability to survive in a Universe dominated by the second law of thermodynamics and nonlinearity.

2. All self-replicating information begins spontaneously as a parasitic mutation that obtains energy, information and sometimes matter from a host.

3. With time, the parasitic self-replicating information takes on a symbiotic relationship with its host.

4. Eventually, the self-replicating information becomes one with its host through the symbiotic integration of the host and the self-replicating information.

5. Ultimately, the self-replicating information replaces its host as the dominant form of self-replicating information.

6. Most hosts are also forms of self-replicating information.

7. All self-replicating information has to be a little bit nasty in order to survive.

8. The defining characteristic of self-replicating information is the ability of self-replicating information to change the boundary conditions of its utility phase space in new and unpredictable ways by means of exapting current functions into new uses that change the size and shape of its particular utility phase space. See Enablement - the Definitive Characteristic of Living Things for more on this last characteristic. That posting discusses Stuart Kauffman's theory of Enablement in which living things are seen to exapt existing functions into new and unpredictable functions by discovering the “Adjacent Possible” of spring-loaded preadaptations.

Softwarephysics was originally meant to deal with the observation that software is now rapidly becoming the dominant form of self-replicating information on the planet. But the recent COVID-19 pandemic draws attention to the fact that all five waves of self-replicating information are still constantly coevolving with each other and that all five waves still possess the vast powers of self-replicating information to alter an entire planet. That is because all forms of self-replicating information can exapt existing functions into new and unpredictable functions and can then replicate in an exponential manner. For example, since COVID-19 originated in bats, but then jumped from bats to people, "we" have become the “Adjacent Possible” for COVID-19. In just a matter of a few months, the RNA in COVID-19 has managed to reshape the entire planet by altering nearly all human activities and ushering in a new "normal" that will last for many decades.

Comments are welcome at scj333@sbcglobal.net

To see all posts on softwarephysics in reverse order go to:
https://softwarephysics.blogspot.com/

Regards,
Steve Johnston