Sunday, September 01, 2013

Some More Information About Information

As IT professionals, we need a clear understanding of the nature of information in order to perform our jobs effectively, and having a good working definition of information is key to that understanding. Unfortunately, there are currently several different formulations of the concept of information to be found within differing fields of study, and this makes the concept of information even more confusing than it needs to be. In softwarephysics we have been exclusively using Leon Brillouin’s concept of information, also known as negentropy, to explore the nature of information in an IT setting, because Brillouin’s formulation seems to be the most useful one for IT professionals. In many softwarephysics postings, we have also seen close correlations between the activities of IT professionals and the functions performed by living biological systems, so Brillouin’s concept of information should be quite valuable for biologists as well.

In Entropy - the Bane of Programmers and The Demon of Software, we discussed the second law of thermodynamics and Leon Brillouin’s concept of information. Brillouin defined the change in the information content of a system as the difference between its initial entropy and its final entropy after a change has been made:

     ∆Information = Si - Sf
     Si = initial entropy
     Sf = final entropy

Where the entropy of a system is defined by Boltzmann’s equation for entropy:

     S = k ln(N)
     N = number of microstates or ways of producing the macrostate of the system.
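
For readers who like to experiment, here is a minimal Python sketch of these two formulas. The function names are my own, and Boltzmann’s constant k defaults to 1, as we will also do later in this posting when working the configuration file example:

     import math

     def boltzmann_entropy(N, k=1.0):
         # S = k ln(N), where N is the number of microstates
         # that can produce the macrostate of the system
         return k * math.log(N)

     def brillouin_information(N_initial, N_final, k=1.0):
         # Delta Information = Si - Sf
         return boltzmann_entropy(N_initial, k) - boltzmann_entropy(N_final, k)

     # Example: going from 1,024 equally likely microstates down to a single
     # microstate yields ln(1024) - ln(1), or about 6.93 units of information
     print(brillouin_information(1024, 1))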

To make these ideas a bit easier to understand, we enlisted the use of poker hands to illustrate the above concepts, and we discovered that drawing three Aces to yield a hand with four Aces and a 2 of clubs resulted in a large increase in the information content of your hand, especially if your original hand consisted of an Ace of spades, 2 of clubs, 7 of hearts, 10 of diamonds and an 8 of spades. We also saw that physicists got very nervous if you started talking about destroying information because the idea of destroying information threatened all of their current effective theories. Currently, all of the theories of physics are reversible in time, meaning that they work just as well moving backwards in time as they do moving forwards in time. If it were possible to destroy the information necessary to return a system to its initial state, then the Universe would not be reversible in nature as the current theories of physics predict, and those theories would collapse.

However, as IT professionals we all know that it is indeed very easy to destroy information, like destroying the code for a functional program by simply including a few typos, so what is going on? It all boils down to how you want to define the concept of information in a manner that is useful to your area of work. The reason I like Leon Brillouin’s formulation for the concept of information is that it easily highlights the difference between useful information and useless information. In Brillouin’s formulation, things containing lots of useful information have very low entropies, and consequently, are very improbable and rare things, like the code for a totally bug-free program. In Brillouin’s formulation of information, it is also very easy to turn useful low-entropy information into useless high-entropy information by simply scrambling it. For example, if you take the source code file for an apparently bug-free program and scramble it with some random additions and deletions, you will most likely end up with a low-information, high-entropy mess. That is because the odds of creating an even better version of the program by means of random additions and deletions are quite low, while the odds of turning it into a mess are quite high. There are simply too many ways of messing up the program to win at that game. As we have seen, this is simply the second law of thermodynamics in action. Leon Brillouin’s formulation for the concept of information also goes hand in hand with our current understanding of energy. According to the first law of thermodynamics, energy can neither be created nor destroyed, but the second law of thermodynamics explains that it is very easy to degrade useful low-entropy energy, like the energy in gasoline, into the useless high-entropy waste heat that your car’s engine generates in abundance as you drive along. Similarly, using Brillouin’s formulation for information, the second law of thermodynamics is constantly degrading low-entropy useful information into high-entropy useless information, as the entropy of the entire Universe itself constantly increases with time.
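
To make the scrambling argument a bit more tangible, here is a purely illustrative Python sketch, nothing rigorous: it takes a tiny working program, stored as a string, applies a few random one-character insertions and deletions, and counts how many of the scrambled versions even compile. Most of them no longer compile at all, and even the rare mutants that do compile almost never behave like the original:

     import random

     # A tiny, working program to scramble (purely illustrative)
     good_code = "total = 0\nfor i in range(10):\n    total = total + i\nprint(total)\n"

     random.seed(42)
     trials = 10000
     still_compiles = 0
     for _ in range(trials):
         mutated = list(good_code)
         # apply three random single-character insertions or deletions
         for _ in range(3):
             if random.random() < 0.5 and len(mutated) > 1:
                 del mutated[random.randrange(len(mutated))]
             else:
                 mutated.insert(random.randrange(len(mutated) + 1),
                                random.choice("abcxyz(){}:=+ "))
         try:
             # compiling is a very weak test - it only checks the syntax,
             # not whether the mutant still does anything useful
             compile("".join(mutated), "<mutated>", "exec")
             still_compiles += 1
         except SyntaxError:
             pass

     print(still_compiles, "of", trials, "scrambled versions still compile")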

The Concept of Entropy and Information in Communications Theory
However, to make matters more confusing, there is another formulation for the concepts of entropy and information to be found in communications theory, developed by Claude Shannon in 1948. This formulation for the concept of information is very valuable in communications theory because it is useful in developing ways of encoding information digitally and then transmitting the digital data over a noisy channel with a limited bandwidth. It also can be used to define the speed with which digital information can be transmitted over a channel with a given bandwidth. No doubt, the cable company in your area runs many TV commercials making fun of the very slow DSL Internet connections that your local telephone company provides, in order to persuade you to obtain your Internet connection from the cable company. That is because the bandwidth of coaxial cable is much higher than the bandwidth of the simple copper wire that your telephone company possesses, and consequently, your cable company can push much more digital data down their coaxial cables than your poor phone company can push down their plain copper wires. Personally, I have one of those very slow phone company DSL Internet connections, rather than a high-speed coaxial connection from my cable company because my phone company has had more than 100 years of experience with keeping connections up, while my cable company only has about 30 years of experience with providing very spotty availability. As a member of MidOps on pager call for problems, it is much more important for me to have an Internet connection with an availability as close as possible to 100%, rather than to have the bandwidth necessary to push lots of bits down the line. That is because I really do not generate lots of bits of information as I push the buttons on my laptop at a human speed to fix problems.

I am not going to delve much further into communications theory because it is really not needed by softwarephysics, but I do wish to explain Claude Shannon’s concept of information because, unfortunately, it also uses the terms information and entropy, but in a different way than Leon Brillouin does, and consequently, can lead to a great deal of confusion. Claude Shannon was interested in encoding information digitally and then transmitting it over a noisy channel. To do that he realized that all information could be encoded by using a series of 1s and 0s called bits. We do this in IT today as well. For example, in the ASCII code set we assign A = “01000001” and Z = “01011010”. Similarly, we can define the integer 25 = “00011001” using a base 2 or binary representation. Now once we convert characters and numbers into binary 1s and 0s, we can then send them down a communications channel as a series of voltage spikes. Of course, all real communications channels are noisy, so there is always the possibility of misreading a message composed of 1s and 0s due to noise on the channel. To overcome that problem, you have to use an encoding scheme that not only encodes the information to be transmitted, but that also contains information that allows you to tell when a message has been garbled by noise. But in order to do that, Shannon first had to figure out how much information was in a message in the first place. For example, both of the messages below are encoded with a total of 16 1s and 0s:

     0000100000000000
     1001110010100101

However, the first message consists mainly of 0s, so it seems that it should be easier to apply some kind of error detection and correction scheme to the first message, compared to the second message, because the 1s are so rare in the first message. Doing the same thing for the second message should be much harder because the second message is composed of eight 0s and eight 1s. This led Shannon to conclude that the second message must contain more information than the first message. He also concluded that the 1s in the first message must contain more information than the 0s because the 1s were much less probable than the 0s, and consequently, the arrival of a 1 had much more significance than the arrival of a 0 in the message. Using this line of reasoning, Shannon proposed that if the probability of receiving a 0 in a message was p and the probability of receiving a 1 in a message was q, then the information H in the arrival of a single 1 or 0 must not simply be one bit of information. Instead, it must depend upon the probabilities p and q of the arriving 1s and 0s:

     H(p) = - p log2(p) - q log2(q)

Since in this case the message is only composed of 1s and 0s, it follows that:

     q =  1 -  p

Figure 1 shows a plot of the information H(p) of the arrival of a 1 or 0 as a function of p, the probability of a 0 arriving in a message when the message is only composed of 1s and 0s:

Figure 1 – A plot of Shannon’s Information/Entropy equation H(p) versus the probability p of finding a 0 in a message composed solely of 1s and 0s

Notice that the graph peaks to a value of 1.0 when p = 0.50 and has a value of zero when p = 0.0 or p = 1.0. Now if p = 0.50 that means that q = 0.50 too because:

     q =  1 -  p

Substituting p = 0.50 and q = 0.50 into the above equation yields the information content of an arriving 0 or 1 in a message, and we find that it is equal to one full bit of information:

     H(0.50)  =  -(0.50) log2(0.50) - (0.50) log2(0.50)  =  -log2(0.50)  =  1

And we see that value of H(0.50) on the graph in Figure 1 does indeed have a value of 1 bit.

To fully understand the above equation it is necessary to review the concept of logarithms. The above equation uses a base 2 logarithm because we are concerned with messages composed of only two symbols 1 and 0. Recall that the familiar log function found on scientific calculators is a base 10 logarithm log10 and that:

     10^0 = 1 so the log10(1) = 0
     10^1 = 10 so the log10(10) = 1
     10^2 = 100 so the log10(100) = 2
     10^3 = 1000 so the log10(1000) = 3

Similarly for a base 2 logarithm:

     2^0 = 1 so the log2(1) = 0
     2^1 = 2 so the log2(2) = 1
     2^2 = 4 so the log2(4) = 2
     2^3 = 8 so the log2(8) = 3

And for numbers that are less than 1.0, like probabilities, the logarithms are negative:

     10^-1 = 0.1 so the log10(0.1) = -1
     10^-2 = 0.01 so the log10(0.01) = -2
     10^-3 = 0.001 so the log10(0.001) = -3

Similarly for a base 2 logarithm:

     2^-1 = 0.5 so the log2(0.5) = -1
     2^-2 = 0.25 so the log2(0.25) = -2
     2^-3 = 0.125 so the log2(0.125) = -3
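
If you would rather check these logarithm tables with a few lines of Python than with a calculator, something like this works, using the standard library’s base 10 and base 2 logarithm functions:

     import math

     print(math.log10(1000))     # log10(1000) = 3
     print(math.log2(8))         # log2(8) = 3
     print(math.log10(0.001))    # log10(0.001) = -3
     print(math.log2(0.125))     # log2(0.125) = -3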

Now suppose the arriving message consists only of 0s. In that case, p = 1.0 and q = 0.0, and the information content of an incoming 0 or 1 is H(1.0), which calculates out to a value of 0.0 in our equation (taking 0 log2(0) = 0 by convention) and also in the plot of H(p) in Figure 1. This simply states that a message consisting only of arriving 0s contains no information at all. Similarly, a message consisting only of 1s would have p = 0.0 and q = 1.0, and our equation and plot yield H(0.0) = 0.0 too, meaning that a message consisting only of 1s conveys no information either. What we see here is that seemingly a “messy” message consisting of many 1s and 0s conveys lots of information, while a “neat” message consisting solely of 1s or solely of 0s conveys no information at all. When the probability of receiving a 1 or 0 in a message is 0.50 – 0.50, each arriving bit contains one full bit of information, but for any other mix of probabilities, like 0.80 – 0.20, each arriving bit contains less than a full bit of information. From the graph in Figure 1, we see that when a message has a probability mix of 0.80 – 0.20, each arriving 1 or 0 only contains about 0.72 bits of information. The graph also shows that it does not matter whether the 1s or the 0s are the more numerous bits because the graph is symmetric about the point p = 0.50, so a 0.20 – 0.80 mix of 1s and 0s also only delivers 0.72 bits of information for each arriving 1 or 0.
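
Here is a small Python sketch of Shannon’s two-symbol H(p) formula that reproduces the values we just read off of Figure 1. The function name is mine, and the 0 log2(0) = 0 convention is handled explicitly:

     import math

     def H(p):
         # Shannon information per arriving symbol for a message of 1s and 0s,
         # where p is the probability of a 0 and q = 1 - p is the probability of a 1
         q = 1.0 - p
         total = 0.0
         for prob in (p, q):
             if prob > 0.0:                      # by convention 0 * log2(0) = 0
                 total -= prob * math.log2(prob)
         return total

     print(H(0.50))   # 1.0 - a 50/50 mix delivers one full bit per arriving symbol
     print(H(0.80))   # about 0.72 bits per arriving symbol
     print(H(0.20))   # also about 0.72 - the curve is symmetric about p = 0.50
     print(H(1.00))   # 0.0 - a message of all 0s carries no information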

Claude Shannon went on to generalize his formula for H(p) to include cases where there were more than two symbols used to encode a message:

     H(p) = - Σ p(x) log2 p(x)

The above formula says that if you use 2, 3, 4, 5 ... different symbols to encode information, just add up the probability of each symbol multiplied by the log2 of the probability of each symbol in the message, and then flip the sign. For example, suppose we choose the symbols 00, 01, 10, and 11 to send messages and that the probabilities of sending a 1 and a 0 are both 0.50. That means the probability p for each symbol 00, 01, 10 and 11 is 0.25 because each symbol is equally likely. So how much information does each of these two-digit symbols now contain? If we substitute the values into Shannon’s equation, we get an answer of 2 full bits of information:

     H(0.25, 0.25, 0.25, 0.25) =  - 0.25 log2(0.25) - 0.25 log2(0.25)  - 0.25 log2(0.25) - 0.25 log2(0.25)  = 
     - log2(0.25) = 2

which makes sense because each symbol is composed of two one-bit symbols. In general, if all the symbols we use are n bits long, they will then all contain n bits of information each. For example, in biology genes are encoded in DNA using four bases A, C, T and G. A codon consists of 3 bases, and each codon codes for a particular amino acid or is an end-of-file Stop codon. On average, prokaryotic bacterial genes code for about 400 amino acids using 1200 base pairs. If we assume that the probabilities of all four bases, A, C, T and G, are the same for all the bases in a gene, namely 0.25 each, then we can use our analysis above to conclude that each base contains 2 bits of information because we are using 4 symbols to encode the information. That means a 3-base codon contains 6 bits of information, and the gene coding for a protein consisting of 400 amino acids contains 2400 bits of information, or 300 bytes of information in IT speak.
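
The generalized formula is just as easy to play with in Python. The sketch below (again, the function name is mine) reproduces the 2 bits per two-digit symbol worked out above and the DNA arithmetic, under the same assumption that all four bases are equally likely:

     import math

     def shannon_entropy(probabilities):
         # H = - sum of p(x) * log2 p(x) over all symbols x
         return -sum(p * math.log2(p) for p in probabilities if p > 0.0)

     # Four two-bit symbols 00, 01, 10 and 11, each with probability 0.25
     print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))     # 2.0 bits per symbol

     # Four DNA bases A, C, T and G, each assumed to have probability 0.25
     bits_per_base  = shannon_entropy([0.25] * 4)          # 2 bits per base
     bits_per_codon = 3 * bits_per_base                    # 6 bits per codon
     bits_per_gene  = 400 * bits_per_codon                 # 2400 bits for 400 amino acids
     print(bits_per_base, bits_per_codon, bits_per_gene, bits_per_gene / 8)   # 300.0 bytes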

Now here is where the confusion comes in about the nature of information. All along, using Brillouin’s formulation for the concept of information, we have been saying that “messy” things have lots of entropy and little information, while “neat” things have very little entropy and lots of information. The story goes that Claude Shannon was not quite sure what to call his formula for H(p). Then one day in the 1940s he happened to visit the mathematician and early computer pioneer John von Neumann, and that is when information and entropy got mixed together in communications theory:

“My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.’”

Unfortunately, with that piece of advice, we ended up equating information with entropy in communications theory.

So in Claude Shannon's Information Theory people calculate the entropy, or information content of a message, by mathematically determining how much “surprise” there is in a message. For example, in Claude Shannon's Information Theory, if I transmit a binary message consisting only of 1s or only of 0s, I transmit no useful information because the person on the receiving end only sees a string of 1s or a string of 0s, and there is no “surprise” in the message. For example, the messages “1111111111” or “0000000000” are both equally boring and predictable, with no real “surprise” or information content at all. Consequently, the entropy, or information content, of each bit in these messages is zero, and the total information of all the transmitted bits in the messages is also zero because they are both totally predictable and contain no “surprise”. On the other hand, if I transmit a signal containing an equal number of 1s and 0s, there can be lots of “surprise” in the message because nobody can really tell in advance what the next bit will bring, and each bit in the message then has an entropy, or information content, of one full bit of information.

This concept of entropy and information content is very useful for people who work with transmission networks and on error detection and correction algorithms for those networks, but it is not very useful for IT professionals. For example, suppose you had a 10-bit software configuration file and the only “correct” configuration for your particular installation consisted of 10 1s in a row like this “1111111111”. In Claude Shannon's Information Theory that configuration file contains no information because it contains no “surprise”. However, in Leon Brillouin’s formulation of information there would be a total of N = 2^10 = 1,024 possible microstates or configuration files for the 10-bit configuration file, and since the only “correct” version of the configuration file for your installation is “1111111111”, there is only N = 1 microstate that meets that condition.

Using the formulas we learned in The Demon of Software, we can now calculate the entropy of our single “correct” 10-bit configuration file and the entropy of all possible 10-bit configuration files:

Boltzmann’s Definition of Entropy (with Boltzmann’s constant k set to 1)
S = ln(N)
N = Number of microstates

Leon Brillouin’s Definition of Information
∆Information = Si - Sf
Si = initial entropy
Sf = final entropy

as:

Sf = ln(1) = 0

Si = ln(2^10) = ln(1024) = 6.93147

So using Leon Brillouin’s formulation for the concept of information the Information content of a single “correct” 10-bit configuration file is:

Si - Sf = 6.93147 – 0 = 6.93147

which, if you look at the table in The Demon of Software, contains a little more information than drawing a full house in poker without drawing any additional cards and would be even less likely for you to stumble upon by accident than drawing a full house.
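
Here is the same arithmetic as a few lines of Python, again with Boltzmann’s constant k set to 1:

     import math

     N_initial = 2**10      # all possible 10-bit configuration files
     N_final   = 1          # the single "correct" file "1111111111"

     S_initial = math.log(N_initial)     # ln(1024) = 6.93147...
     S_final   = math.log(N_final)       # ln(1)    = 0.0

     print(S_initial - S_final)          # about 6.93147 units of information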

So in Claude Shannon's Information Theory, a very “buggy” 10 MB executable program file would contain just as much information and would require just as many network resources to transmit as transmitting a bug-free 10 MB executable program file. Clearly, Claude Shannon's Information Theory formulations for the concepts of information and entropy are less useful for IT professionals than are Leon Brillouin’s formulations for the concepts of information and entropy.

What John von Neumann was trying to tell Claude Shannon was that his formula for H(p) looked very much like Boltzmann’s equation for entropy:

     S = k ln(N)

The main difference was that Shannon was using a base 2 logarithm, log2, in his formula, while Boltzmann used a base e natural logarithm, ln or loge, in his formula for entropy. But given the nature of logarithms, that really did not matter much. To see why, pull up the scientific calculator on your PC. It will have an ln button to find base e natural logarithms (loge) and a log button to find normal base 10 (log10) logarithms.

Figure 2 – Scientific calculators have an ln button to find base e natural logarithms loge and a log button to find normal base 10 log10 logarithms

It is very easy to convert from one logarithm to another by just multiplying by a constant:

     log2(x) = 3.3219 log10(x) = 3.3219 log(x)

     ln(x) = loge(x) = 2.3025 log10(x) = 2.3025 log(x)

Try using the calculator to validate that:

     log(100) = 2

     log2(100) = 2 * 3.3219 = 6.6438
     (hint: try using the x^y button to compute 2^6.6438)

     ln(100) = 4.6052

Notice that the log of a number less than 1.0, like a probability such as 0.5, is always negative. For example, log10(0.5) = -0.3010 and log2(0.5) = 3.3219 * -0.3010 = -1.0, as we saw above. Because the log of a number that is less than 1.0 is always negative, Shannon had to use minus signs in his equation to have it yield an H(p) Information/Entropy value that is positive.
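
The same checks can be done in a few lines of Python instead of on a calculator; note that the conversion constant 3.3219 is just 1/log10(2) and 2.3025 is just ln(10):

     import math

     x = 100.0
     print(math.log10(x))              # log10(100) = 2
     print(math.log2(x))               # log2(100) = 6.6438...
     print(3.3219 * math.log10(x))     # also about 6.6438
     print(math.log(x))                # natural log ln(100) = 4.6052...
     print(2.3025 * math.log10(x))     # also about 4.6052

     print(math.log10(0.5))            # -0.3010...
     print(math.log2(0.5))             # exactly -1.0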

The main point of confusion arises because in communications theory the concepts of information and entropy pertain to encoding and transmitting information, while in IT and many other disciplines, like biology, we are more interested in the amounts of useful and useless information in a message. For example, in communications theory, the code for a buggy 300,000 byte program contains just as much information as a totally bug-free 300,000 byte version of the same program and would take just as much bandwidth and network resources to transmit accurately over a noisy channel as transmitting the bug-free version of the program. Similarly, in communications theory a poker hand consisting of four Aces and a 2 of clubs contains just as much information and is just as “valuable” as any other 5-card poker hand because the odds of being dealt any particular card are the same for all the cards in a deck, and therefore, all messages consisting of 5 cards contain exactly the same amount of information. Similarly, all genes that code for a protein consisting of 400 amino acids contain exactly the same amount of information, no matter what those proteins might be capable of doing. However, in both biology and IT we know that just one incorrect amino acid in a protein or one incorrect character in a line of code can have disastrous effects, so in those disciplines the quantity of useful information is much more important than the number of bits of data to be transmitted accurately over a communications channel.

Of course, the concepts of useful and useless information lie in the eye of the beholder to some extent. Brillouin’s formula attempts to quantify this difference, but his formula relies upon Boltzmann’s equation for entropy, and Boltzmann’s equation has always had the problem of how to define a macrostate. There really is no absolute way of defining one. For example, suppose I invented a new version of poker in which I defined the highest ranking hand to be an Ace of spades, 2 of clubs, 7 of hearts, 10 of diamonds and an 8 of spades. The odds of being dealt such a hand are 1 in 2,598,960 because there are 2,598,960 possible poker hands, and using Boltzmann’s equation that hand would have a very low entropy of exactly 0.0 because N = 1 and ln(1) = 0.0. Necessarily, the definition of a macrostate has to be rather arbitrary and tailored to the problem at hand. But in both biology and IT we can easily differentiate between macrostates that work and macrostates that do not work, like comparing a faulty protein or a buggy program with a functional protein or program.
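
For the record, the number of possible 5-card poker hands and the zero entropy of a one-hand macrostate are easy to check in Python (math.comb needs Python 3.8 or later):

     import math

     total_hands = math.comb(52, 5)     # 2,598,960 possible 5-card poker hands
     print(total_hands)

     # A macrostate defined as one specific 5-card hand has only N = 1 microstate,
     # so its Boltzmann entropy (with k = 1) is exactly zero:
     print(math.log(1))                 # 0.0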

An IT Application of These Concepts
Recently at my place of employment, we lost the system disk for one of our nodes in a WebSphere Cell consisting of 6 servers or nodes. When the problem was first detected by UnixOps, I was paged out to bring down WebSphere on the server so that UnixOps could work with IBM that night to install a new disk for the server. When the server finally came back up, I restarted WebSphere on the server as usual, and had some programmers validate the applications in the Cell with test transactions. This is always a problematic effort because only 1/6th of the validation transactions actually hit the affected server, while the other 5/6ths of the validation transactions hit the other healthy servers, so this form of validation is not very effective, and in this case did not uncover a problem we had with the new disk. As usual, UnixOps had to restore the system disk using backup tapes, and unfortunately, an old obsolete tnsnames.ora file was installed on the disk by mistake. The tnsnames.ora file is an Oracle configuration file that defines database addresses for establishing connections to Oracle databases, and because we now had an old obsolete file on the server, certain WebSphere datasources were not working properly, and were causing some intermittent errors for some of our website end-users. When the problem was finally detected, we paged out Oracle DBA to take a look, and working with UnixOps, they pulled a more recent version of the tnsnames.ora file from the backup tapes. MidOps upper management was a bit perturbed that the wrong tnsnames.ora file had been restored to the server, so I was given the task of comparing the tnsnames.ora file on the affected server with the other tnsnames.ora files on the other 5 servers in the Cell. I pulled the tnsnames.ora file from each server, and using the Unix diff command, I compared all of the files for differences. I then found that there were actually two different versions of the tnsnames.ora file on the servers in the Cell. To make matters worse, I actually found four different versions of the tnsnames.ora file on the twin WebSphere Cell that we use to balance traffic with, for a total of 6 different versions of the tnsnames.ora file on 12 servers. IT Management was very surprised to find so many different versions of the tnsnames.ora file floating around on the WebSphere Cell servers because it is such a critical configuration file, and requested that a single “correct” tnsnames.ora file be distributed to all of the servers.
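
By the way, for anyone facing a similar cleanup, here is a rough Python sketch of one way to do the kind of comparison I did with the Unix diff command. It assumes, purely for illustration, that a copy of tnsnames.ora from each server has already been pulled down into a local directory named tnsnames_copies, with one subdirectory per server; it then groups identical copies together by hashing them, so you can see at a glance how many distinct versions are floating around:

     import hashlib
     from collections import defaultdict
     from pathlib import Path

     # Hypothetical layout: tnsnames_copies/<server_name>/tnsnames.ora
     copies_dir = Path("tnsnames_copies")

     versions = defaultdict(list)
     for path in sorted(copies_dir.glob("*/tnsnames.ora")):
         digest = hashlib.sha256(path.read_bytes()).hexdigest()
         versions[digest].append(path.parent.name)

     print(len(versions), "distinct versions of tnsnames.ora found")
     for digest, servers in versions.items():
         print(digest[:12], "->", ", ".join(servers))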

Being a softwarephysicist, I immediately began to cringe at the thought of trying to unravel all of the spaghetti to obtain a single “correct” composite tnsnames.ora file to be used by all of the servers. Softwarephysics told me that embarking upon such an endeavor was a very dangerous thing indeed. From the above analysis, we see that all 6 versions of the tnsnames.ora file are members of the same macrostate of being a functional file with zero errors for their particular server, and consequently, all contain the same amount of useful information. Coming up with a single composite tnsnames.ora file and installing it on all 12 servers would constitute a change, and according to the second law of thermodynamics, the total amount of entropy in the Universe must increase and the total amount of useful information must decrease when such a change is made. The trick to pulling off such a change, without causing an outage, is to dump the required increase of entropy and diminished amount of useful information into waste heat. But that is a very difficult thing to do indeed, and the odds are that something will be overlooked in coming up with a composite tnsnames.ora file that works for all 12 servers, because such a tnsnames.ora file would have far fewer constituent microstates, like a royal flush in poker, and would thus have a much lower entropy and contain much more useful information than the 6 versions of the file that we currently have floating around in production. Now having a composite tnsnames.ora file that works for all 12 servers would be the ideal situation, and it would indeed contain more useful information than the other tnsnames.ora files because, not only would it work on all 12 servers, but if the file were lost on one server, we could simply copy the file from one of the other servers, rather than trying to pull the file from a backup tape. But as we have seen, trying to produce such a composite tnsnames.ora file would be swimming upstream against Nature’s desire to constantly increase the amount of entropy and disorder in the Universe. That is why in IT we generally follow the prescription of “If it ain’t broke, don’t fix it”.

Whenever I find myself in such awkward situations, like having 6 versions of the same file, I always lament IT’s reluctance to heed the tenets of softwarephysics. Softwarephysics maintains that the only way to prevent such high-entropy messy situations from occurring is to rigorously follow standards at all times and for all things, as the biosphere does, and not to cave in to the expediency of the moment by rushing nonstandard software into production in order to hit a deadline. Oh what a tangled web you weave, when first you practice to NOT FOLLOW STANDARDS!

Comments are welcome at scj333@sbcglobal.net

To see all posts on softwarephysics in reverse order go to:
https://softwarephysics.blogspot.com/

Regards,
Steve Johnston