The chemical alphabet of genetic information contains only four "letters": the nucleotide bases adenine, guanine, cytosine and thymine (uracil in RNA). The information contained in a sequence of $n$ nucleotides in the DNA chain is then equal to $I^{(1)} = \log_2(4^n) = 2n$ bits. This is the first level of reception of the information. At the next level of reception, the information coded in a protein chain (enzyme) synthesised on the sequence of $n$ nucleotides has to be written down. Each of the 20 different amino acids is coded by three nucleotides (a three-letter word), so that although the information per amino acid is equal to $I^{(2)} = \log_2(20^n) = 4.32n$ bits, the information per nucleotide (letter) is equal to $I_{nuc} = I^{(2)}/3 = 1.44n$ bits, i.e. less than at the previous level. The decrease of information is a result of the degeneracy of the triplet code: the total number of codons ($4^3 = 64$) is higher than the number of amino acids (20). The redundancy of information at this level (for $N$ amino acids coded by $3N$ nucleotides) is $R^{(2)} = 1 - I_{nuc}/I^{(1)} = 1 - \log_2(20^N)/\log_2(4^{3N}) \approx 0.28$ and, correspondingly, its cost is $C^{(2)} = 1/(1 - R^{(2)}) \approx 1.4$.
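The arithmetic of the first two levels can be checked with a minimal sketch. The function names `redundancy` and `cost` are illustrative choices, not notation from the text; the formulas are the ones stated above.

```python
import math

def redundancy(info_per_letter, max_info_per_letter):
    """Redundancy R = 1 - I / I_max for one letter (nucleotide)."""
    return 1.0 - info_per_letter / max_info_per_letter

def cost(r):
    """Cost of information C = 1 / (1 - R)."""
    return 1.0 / (1.0 - r)

# Level 1: four equally likely nucleotides, 2 bits per letter.
bits_level1 = math.log2(4)

# Level 2: 20 amino acids, each written with a three-letter codon.
bits_per_amino_acid = math.log2(20)     # about 4.32 bits
bits_level2 = bits_per_amino_acid / 3   # about 1.44 bits per nucleotide

r2 = redundancy(bits_level2, bits_level1)
c2 = cost(r2)

print(f"I(1) per nucleotide = {bits_level1:.2f} bits")
print(f"I_nuc at level 2    = {bits_level2:.2f} bits")
print(f"R(2) = {r2:.2f}, C(2) = {c2:.1f}")
```

Multiplying the per-letter values by $n$ gives the totals for a chain of $n$ nucleotides.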
At the next level we take into account the so-called neutral mutations, in which one amino acid can be replaced by another without changing the properties of the protein. Hence, the number of truly irreplaceable amino acids decreases, and the amount of information decreases too. These mutations are very difficult to estimate, but their number is known to be connected with pair correlations between amino acids (two-letter words).
Note also that these correlations are significantly higher than the triplet, quadruplet, etc. correlations (Ebeling et al., 1990). Therefore, we can conclude that the genetic text (genome) contains, as a rule, six-letter words in a 20-letter alphabet. Then the possible number of texts of length $l$ is equal to $w_l = 20^{l/6}$, and the corresponding information per nucleotide is $(1/6)\log_2(20) \approx 0.72$ bits. The redundancy and the cost of information are $R^{(3)} = 1 - 0.72/2 = 0.64$ and $C^{(3)} = 1/(1 - R^{(3)}) \approx 2.8$.
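The third level can be sketched in the same way. The calculation assumes the definitions used for the second level, $R = 1 - I/I_{max}$ with $I_{max} = 2$ bits per nucleotide and $C = 1/(1 - R)$; the variable names are illustrative.

```python
import math

LETTERS_PER_WORD = 6    # one "word" of the 20-letter alphabet per six nucleotides
MAX_BITS_PER_LETTER = math.log2(4)   # level-1 value, 2 bits per nucleotide

# Information per nucleotide when every six letters carry one of 20 symbols.
bits_level3 = math.log2(20) / LETTERS_PER_WORD

# Redundancy and cost, defined as for the second level.
r3 = 1.0 - bits_level3 / MAX_BITS_PER_LETTER
c3 = 1.0 / (1.0 - r3)

print(f"I_nuc at level 3 = {bits_level3:.2f} bits")
print(f"R(3) = {r3:.2f}, C(3) = {c3:.2f}")
```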
The main problem here is what is implied by the length of the genetic text, $l$. The first idea, which was already used in Section 4.8, is to connect the value of $l$ with the amount of DNA in the cells of individuals at a certain taxonomic level. In general, the estimates of nuclear DNA content are given in picograms (pg, 1 pg = $10^{-12}$ g) or in base pairs (bp) of double-stranded DNA. Each strand is a linear polynucleotide chain consisting of four nucleotides, two purines and two pyrimidines, and it is commonly accepted that the average molecular weight of a nucleotide pair is approximately 618 Da. The conversion factors are: 1 bp = $1.02 \times 10^{-9}$ pg = 618 Da (Li and Grauer, 1991). For instance, the lowest amount of DNA (when only the non-repetitive DNA is taken into account) in cells of the group Annelids is equal to 0.07 pg (Fonseca et al., 2000). Converting to nucleotides (1 pg = $0.98 \times 10^{9}$ bp), we get for one nucleotide chain: $l = (0.98 \times 10^{9}) \times 0.07 \times (1/2) = 3.43 \times 10^{7}$ nucleotides. Then the possible number of genetic texts (virtual genomes) consisting of six-letter words is $w_l = 20^{l/6} = 20^{5.7 \times 10^{6}}$, and the corresponding information is $I^{(3)} \approx 1.71 \times 10^{7}$ bits. Note that in estimating $w_l$ Fonseca et al. stopped at the second level, where only the degeneracy of the code is taken into account. Then $w_l = 20^{l/3} = 20^{1.14 \times 10^{7}}$ and $I^{(2)} = 3.42 \times 10^{7}$ bits. If we repeat all these calculations for Mammals (the lowest amount of DNA: 3 pg), then $w_l = 20^{l/6} = 20^{2.44 \times 10^{8}}$ and $I^{(3)} \approx 7.32 \times 10^{8}$ bits.
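The DNA-content calculation can be traced step by step with the sketch below. One point is an inference rather than something the text states: the quoted information values are reproduced when the logarithm of 20 is taken to base e (ln 20 ≈ 3.0) instead of base 2, so the `base` parameter defaults to e here as an assumption. The function names are hypothetical.

```python
import math

BP_PER_PG = 0.98e9   # conversion factor quoted in the text: 1 pg = 0.98e9 bp

def text_length(dna_pg):
    """Length l of one nucleotide chain, using the text's formula
    l = (bp per pg) * (DNA mass in pg) * 1/2."""
    return BP_PER_PG * dna_pg * 0.5

def information(n_words, alphabet=20, base=math.e):
    """log of alphabet**n_words in the given base.
    ASSUMPTION: base e reproduces the quoted values; pass base=2 for bits."""
    return n_words * math.log(alphabet) / math.log(base)

def dna_content_estimate(dna_pg, letters_per_word):
    """Information of w_l = 20**(l / letters_per_word) possible texts."""
    return information(text_length(dna_pg) / letters_per_word)

annelids_l = text_length(0.07)                    # about 3.43e7 nucleotides
annelids_level3 = dna_content_estimate(0.07, 6)   # six-letter words
annelids_level2 = dna_content_estimate(0.07, 3)   # three-letter words
mammals_level3 = dna_content_estimate(3.0, 6)

print(f"Annelids: l = {annelids_l:.3g}")
print(f"Annelids: I(3) = {annelids_level3:.3g}, I(2) = {annelids_level2:.3g}")
print(f"Mammals:  I(3) = {mammals_level3:.3g}")
```

Calling `information(n_words, base=2)` gives the same quantities with a base-2 logarithm, which are larger by a factor of about 1.44.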
Another approach uses such values as the number of non-nonsense genes, $g$, and the number of amino acids coded by each gene (each gene is determined, on average, by a sequence of about 700 amino acids; Li and Grauer, 1991). Then the length of the text will be equal to $l = 700g$. For the group Annelids $g = 10{,}500$ and $l = 7.35 \times 10^{6}$. Since we consider only non-nonsense genes, a partial ordering has already been taken into account; therefore, $w_l = 20^{l} = 20^{7.35 \times 10^{6}}$ and the corresponding amount of information (information content) is $I_{nuc} = 2.2 \times 10^{7}$ bits. If we repeat all these calculations for Mammals ($g = 1.4 \times 10^{5}$), then $w_l = 20^{0.98 \times 10^{8}}$ and $I_{nuc} \approx 2.93 \times 10^{8}$ bits.
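The gene-count method reduces to two multiplications, sketched below. As an assumption inferred from the quoted numbers (not stated in the text), the logarithm of 20 is taken to base e, which reproduces the values given; the function name is hypothetical.

```python
import math

AMINO_ACIDS_PER_GENE = 700   # average gene length quoted in the text

def gene_count_estimate(n_genes, base=math.e):
    """Return (l, information) for w_l = 20**l with l = 700 * g.
    ASSUMPTION: base e reproduces the quoted values; pass base=2 for bits."""
    l = AMINO_ACIDS_PER_GENE * n_genes
    return l, l * math.log(20) / math.log(base)

annelids_l, annelids_info = gene_count_estimate(10_500)
mammals_l, mammals_info = gene_count_estimate(1.4e5)

print(f"Annelids: l = {annelids_l:.3g}, I = {annelids_info:.3g}")
print(f"Mammals:  l = {mammals_l:.3g}, I = {mammals_info:.3g}")
```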
Comparing these estimates with those made by the DNA-content method, we see that they are relatively close: $I^{(3)} \approx 1.71 \times 10^{7}$ and $I_{nuc} = 2.2 \times 10^{7}$ bits for Annelids, and $I^{(3)} \approx 7.32 \times 10^{8}$ and $I_{nuc} \approx 2.93 \times 10^{8}$ bits for Mammals. It is interesting that for Annelids the first estimate is slightly less than the second one, but for Mammals we see the opposite picture. Moreover, if we take into account the accuracy of all these estimates, then there is good agreement between them. Finally, we would like to emphasise that these results will be actively used in the next chapter.
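How close the two methods are can be expressed as a ratio of the quoted values. The sketch below only restates the numbers given above; the dictionary layout is an illustrative choice.

```python
# Quoted estimates: (DNA-content method, gene-count method).
estimates = {
    "Annelids": (1.71e7, 2.2e7),
    "Mammals": (7.32e8, 2.93e8),
}

# Ratio below 1: the DNA-content estimate is the smaller of the two.
ratios = {group: dna / genes for group, (dna, genes) in estimates.items()}

for group, ratio in ratios.items():
    print(f"{group}: DNA-content / gene-count = {ratio:.2f}")
```

For both groups the two estimates differ by less than a factor of three, with the ratio below one for Annelids and above one for Mammals.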