Co-dominant markers

In a diploid species, each co-dominant marker will identify one allele in a homozygous individual and two alleles in a heterozygous individual (Figure 2.4). This ability to distinguish between homozygotes and heterozygotes is one of the most important features of co-dominant markers because it means that we can easily calculate the allele frequencies for pooled samples (such as populations). Allele frequency simply refers to the frequency of any given allele within a population, i.e. it tells us how common a particular allele is. If we had a diploid population with 30 individuals then there would be a total of 30 × 2 = 60 alleles at any autosomal locus. If 12 individuals had the homozygous genotype AA and 18 individuals had the heterozygous genotype Aa at a particular locus, then the frequency of allele A is [2(12) + 18]/60 = 42/60 = 0.7 or 70 per cent and the frequency of allele a is 18/60 = 0.3 or 30 per cent. As we will see in later chapters, numerous analytical methods in population genetics are based at least partially on allele frequencies.

Figure 2.4 Gel showing the genotypes of four individuals based on one microsatellite (co-dominant) locus (1A-4A) and several AFLP (dominant) loci (1B-4B). According to the microsatellite locus, individuals 1 and 3 are heterozygous for alleles that are 142 and 146 bases long, whereas individuals 2 and 4 are homozygous for alleles that are 144 and 150 bases, respectively. Since there are two of each allele in this sample of eight alleles, the frequency of each microsatellite allele is 0.25. According to the AFLP marker, which screens multiple loci, all four individuals are genetically distinct but we cannot identify homozygotes and heterozygotes, nor can we readily calculate allele frequencies.
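The allele-frequency arithmetic above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `allele_frequencies` and the representation of each genotype as a pair of allele labels are assumptions made for the example, not part of any particular software package.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} for a list of diploid genotypes.

    Each genotype is a pair of allele labels, e.g. ("A", "a").
    (Hypothetical helper written for this illustration.)
    """
    # Count every allele copy; each diploid individual contributes two.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Worked example from the text: 12 AA homozygotes and 18 Aa heterozygotes.
population = [("A", "A")] * 12 + [("A", "a")] * 18
freqs = allele_frequencies(population)

# Microsatellite genotypes described in the Figure 2.4 caption (allele sizes in bases).
gel = [(142, 146), (144, 144), (142, 146), (150, 150)]
gel_freqs = allele_frequencies(gel)
```

Because the marker is co-dominant, both alleles of every individual can be counted directly, which is exactly what makes this calculation possible.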

It is important to note that although each co-dominant marker characterizes a single locus, most projects will use multiple co-dominant markers to generate data from a number of different loci so that conclusions are not based on a single, possibly atypical, locus. The main drawback to using these types of markers is that they tend to be a relatively time-consuming and expensive way to generate data, and in practice this can limit the number of loci that are genotyped.


Allozymes

In Chapter 1 we learned that allozymes were among the first markers to unite molecular genetics and ecology when they were used to quantify the levels of genetic variation within populations. Since their inception in the 1960s, allozymes have played an ongoing role in studies of animal and plant populations, although in recent years they have featured less prominently than DNA markers. Allozymes benefit from their co-dominant nature and may be more time- and cost-effective than some other markers because they do not require any DNA sequence information. However, as we noted in Chapter 1, they provide conservative estimates of genetic variation because their variability depends entirely on non-synonymous substitutions in protein-coding genes. In addition, allozymes are of limited utility when we are interested in the evolutionary relationships between different alleles: if an individual has allele B, there is no reason to believe that its ancestor had allele A. In other words, it is not always possible to identify an ancestor and its descendant.

Another property of allozymes is that they are functional proteins and therefore are not always selectively neutral. This can be both an advantage and a disadvantage. A lack of neutrality can be a disadvantage if a marker is being used to test whether or not populations are genetically distinct from one another. The free-swimming larvae of the American oyster (Crassostrea virginica) can travel relatively long distances if swept along on ocean currents. Populations that are not connected by currents may therefore be genetically distinct from one another, a hypothesis that was tested in a genetic survey of populations located along the Atlantic and Gulf coasts of the USA (Karl and Avise, 1992). A comparison of mtDNA and six anonymous nuclear sequences clearly showed that populations around the Gulf of Mexico were, in fact, genetically distinct from those located along the Atlantic coast, a finding that is consistent with the expectation of very low dispersal between coastlines that are not connected by currents. Variation at six allozyme loci, on the other hand, revealed no genetic differences between the two geographical areas, presumably because natural selection has been maintaining the same alleles in different populations. If the researchers had looked at only allozyme data they probably would have concluded that larvae regularly travelled between the Atlantic and Gulf coasts, a finding that would have been difficult to reconcile with the ocean currents in that region.

On the other hand, a non-neutral marker can be useful if we are looking for evidence of adaptation. Mead's sulphur butterfly Colias meadii showed some interesting patterns of variation in phosphoglucose isomerase (PGI), an enzyme involved in glycolysis, the pathway that provides fuel for insect flight (Watt et al., 2003). Because flight ability is related to fitness, the allele that confers the best flight ability should be selected for, and therefore a level of genetic uniformity may be expected at the locus coding for PGI. This prediction was supported only partially by a comparison of PGI alleles from C. meadii that were sampled from lowland (below the tree line) and alpine (above the tree line) habitats in central USA. Populations showed a high level of genetic uniformity over several hundred kilometres within habitats but a marked and abrupt shift in allele frequencies between habitats.

Both of these trends are apparently driven by natural selection. Colias butterflies spend their adult life within a neighbourhood radius that seldom exceeds 1.5 km. These low levels of dispersal mean that genetic uniformity over hundreds of kilometres must be maintained by a selective force, in this case the relationship between PGI alleles and fitness. Selection also explains the contrasting allele frequencies between alpine and lowland habitats. Because the two habitats are delineated by the tree line, they may be expected to have different thermal (and other abiotic) properties. The activity of PGI varies with temperature, and the authors of this study suggest that alternative PGI alleles may be selected for under different thermal regimes (Watt et al., 2003).

The markers that we will be discussing in the rest of this chapter all target variation in DNA as opposed to proteins. Although allozymes are often subject to selection pressures, DNA markers are more likely to be neutral because they often target relatively variable sequences that, in turn, are less likely to be selectively constrained. However, it is important to bear in mind that not all DNA markers are selectively neutral. In some cases we will specifically discuss non-neutral DNA markers. In other cases neutrality may be assumed, although it is always possible that an apparently non-functional region of DNA is subject to selective pressures that are acting on a genetic region to which it is linked, the so-called hitch-hiking effect (Maynard Smith and Haigh, 1974). A more detailed discussion of genetic markers and natural selection is included in Chapter 4.

Restriction fragment length polymorphisms

The first widespread markers that quantified variation in DNA sequences (as opposed to proteins) were restriction fragment length polymorphisms (RFLPs).

RFLP data are generated using restriction enzymes, which cut DNA at short (usually four to six base pairs), specific sequences. Examples of restriction enzymes include AluI, which cuts DNA when it encounters the sequence AGCT, and EcoRV, which cuts in the middle of the sequence GATATC. Digesting purified DNA with one or more restriction enzymes can turn a single piece of DNA into multiple fragments. If two individuals have different distances between two restriction sites, the resulting fragments will be of different lengths. The RFLPs therefore do not survey the entire DNA sequence, but any mutations that add or remove a recognition site for a particular enzyme, or that change the length of sequence between two restriction sites, will be reflected in the sizes and numbers of the fragments that are run out on a gel (Figure 2.5).
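The effect of gaining or losing a recognition site can be sketched as a simple in-silico digest. This is an illustrative sketch only: the function `digest`, the two short allele sequences and the assumption of linear DNA are all hypothetical, chosen so that one allele carries two AluI sites and the other carries one. The cut positions used (AluI cutting between AG and CT, EcoRV cutting between GAT and ATC) follow the descriptions in the text.

```python
def digest(sequence, site, cut_offset):
    """Return the fragment lengths produced by cutting linear DNA.

    `site` is the recognition sequence and `cut_offset` is the number of
    bases of the site that remain on the left-hand fragment.
    (Hypothetical helper written for this illustration.)
    """
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(cut - start)   # length of the piece before this cut
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(len(sequence) - start)  # the final piece
    return fragments

# Two hypothetical alleles that differ by one base (C -> T) in the second AluI site.
allele_1 = "TTGACAGCTGGATCCTAAGCTCCGTA"  # two AGCT sites
allele_2 = "TTGACAGCTGGATCCTAAGTTCCGTA"  # second site destroyed by the mutation

bands_1 = digest(allele_1, "AGCT", 2)   # AluI cuts AG^CT
bands_2 = digest(allele_2, "AGCT", 2)
ecorv = digest("AAAGATATCTTT", "GATATC", 3)  # EcoRV cuts GAT^ATC
```

Here the single point mutation changes the banding pattern from three fragments to two, even though the rest of the sequence is never inspected.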

Figure 2.5 Three different RFLP genotypes result from sequence differences that affect the restriction enzyme recognition sites (designated as /). At this locus, individuals A and B are homozygous for alleles that have two and three restriction sites, respectively. Individual C is heterozygous, with two restriction sites at one allele and three restriction sites at the other allele. The numbers of bands that would be generated by the RFLP profiles are shown in the resulting gel image.

Analysis of RFLPs can be done on either an entire genome (nuclear or organelle) or a specific fragment of DNA. The traditional method involves digesting DNA with one or more enzymes and then running out the fragments on a gel. These are then transferred onto a membrane that is placed in a solution containing multiple single-stranded copies of a particular sequence, all of which have been labelled radioactively or fluorescently (this is known as a probe). The single-stranded probe will hybridize to the bands that contain its complementary sequence, and these bands then can be identified from the radioactive or fluorescent label. The number of bands produced will depend on the region surveyed and the enzyme used. For example, a digestion with AluI produces around 341 bands in tobacco chloroplast DNA, whereas EcoRV produces only 36 bands (Shinozaki et al., 1986). The same enzymes generate approximately 64 bands and three bands, respectively, in human mitochondrial DNA (Anderson et al., 1981). A comparison of the number and sizes of labelled bands among individuals provides an estimate of overall genetic similarity. This method is useful for screening relatively large amounts of DNA but is fairly cumbersome and time-consuming.

A more straightforward method of generating RFLP data is to first amplify a specific fragment of DNA using PCR and then digest the amplified product with enzymes. The fragments then can be visualized after they are run out on a high-resolution gel. This technique is known as PCR-RFLP. Development of PCR-RFLP markers inevitably involves a period of trial and error during which different combinations of primers and enzymes must be screened before enough variable sites can be identified, but overall it is a fairly straightforward technique. In one study, PCR-RFLP markers were used to compare regions of the chloroplast genome of heather (Calluna vulgaris) collected from Western European populations (Rendell and Ennos, 2002). Four combinations of primers and enzymes revealed a total of eight mutations that collectively defined twelve different haplotypes. The distributions of these haplotypes showed high levels of diversity within populations and also substantial genetic differences among populations. The authors compared their results with an earlier study based on nuclear allozyme data and concluded that, unlike the earlier examples of coniferous trees in Chapter 1, seeds are more important than pollen for the long-distance dispersal of heather.

DNA sequences

In Chapter 1 we saw how DNA sequences can be obtained from fragments of DNA that have been amplified by PCR. Although all genetic markers quantify variations in DNA, sequencing is the only method that identifies the exact base pair differences between individuals. This is an important feature of DNA sequencing because it leaves little room for ambiguity: by comparing two sequences we can identify exactly where and how they are different. As a result, sequencing allows us to infer the evolutionary relationships of alternate alleles. This is possible because, barring back-mutations, each mutation acquired by a specific lineage remains there, even after additional mutations occur. In other words, if an allele with a sequence of GGGATATACGATACG mutates to a new allele with a sequence of CGGATATACGATACG, then all descendants of the individual with the new allele will have a C instead of a G at the first base position, even if subsequent mutations occur at other sites along the sequence. Generally speaking, the more mutations that a pair of individuals has in common, the more closely related they are to one another, a concept that will be developed further in Chapter 5.
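The idea that a mutation persists in a lineage while later mutations accumulate at other sites can be illustrated with a short comparison of aligned sequences. The sketch below uses the two alleles given in the text; the third sequence, `descendant`, and the function name `point_differences` are hypothetical additions for the example.

```python
def point_differences(seq_a, seq_b):
    """List (position, base_a, base_b) for every mismatch between two
    aligned sequences of equal length; positions are counted from 1.
    (Hypothetical helper written for this illustration.)
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]

ancestral = "GGGATATACGATACG"   # original allele from the text
derived = "CGGATATACGATACG"     # new allele from the text: G -> C at position 1
# Hypothetical descendant of the derived allele with a further G -> A change at position 10.
descendant = "CGGATATACAATACG"

first_mutation = point_differences(ancestral, derived)
accumulated = point_differences(ancestral, descendant)
later_only = point_differences(derived, descendant)
```

The descendant still carries the C at the first position, so it shares that mutation with the derived allele and differs from it only at the newer site.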

Sequence data were used to unravel the evolutionary history of the Hawaiian silversword alliance, a group of 28 endemic Hawaiian plant species in the sunflower family (Baldwin and Robichaux, 1995). These plants are of interest because collectively they demonstrate substantial morphological and ecological variation.

Throughout the Hawaiian archipelago they inhabit a range of wet (including sedgeland and forest) and dry (including grassland and shrubland) habitats, in contrast to their less ecologically diverse continental relatives. By comparing sequences from coding and non-coding regions of the nuclear ribosomal DNA genes, the authors of this study were able to identify both shared and unique mutations, which in turn allowed them to conclude which species are most closely related to one another. They were then able to reconstruct the events that led to the evolution of such a diverse group. The sequence data suggest that this group of species arose from a single ancestor that was dispersed, presumably by birds, to the Hawaiian archipelago some time in the past. By combining ecological and genetic data, the authors then could go one step further and conclude that shifts between wet and dry habitats occurred on multiple occasions. This would suggest that ecological diversification played an important role in the speciation of this alliance.

The variable rates of sequence evolution (Table 2.4) mean that we can use relatively rapidly evolving sequences for comparing closely related taxa, and more slowly evolving sequences for comparing distantly related taxa. In recent years, the choice of appropriate gene regions has been facilitated by the growing availability of sequence data. Nevertheless, although sequencing theoretically can be applied to any genomic region, our knowledge of chromosomal sequences is still inadequate in most taxa and there is a shortage of universal nuclear primers. Universal primers are more abundant for plant and particularly animal organelle genomes; in fact, animal mtDNA has been the source of data in most sequence-based ecological studies.

Table 2.4 Evolutionary rates of some DNA sequences, expressed as average divergence (% per million years); these are estimates averaged over multiple loci. All estimates are from Li (1997), with the exception of the value given for mitochondrial protein-coding regions in mammals, which is from Brown, George and Wilson (1979). To put the low values in perspective, recall that the diversity of 0.1% in the human nuclear genome translates into a roughly three million base pair difference between individuals.

[Table 2.4 rows (numerical values not reproduced here): nuclear DNA, non-synonymous and synonymous sites in mammals, Drosophila and plants (monocot), plus one further nuclear entry for mammals; chloroplast DNA, non-synonymous and synonymous sites in plants (angiosperm); mitochondrial DNA, non-synonymous and synonymous sites in plants (angiosperm) and protein-coding regions in mammals.]

Although sequence data can be extremely informative, obtaining these data is quite expensive (although decreasingly so) and time-consuming. Development time will be longer if a number of sequences need to be screened before appropriately variable regions are identified. Furthermore, many studies benefit from having data from more than one genetic region, which further adds to the time and expense. Recently, however, a relatively new class of markers known as single nucleotide polymorphisms (SNPs) has been gaining in popularity because these markers specifically target variable DNA bases at multiple loci.

Single nucleotide polymorphisms

Single nucleotide polymorphisms (SNPs) refer to single base pair positions along a DNA sequence that vary between individuals. Most SNPs (pronounced snips) have only two alternative states (i.e. each individual has one of two possible nucleotides at a given SNP locus) and are therefore referred to as biallelic markers. Although technically just another way of looking at sequence variation, SNPs are given their own classification because they provide a new approach for finding informative sequence data; DNA sequencing generally entails a comparison of sequences between individuals to see how much variation exists, whereas individual polymorphic sites must be identified before they can be classified as SNPs. Once we know that particular sites are variable, we can use these SNPs to genetically characterize both individuals and populations. Although still in its infancy, the use of SNPs as molecular markers seems to hold great potential.

There is no doubt that SNPs are widespread. In the human genome they account for approximately 90 per cent of genetic variation (Collins, Brooks and Chakravarti, 1998). A survey of multiple taxa including plants, mammals, birds, insects and fungi suggested that a SNP will be revealed for every 200-500 bp of non-coding DNA and every 500-1000 bp of coding DNA that are sequenced (Brumfield et al., 2003, and references therein). This proved to be a conservative estimate in pied and collared flycatchers, Ficedula hypoleuca and F. albicollis, because when researchers screened around 9000 bp from each species, they discovered 52 SNPs in pied flycatchers and 61 SNPs in collared flycatchers (Primmer et al., 2002). This translates into an average frequency of approximately one SNP per 175 bp and 150 bp for pied and collared flycatchers, respectively. This is encouraging, as the search for SNPs in the nuclear genome will initially be random in most non-model species.
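The flycatcher figures can be checked with a one-line calculation. The sketch below assumes exactly 9000 bp were screened in each species (the text says "around 9000 bp"), so the results are approximate; the function name `bp_per_snp` is hypothetical.

```python
def bp_per_snp(bases_screened, snps_found):
    """Average number of sequenced bases per SNP discovered."""
    return bases_screened / snps_found

# Assumes exactly 9000 bp screened per species; the text gives this only approximately.
pied = bp_per_snp(9000, 52)
collared = bp_per_snp(9000, 61)

# Both values fall below the 200 bp lower bound of the multi-taxon estimate
# for non-coding DNA, i.e. SNPs were found more often than that survey predicted.
more_frequent_than_survey = pied < 200 and collared < 200
```

Rounded to a convenient figure, these values correspond to the one SNP per roughly 175 bp and 150 bp quoted above.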

SNPs can be identified in a relatively straightforward manner by sequencing PCR products that have been amplified using universal or species-specific primers, or by sequencing anonymous loci such as those amplified by the multi-locus methods outlined below. By targeting multiple loci, researchers should be able to identify a number of SNPs distributed across multiple unlinked sites throughout the nuclear or organelle genomes. The mutation rates of SNPs appear to be in the order of 10⁻⁸ to 10⁻⁹ (Brumfield et al., 2003). This range is lower than the mutation rates of some other markers such as microsatellites (see below) and therefore the most promising application of SNPs in molecular ecology currently appears to be the elucidation of processes that occurred some time in the past. However, SNPs have been developed only recently, and as increasing numbers are characterized, SNPs are likely to prove suitable for a range of other applications, such as using SNP genotypes to identify individuals and to assess levels of genetic variation within populations (Morin, Luikart and Wayne, 2004).


Microsatellites

Microsatellites, also known as simple sequence repeats (SSRs) or short tandem repeats (STRs), are stretches of DNA that consist of tandem repeats of 1-6 bp. An example of a microsatellite sequence is the dinucleotide repeat (CA)12, which consists of 12 repeats of the sequence CA (CACACACACACACACACACACACA). In this case the complementary DNA sequence would have the microsatellite (TG)12. Microsatellites are located throughout nuclear and chloroplast genomes and have also been found in the mitochondrial genomes of some species (Figure 2.6). The initial development of microsatellite markers can take considerable time and money. The usual approach is to clone random fragments of DNA into a library and then screen this library with a microsatellite probe (in much the same way as RFLPs are identified with probes). Clones that contain microsatellites are then isolated and sequenced, and primers that will amplify the repeat region are designed from flanking non-repetitive sequences (Figure 2.7). Once primers have been designed, data can be acquired rapidly by using these primers to amplify microsatellite alleles in PCR reactions. The PCR products then can be run out on a high-resolution gel that will reveal the size of each allele. The number of species from which microsatellite loci have been characterized is growing almost daily, and the sequences that flank microsatellite loci are often conserved between closely related species, which means that microsatellite primers sometimes can be used to generate data from multiple species (e.g. Lippe, Dumont and Bernatchez, 2004; Provan et al., 2004).

Figure 2.6 Diagrammatic representation showing part of a chromosome across which three microsatellite loci (locus 1, locus 2 and locus 3) are distributed (note that sequences are provided for only one strand of DNA from each chromosome). This particular individual is homozygous at locus 1 because both alleles are (TA)4, heterozygous at locus 2 because one allele is (TAA)8 and the other is (TAA)7, and heterozygous at locus 3 because one allele is (GC)7 and the other is a (GC) allele with a different number of repeats.

Figure 2.7 A DNA sequence that includes a microsatellite region that was isolated from the freshwater bryozoan Cristatella mucedo (Freeland et al., 1999). The microsatellite, which is (AG)53, is underlined. The flanking sequence regions in bold show the locations of the primers that were used to amplify this microsatellite in a PCR reaction.
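The repeat notation used for microsatellites can be made concrete with a short sketch. It expands a motif such as (CA)12, derives the sequence of the complementary strand, and scans a sequence for perfect tandem repeats of 1-6 bp. The function names, the four-base flanking sequences and the `min_repeats` threshold are all hypothetical choices for illustration; real microsatellite screening tools handle imperfect and compound repeats, which this sketch does not.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def expand(motif, repeats):
    """Write out a repeat such as (CA)12 in full."""
    return motif * repeats

def reverse_complement(sequence):
    """Sequence of the complementary strand, read in its own 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]

def find_microsatellites(sequence, min_repeats=4):
    """Return (start, motif, repeat_count) for perfect tandem repeats of
    1-6 bp motifs; `start` is a 0-based index. (Hypothetical helper.)
    """
    # Lazy motif match so the shortest repeating unit is reported.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(sequence)
    ]

ca12 = expand("CA", 12)
other_strand = reverse_complement(ca12)

# Hypothetical flanking sequences on either side of the repeat.
locus = "GGTT" + ca12 + "AATT"
hits = find_microsatellites(locus)

# Alleles (TAA)8 and (TAA)7 from Figure 2.6 differ in length by one repeat unit.
size_difference = len(expand("TAA", 8)) - len(expand("TAA", 7))
```

Because alleles at a locus differ by whole repeat units, PCR products amplified with the same flanking primers differ in length by multiples of the motif size, which is what a high-resolution gel resolves.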

Microsatellites mutate much more rapidly than most other types of sequences, with estimated mutation rates of around 10⁻⁴ to 10⁻⁵ events per locus per replication in yeast (Strand et al., 1993) and around 10⁻³ to 10⁻⁴ in mice (Dallas, 1992). This is substantially higher than the estimated overall point mutation rate of around 10⁻⁹ to 10⁻¹⁰ (Li, 1997). These high rates of mutation in microsatellites are ascribed most commonly to slipped-strand mis-pairing during DNA replication (Chapter 1), which, because it can result in either the gain or loss of a single repeat unit, has given rise to the stepwise mutation model (SMM; Kimura and Ohta, 1978). Alternatively, the infinite alleles model (IAM; Kimura and Crow, 1964) allows for mutations in which multiple repeats are simultaneously gained or lost, but it also assumes that any new allele size has not been encountered previously within a population. Mutation models are important for the analysis of genetic data, but reconciling particular models with the evolution of microsatellites is complicated by the fact that, although mutations often involve single repeats, multiple repeats are periodically gained or lost following a single mutation. At other times, insertions or deletions in the flanking sequences will alter the size of the amplified region. There is also considerable evidence suggesting that the mutation of microsatellites is influenced by the number and size of the repeat motif and also by the complexity of the microsatellite, e.g. whether it is composed of one or multiple repeat motifs (Estoup and Cornuet, 1999).
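The contrast between the two mutation models can be sketched as two small functions acting on repeat counts. This is a deliberately simplified illustration, not an implementation of any published simulator: the function names are hypothetical, gains and losses are assumed to be equally likely under the SMM, and the IAM is approximated by drawing from a finite range of sizes (`max_repeats`) that excludes every size already present.

```python
import random

def smm_mutate(repeats, rng):
    """Stepwise mutation model: gain or lose exactly one repeat unit.

    Assumes gains and losses are equally likely; an allele with a single
    repeat can only gain, so the count never reaches zero.
    """
    step = rng.choice((-1, 1)) if repeats > 1 else 1
    return repeats + step

def iam_mutate(existing_sizes, rng, max_repeats=60):
    """Infinite alleles model (finite approximation): the new allele may
    differ by any number of repeats but must be a size not yet present.
    """
    candidates = [n for n in range(1, max_repeats + 1) if n not in existing_sizes]
    return rng.choice(candidates)

def simulate_smm(start, mutations, seed=0):
    """Follow one lineage through a series of stepwise mutations."""
    rng = random.Random(seed)
    history = [start]
    for _ in range(mutations):
        history.append(smm_mutate(history[-1], rng))
    return history

lineage = simulate_smm(start=12, mutations=50)
novel = iam_mutate({10, 11, 12}, random.Random(1))
```

Under the stepwise sketch a lineage can revisit a size it has held before, whereas the infinite-alleles sketch never returns an existing size; real microsatellites fall somewhere between the two.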

Microsatellite data are not particularly useful for inferring evolutionary events that occurred in the relatively distant past. Their rapid rate of mutation and their tendency to either increase or decrease in size means that size homoplasy may often occur. This can be illustrated by an example of two ancestral alleles at the same locus, one with 20 dinucleotide repeats and the other with 16 dinucleotide repeats. If the larger allele loses two repeats and the smaller allele gains two repeats, whether in single steps or in one multi-repeat event, then both lineages will have arrived at alleles with 18 repeats. These two new alleles, each with 18 repeats, may appear to be two copies of the same ancestral sequence, but the evolutionary histories of the two alleles are in fact quite different. Size homoplasy means that ancestor-descendant relationships may be difficult to untangle from microsatellite data.
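Size homoplasy can be sketched by tracking each allele's ancestry alongside its repeat count. In the hypothetical example below, the `Allele` class and `mutate` function are illustrative names; each lineage undergoes two single-repeat mutations so that alleles descended from the 20-repeat and 16-repeat ancestors meet at 18 repeats.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Allele:
    repeats: int  # what a gel reveals, via fragment size
    origin: str   # ancestral lineage; not recoverable from size alone

def mutate(allele, change):
    """Return a new allele with the repeat count changed and ancestry kept."""
    return Allele(allele.repeats + change, allele.origin)

large = Allele(20, "20-repeat ancestor")
small = Allele(16, "16-repeat ancestor")

from_large = mutate(mutate(large, -1), -1)  # 20 -> 19 -> 18
from_small = mutate(mutate(small, +1), +1)  # 16 -> 17 -> 18

identical_in_size = from_large.repeats == from_small.repeats
identical_by_descent = from_large.origin == from_small.origin
```

The two alleles are indistinguishable on a gel yet have separate histories, which is why allele size alone can mislead inferences about ancestry.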

On the other hand, the high mutation rates of microsatellites mean that there are often multiple alleles at each locus, and this high level of polymorphism makes them suitable for inferring relatively recent population genetic events. East African cichlid fishes were therefore prime candidates for microsatellite analysis, because thousands of endemic species evolved in Lakes Malawi and Victoria within the last 700 000 years, and some species are believed to be only around 200 years old (Kornfield and Smith, 2000). Initial explorations of these species using mtDNA or allozymes revealed little information. The problem was that, even when polymorphic genetic regions were identified, the recency of speciation events meant that most alleles were still shared among taxa because there had not been enough time for species-specific alleles to evolve. In the 1990s, however, microsatellite markers identified much higher levels of variation within and among cichlid species (reviewed in Markert, Danley and Arnegard, 2001). As a result, researchers have been able to use microsatellite data to resolve some aspects of the evolutionary history of cichlid groups (Kornfield and Parker, 1997; Sultmann and Mayer, 1997), although their analyses were somewhat hampered by size homoplasy.

The variability of microsatellites means that, unlike some of the more slowly evolving gene regions, they can also be used to discriminate genetically between individuals and populations. This application has provided some interesting insights into cichlid mating systems. In one study, a combination of behavioural and microsatellite data was used to investigate the role of assortative mating in speciation. Although hybrids from the same lake are often fully fertile, females were found to consistently select males based on their highly divergent colour patterns, ignoring the overall shape similarity that might otherwise blur the division between species. Conclusions from the behavioural data were supported by microsatellite data, which identified the different morphs as genetically distinct taxonomic groups (van Oppen et al., 1998). The high variability, co-dominant nature and increasing availability of microsatellites have made them one of the most popular types of markers in population genetics; however, their extensive development time means that there is still considerable support for dominant markers, to which we shall now turn.
