Analysis of molecular variance


We’ve already encountered π, the nucleotide diversity in a population, namely

$$\pi = \sum_{ij} x_i x_j \delta_{ij}$$

where $x_i$ is the frequency of the ith haplotype and $\delta_{ij}$ is the fraction of nucleotides at which haplotypes i and j differ. It shouldn’t come as any surprise to you that, just as there is interest in partitioning diversity within and among populations when we’re dealing with simple allelic variation (Wright’s F-statistics), there is interest in partitioning diversity within and among populations when we’re dealing with nucleotide sequence or other molecular data. The approach I’m going to describe is known as Analysis of Molecular VAriance (AMOVA). We’ll see later that AMOVA can be used very generally to partition variation when there is a distance we can use to describe how different alleles are from one another, but for now, let’s stick with nucleotide sequence data and think of $\delta_{ij}$ simply as the fraction of nucleotide sites at which two sequences differ.
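To make the definition of π concrete, here is a minimal Python sketch that computes it from aligned haplotype sequences and their frequencies. The function names and the three toy haplotypes are illustrative assumptions, not part of any particular package, and the sum runs over all ordered pairs (i, j), with identical pairs contributing zero.

```python
from itertools import product


def pairwise_difference(seq_a, seq_b):
    """Fraction of aligned sites at which two sequences differ (delta_ij)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def nucleotide_diversity(haplotypes, frequencies):
    """pi = sum over all ordered pairs (i, j) of x_i * x_j * delta_ij."""
    pi = 0.0
    pairs = zip(haplotypes, frequencies)
    for (hap_i, x_i), (hap_j, x_j) in product(pairs, repeat=2):
        pi += x_i * x_j * pairwise_difference(hap_i, hap_j)
    return pi


# Hypothetical data: three haplotypes of 10 sites with made-up frequencies.
haplotypes = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT"]
frequencies = [0.5, 0.3, 0.2]
pi = nucleotide_diversity(haplotypes, frequencies)
print(round(pi, 3))  # 0.082
```

Here the pairwise differences are 0.1, 0.2, and 0.1, so π = 2(0.5·0.3·0.1 + 0.5·0.2·0.2 + 0.3·0.2·0.1) = 0.082.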


Many innovative analytical methods have been developed recently to assess and accommodate genetic background heterogeneity. The vast majority of these methods involve some form of cluster analysis, although some more recent methods do not. For example, hierarchical clustering strategies can be used to assess genetic background clustering, and, like other cluster analysis methods, require the construction of a measure of the similarity or dissimilarity (genetic distance) between all pairs of the N individual genomes or population allele frequency profiles (e.g., between-group variation, FST) comprising a sample. The resulting N × N similarity or distance matrix is then explored statistically to identify clusters of individuals or populations that exhibit greater or lesser similarity. Problems inherent to this approach involve the choice of a similarity metric, deciding which cluster method is most appropriate (e.g., single linkage, complete linkage, etc.), the determination of the optimal number of clusters representing the data, and the biological meaning of the clusters.
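As an illustration of the clustering step described above, the following is a minimal pure-Python sketch of single-linkage agglomerative clustering on a precomputed N × N distance matrix. The function name, the toy matrix, and the choice to stop at a user-supplied number of clusters are assumptions made for the example; choosing that number is exactly one of the problems noted above.

```python
def single_linkage(distance, n_clusters):
    """Agglomerate N items into n_clusters groups using single linkage.

    distance is a symmetric N x N list of lists of pairwise distances.
    Returns a list of clusters, each a sorted list of item indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(distance))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is that of the closest pair.
                d = min(distance[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical distances among four individuals.
distance = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
groups = single_linkage(distance, n_clusters=2)
print(groups)  # [[0, 1], [2, 3]]
```

Replacing `min` with `max` in the linkage step would give complete linkage, which can produce different clusters from the same matrix.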

With respect to the choice of a similarity metric for cluster analysis, the simplest marker-based method for the assessment of genetic similarity between two individuals is to calculate the fraction of alleles shared identical by state (IBS) by those individuals over all the loci for which the individuals have been genotyped. If N individuals have been genotyped, then all N × N pairs of individuals can be assessed in this way. In addition to providing a foundation for some cluster analysis methods, graphical displays of the similarity matrix can be produced that allow visual assessment of whether subgroups of individuals with similar genetic backgrounds exist in the data. This approach has been used widely and, when presented graphically as a dendrogram, is often referred to as an allele-sharing “tree of individuals”. One problem with the simple IBS sharing measure of genetic background similarity, however, is that it does not account for allele frequencies. Consider, for example, two individuals who share rare alleles: these individuals are more likely to have arisen from the same (unique) population in which those alleles arose. In this situation one may want to consider “weighting” allele sharing at each locus by the frequency of the shared (or unshared) alleles. Pairwise measures of genetic similarity that accommodate allele frequencies have been put forward and are often used in ecological and nonhuman population genetics analysis settings.
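The unweighted IBS sharing measure can be sketched in a few lines of Python. The sketch assumes biallelic markers with each genotype coded as the number of copies (0, 1, or 2) of one designated allele, so the number of alleles shared IBS at a locus is 2 minus the absolute difference in counts. The coding scheme, function names, and toy genotypes are assumptions made for the example, and no frequency weighting is applied.

```python
def ibs_fraction(genotype_a, genotype_b):
    """Fraction of alleles shared identical by state across all loci.

    Genotypes are lists of allele counts (0, 1, or 2) at the same loci.
    """
    if len(genotype_a) != len(genotype_b):
        raise ValueError("genotypes must cover the same loci")
    shared = sum(2 - abs(a - b) for a, b in zip(genotype_a, genotype_b))
    return shared / (2 * len(genotype_a))


def ibs_matrix(genotypes):
    """N x N matrix of pairwise IBS sharing fractions."""
    n = len(genotypes)
    return [[ibs_fraction(genotypes[i], genotypes[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical genotypes for three individuals at four loci.
genotypes = [
    [0, 1, 2, 2],
    [0, 1, 2, 1],
    [2, 1, 0, 0],
]
similarity = ibs_matrix(genotypes)
print(similarity[0][1], similarity[0][2], similarity[1][2])  # 0.875 0.25 0.375
```

A dissimilarity suitable for clustering can then be taken as one minus each entry, which is one common convention rather than the only choice.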

Cluster analysis approaches can be extended by making more explicit and rigorous assumptions about the ancestral populations from which the individuals in a sample arose. Thus, specific ancestry informative markers (AIMs), which show large frequency differences between ancestral populations, can be used to quantify the degree of admixture among individuals in a sample.

When an individual genotyped on such markers possesses variants that are more frequent in one of the chosen ancestral populations, that individual's ancestral relationship to this population can be inferred. Obviously, one needs to have identified the appropriate AIMs in advance of such analyses, and this requires assumptions about the ancestral populations contributing to the individual genetic backgrounds reflected in a sample.
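The inference described here can be illustrated with a simple likelihood comparison. The sketch below assumes biallelic AIMs, genotypes coded as 0, 1, or 2 copies of a designated allele, Hardy-Weinberg proportions within each ancestral population, and independent loci. It assigns an individual to the single population under which the genotype is most likely, and does not estimate admixture proportions. The population names and allele frequencies are hypothetical.

```python
import math


def genotype_log_likelihood(genotype, allele_freqs):
    """Log-likelihood of a multilocus genotype under one population.

    Assumes Hardy-Weinberg proportions and independent loci; allele
    frequencies must lie strictly between 0 and 1 to avoid log(0).
    """
    total = 0.0
    for g, p in zip(genotype, allele_freqs):
        q = 1.0 - p
        # Probabilities of carrying 0, 1, or 2 copies of the allele.
        prob = (q * q, 2 * p * q, p * p)[g]
        total += math.log(prob)
    return total


def most_likely_population(genotype, population_freqs):
    """Return the best-supported population and all log-likelihoods."""
    scores = {name: genotype_log_likelihood(genotype, freqs)
              for name, freqs in population_freqs.items()}
    return max(scores, key=scores.get), scores


# Hypothetical AIM allele frequencies in two ancestral populations.
population_freqs = {
    "pop_A": [0.9, 0.8, 0.1],
    "pop_B": [0.1, 0.2, 0.9],
}
individual = [2, 2, 0]
best, scores = most_likely_population(individual, population_freqs)
print(best)  # pop_A
```

The large frequency differences between `pop_A` and `pop_B` are what make these markers informative: with similar frequencies in both populations, the two log-likelihoods would be nearly equal.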

In the following we describe a flexible alternative to cluster analysis–based methods for the statistical assessment of genetic background similarities among populations or individuals. The proposed method does not necessarily rely on AIMs, but does require genotype information on at least a few hundred (possibly fewer when AIMs are included) genetic markers (null loci) such as microsatellites, single nucleotide polymorphisms (SNPs), and/or insertion–deletion polymorphisms.

Although one can use markers that are not completely independent, in the sense that they have alleles in linkage disequilibrium, doing so may require a greater number of markers to make up for the lack of independence. Null loci can include genotype data available from, e.g., previous genome-wide association or linkage studies involving the subjects or populations of interest, and could thus allow for a retrospective analysis of sample genetic background structure without additional genotyping. As in cluster analysis, the proposed method involves the construction of a genetic similarity matrix. However, it does not require cluster analysis to test hypotheses about the relationships of the individuals or populations in a sample. Rather, the method assumes that interest lies in testing the relationship between a particular grouping factor (e.g., race, country of origin, cohort, or geographical locale) or quantitative measure (such as age, cholesterol level, or weight) and variation in the genetic similarities of the individuals or populations collected. Therefore, it does not require the determination of the optimal number of clusters or, e.g., principal components, representing the data.
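One way such a test can be set up for a categorical grouping factor is a distance-based pseudo-F statistic with a permutation p-value, in the spirit of AMOVA. The sketch below is an illustration under stated assumptions, not necessarily the exact procedure intended here: it takes a symmetric matrix of pairwise distances (for instance, one minus a similarity), partitions the sum of squared distances into within-group and among-group parts, and shuffles the group labels to build a null distribution. All names and the toy data are hypothetical, and quantitative predictors are not covered.

```python
import random


def pseudo_f(distance, labels):
    """Pseudo-F ratio of among-group to within-group variation."""
    n = len(labels)
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    a = len(groups)
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(distance[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = 0.0
    for members in groups.values():
        ss = sum(distance[i][j] ** 2
                 for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += ss / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permutation_p_value(distance, labels, n_permutations=999, seed=0):
    """Observed pseudo-F and its p-value from shuffling group labels."""
    rng = random.Random(seed)
    observed = pseudo_f(distance, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if pseudo_f(distance, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical distances among four individuals from two locales.
distance = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
labels = ["north", "north", "south", "south"]
f_stat, p_value = permutation_p_value(distance, labels)
```

For this toy matrix the total sum of squares is 0.6575 and the within-group sum is 0.025, giving a pseudo-F of 50.6. With only four individuals the permutation p-value cannot be small, since a third of all relabelings reproduce the observed grouping. The sketch also assumes at least two groups and nonzero within-group variation.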
