Population studies are a fascinating subfield of genetics that focuses on genetic differences within and among populations to reveal how a population has evolved genetically. The key concept I focus on in this blog is population structure: analyzing genetic information from individuals originating in different parts of the world to see whether JMP can group them by continent and ethnicity. Several techniques are used for this purpose, including multivariate embedding, marker relatedness, and linkage disequilibrium computed from marker statistics.
GSE10331 is a data set that sheds light on the genetic variation within the worldwide human population. This study aimed to uncover the migration, range expansion, and adaptation of the human species by analyzing genome-wide patterns of variation across individuals.
In this study, researchers analyzed and publicly released high-quality genotypes of more than 500K single-nucleotide polymorphisms (SNPs) in a worldwide sample of 28 ethnic origins (populations), spanning 22 countries and seven regions, for a total of 441 individuals. The data were downloaded and prepared for analysis in JMP Pro.
Here is the distribution of the individuals across regions, countries, and populations, along with their geographic mapping.
Counts of individuals grouped by region, country, and population (ethnicity)
Geographic mapping of the different countries analyzed in the study
Multivariate embedding analysis using genetic SNP data
Multivariate embedding was used as a first attempt to see whether the different individuals’ origins can be mapped and categorized into groups or clusters. Multivariate embedding maps a very high-dimensional data set into a low-dimensional space so that points that are near each other in the high-dimensional space stay near each other in the low-dimensional result. In this case, all 500K SNPs were included and embedded with the t-SNE method to show how the data cluster.
t-SNE component 1 versus t-SNE component 2
This analysis used the entire data set, containing all chromosomes. Each individual is represented by one dot. Colors reflect the seven regions. The resulting image clearly separates the individuals by region of origin, with individuals of African origin forming the most distant cluster (in red). The other origins are easy to distinguish: America in green, East Asia in orange, Oceania in yellow, Europe in turquoise, Central South Asia in blue, and Middle East in purple. Some interesting patterns show that European individuals are closer to populations from the Middle East and Central South Asia. The East Asia, America, and Oceania clusters are far apart from the others, which makes sense as human migration is influenced (or more precisely, restricted) by various factors, including geographic barriers such as mountains and oceans.
t-SNE component 1 versus t-SNE component 2 with label
Two other interesting patterns grabbed my attention. One was the small Central South Asia cluster in blue, close to East Asia in orange. A closer look at the ethnicity of the individuals in this Central South Asia cluster revealed that they had “Uygur” origins, a Turkic ethnic group originating from and culturally affiliated with the region of Central and East Asia. They are one of China’s 55 officially recognized ethnic minorities. The other outliers, which belong to the Middle East region in purple, are of Mozabite ethnicity from Algeria, which is in Africa. Interestingly, when decomposing the t-SNE analysis by chromosome and animating the t-SNE scores in the bubble plot, we can see that one Mozabite Algerian individual (labeled GSM264380) falls in either the African or the Middle East cluster, depending on the chromosome.
t-SNE decomposed by chromosome, animated in a bubble plot
Looking at the t-SNE scores for each individual chromosome, we can see this pattern even more clearly. In all chromosomes except 5, 10, 13, 14, 16, and 18, this individual falls in the Africa cluster (red); in those six chromosomes, it falls in the Middle East cluster (purple).
t-SNE analysis decomposed by chromosome
Marker relatedness
Marker relatedness analysis is another way to assess genetic relatedness between pairs of individuals based on their genetic markers. Its outputs are principal components and hierarchical clustering, both based on a chosen identity measure. Identity by state (IBS), which scores how many alleles two individuals share at each marker regardless of whether those alleles come from a common ancestor, is used for this example.
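To make the IBS idea concrete, here is a minimal Python sketch. It assumes genotypes are coded as minor-allele counts (0, 1, or 2) with `None` for missing calls, and it uses a common textbook definition (mean proportion of alleles shared). The function name and the toy genotypes are hypothetical, and this is not necessarily the exact formula JMP Pro applies.

```python
from typing import Optional, Sequence


def ibs_similarity(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Mean proportion of alleles shared identical by state between two individuals.

    Genotypes are assumed to be minor-allele counts (0, 1, 2); None marks a
    missing call, and such markers are skipped.
    """
    shared = 0
    compared = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue
        # Two individuals share 2, 1, or 0 alleles at a biallelic marker.
        shared += 2 - abs(ga - gb)
        compared += 2
    if compared == 0:
        raise ValueError("no markers genotyped in both individuals")
    return shared / compared


# Toy genotypes, invented for illustration only
ind_a = [0, 1, 2, 2, None, 0]
ind_b = [0, 2, 2, 0, 1, 1]
print(ibs_similarity(ind_a, ind_b))  # 0.6
```

A score of 1.0 means the two individuals carry identical genotypes at every compared marker; 0.0 means they are opposite homozygotes everywhere.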
The regions were used to assess relationships. The first analysis was the marker relatedness with all the markers (500K). Each region was compared with all other regions. From the output data of the marker relatedness, I used a heat map in Graph Builder to visualize the IBS scores for each pair of regions. High IBS scores (in red) indicate high relatedness; low IBS scores (in yellow) indicate low relatedness. From the first heat map below, we can see that Africa is the least similar to all other regions (see red highlighted square and low IBS score in light yellow). America and Oceania show the highest within-region similarity (squares outlined in blue).
Whole genome-based IBS scores represented in a heat map
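The region-by-region summary behind such a heat map can be sketched as follows. This is only an illustration of the idea, assuming complete 0/1/2 genotypes and a simple average of pairwise IBS scores per pair of regions; the function names, region labels "A" and "B", and genotypes are all made up, and JMP Pro may aggregate differently.

```python
from collections import defaultdict
from itertools import combinations


def ibs(a, b):
    """Proportion of alleles shared by state (genotypes coded 0/1/2, no missing values)."""
    return sum(2 - abs(x - y) for x, y in zip(a, b)) / (2 * len(a))


def region_ibs_matrix(genotypes, regions):
    """Average pairwise IBS for every pair of regions.

    Keys are sorted (region, region) tuples; an individual is never
    compared with itself.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, j in combinations(range(len(genotypes)), 2):
        key = tuple(sorted((regions[i], regions[j])))
        totals[key] += ibs(genotypes[i], genotypes[j])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy data, invented for illustration only
genotypes = [
    [0, 0, 2, 2],  # region A
    [0, 1, 2, 2],  # region A
    [2, 2, 0, 0],  # region B
    [2, 2, 0, 1],  # region B
]
regions = ["A", "A", "B", "B"]

matrix = region_ibs_matrix(genotypes, regions)
for pair, score in sorted(matrix.items()):
    print(pair, round(score, 3))
```

In this toy example the within-region cells (A, A) and (B, B) score 0.875, while the between-region cell (A, B) scores 0.125, which is the kind of contrast the heat map colors encode.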
The second marker relatedness analysis was done solely on the mitochondrial chromosome. Why? Over the last three decades, mitochondrial DNA (mtDNA) has been the most popular marker of molecular diversity (N. Galtier et al., 2009, Mitochondrial DNA as a marker of molecular diversity: a reappraisal). mtDNA is maternally inherited, passed from the mother to her offspring. Unlike nuclear DNA, mtDNA lacks recombination, leading to a clonal inheritance pattern. mtDNA accumulates mutations over time due to its high mutation rate. These mutations create distinct haplotypes, allowing researchers to study population history and migration.
Although the IBS scores are substantially higher overall (see heat map below), we can see the same kind of pattern: the African region has lower IBS scores than all the other regions.
mtDNA-based IBS scores represented in a heat map
Linkage disequilibrium in mitochondrial DNA
Linkage disequilibrium (LD) is the statistical association between alleles at different loci within a genome. LD analysis calculates a linkage disequilibrium correlation coefficient, which measures the nonrandom association between markers at different positions (loci) on the chromosome within a given population. In other words, it captures how often alleles at two loci occur together compared to what would be expected if the loci were independent and associated randomly.
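As a rough illustration of that definition, the sketch below computes the standard LD correlation coefficient r for two biallelic loci from haploid data, where each individual carries one allele per locus, coded 0 or 1. It uses the textbook formulas D = p_AB - p_A * p_B and r = D / sqrt(p_A (1 - p_A) p_B (1 - p_B)). The function name and the toy alleles are hypothetical, and the Marker Statistics platform may use a different estimator.

```python
from math import sqrt


def ld_r(locus_a, locus_b):
    """LD correlation coefficient r between two biallelic loci.

    Each list holds one allele per individual (haploid data), coded 0 or 1.
    """
    n = len(locus_a)
    if n == 0 or n != len(locus_b):
        raise ValueError("both loci need the same, nonzero number of individuals")
    p_a = sum(locus_a) / n  # frequency of allele 1 at locus A
    p_b = sum(locus_b) / n  # frequency of allele 1 at locus B
    # Observed frequency of carrying allele 1 at both loci
    p_ab = sum(1 for x, y in zip(locus_a, locus_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b  # deviation from independence
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        raise ValueError("r is undefined when a locus is monomorphic")
    return d / denom


# Toy alleles, invented for illustration only
print(ld_r([1, 1, 0, 0], [1, 1, 0, 0]))  # alleles always together: r = 1.0
print(ld_r([1, 1, 0, 0], [0, 0, 1, 1]))  # alleles never together: r = -1.0
print(ld_r([1, 0, 1, 0], [1, 1, 0, 0]))  # independent: r = 0.0
```

Positive r means the two alleles co-occur more often than independence predicts, negative r means less often, and r near zero means no detectable association.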
In the context of mtDNA, LD has been a topic of interest for understanding evolutionary history and population genetics. As mentioned previously, mitochondrial DNA is inherited solely from the mother in mammals and is considered clonal, meaning that it does not undergo recombination. As a result, mtDNA sequences can serve as a record of historical mutation events in maternal lineages. To analyze the association between pairs of alleles, the Marker Statistics platform in JMP Pro was used. Here, a plot shows the LD correlation coefficient for pairs of alleles, grouped by region. Red, positive values mean that the alleles at two markers occur together more often than expected; blue, negative values mean that they occur together less often than expected.
Linkage disequilibrium correlation coefficient in mtDNA per region
As there are a lot of mtDNA markers to analyze, it makes sense to filter on the distance between the two markers in each pair. Here, we look at LD for pairs of markers that are more than 100 nucleotides apart. It was extremely interesting to see that the same pairs of markers were similarly associated in the different regions, indicating that these regions are related.
Linkage disequilibrium correlation coefficient in mtDNA per region zoomed
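The distance filter described above can be sketched in a few lines of Python. The marker names and positions here are hypothetical placeholders, and the only element taken from the analysis is the threshold of 100 nucleotides.

```python
from itertools import combinations


def distant_pairs(positions, min_distance=100):
    """Return marker pairs whose positions differ by more than min_distance.

    positions maps a marker name to its nucleotide position on the chromosome.
    Pairs are reported in order of increasing position.
    """
    ordered = sorted(positions.items(), key=lambda item: item[1])
    return [
        (name_1, name_2)
        for (name_1, pos_1), (name_2, pos_2) in combinations(ordered, 2)
        if pos_2 - pos_1 > min_distance
    ]


# Hypothetical marker names and positions, for illustration only
positions = {"mt_a": 150, "mt_b": 200, "mt_c": 260, "mt_d": 1500}
print(distant_pairs(positions))
# [('mt_a', 'mt_c'), ('mt_a', 'mt_d'), ('mt_b', 'mt_d'), ('mt_c', 'mt_d')]
```

Only the pairs that pass this filter would then go on to the LD calculation, which keeps the plot focused on markers that are not immediate neighbors.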
Below is an example of such similarity, obtained by filtering on specific linked markers. Interestingly, Africa does not share this pattern of similarity, indicating again that the African region is genetically distinct from all the other regions.
Linkage disequilibrium correlation coefficient in mtDNA per region for individual marker
In conclusion, by using different techniques in JMP Pro, such as multivariate embedding, marker relatedness, and linkage disequilibrium, the analysis was able to demonstrate the genetic population structure of individuals from different origins.