Genome-wide association study in plants using JMP® Pro 18

Introduction

A genome-wide association study (GWAS) is an approach that rapidly scans genetic markers, typically single nucleotide polymorphisms (SNPs), across the complete sets of DNA (genomes) of many subjects to find genetic associations with a particular phenotype or trait. Researchers use GWAS to identify genomic variants that are statistically associated with that trait.

GWAS are applied in both animals and plants. However, GWAS have been more successful in plants than in humans (Reference). In particular, it is hoped that GWAS can help with the challenge of sustainably feeding a growing global population, an issue that becomes more critical each year. By 2050, the Earth’s population is expected to be close to 10 billion people, which translates to roughly 3 billion more mouths to feed than in 2010. There is a clear need to better harness the ability of plants to provide us with food, hence the research and development aimed at breeding better plants by using genetic engineering. Here are some notable examples of GWAS applications to meeting this challenge:

  • Improving oil yield, drought tolerance, and vitamin E in sesame.
  • Identifying genes associated with stress tolerance, oil content, and seed quality in brassicas (which include broccoli, cabbage, cauliflower, and other greens).
  • Identifying a gene that can confer blight resistance in maize.

Researchers continue to explore these associations to enhance agriculture and understand adaptive processes.

 

The challenge

Applying association-mapping approaches in plants is complicated by the population structure present in most germplasm sets. Genetic variants can differ in frequency between populations because of their geographic backgrounds; this variation is referred to as population stratification. Because population structure can produce spurious associations, it has constrained the use of association studies in plant genetics. Association mapping holds great promise if true signals of functional association can be separated from the vast number of false signals generated by population structure.

 

GWAS takes the population structure into account

Population structure plays a crucial role in GWAS. Population structure refers to systematic differences in allele frequencies among subpopulations. In a randomly mating population, allele frequencies are expected to be roughly similar between groups. However, mating tends to be nonrandom to some degree, causing a structure to arise. For example, a barrier like a river can separate two groups of the same species and make it difficult for potential mates to cross. As a result, a mutation that arises in one group can become common there while remaining rare in the other. This mutation is not necessarily associated with a particular trait. Therefore, population structure can be a confounding variable in genetic studies, and false positive associations can occur if it is not properly addressed. Accounting for and controlling its effect is essential in GWAS. Mixed models have emerged as a way to correct for population structure and relatedness. Let’s take a closer look at how JMP Pro can handle this problem.

 

The data

For this analysis, we’ll use a rice data set that contains 343 lines from five different subpopulations, genotyped at a total of 8,336 markers. The map below shows the different subpopulations:

Picture1 Rice Subpopulation.png

Current distribution of the five major subpopulations of rice in Asia.

Reference: https://doi.org/10.1371/journal.pgen.0030133.g001

 

With this data, the idea is to find markers associated with the following traits: days of flowering, plant height at maturity, weight of individual panicle, and grain yield.

Graph 1 Distribution of the four traits.png

Distribution of the four traits

 

The resolution

Response screening

Response screening is a powerful platform that automates the process of conducting tests across many responses, which makes it well suited to genetic data. With JMP Pro 18, it is now possible to add population structure information to the model, such as an IBS (Identity By State) relationship matrix. This makes it possible to fit mixed models, using the responses as fixed effects and population structure as a random effect.

Graph 2 Dialog Box on Response Screening.jpg

Fit Model dialog box with Response Screening selected as the model type
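For readers who prefer notation, a mixed model of this kind is commonly written as follows. This is the textbook formulation, given here as an assumption for illustration; the blog does not state the exact parameterization that JMP Pro uses.

```latex
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} + \boldsymbol{\varepsilon},
\qquad
\mathbf{u} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_g^{2}\mathbf{K}\right),
\qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_e^{2}\mathbf{I}\right)
\]
```

Here \(\mathbf{y}\) holds the observed values for one test, \(\mathbf{X}\boldsymbol{\beta}\) is the fixed-effect part, \(\mathbf{u}\) is the random effect whose covariance is proportional to the relationship matrix \(\mathbf{K}\) (for example, an IBS matrix), and \(\boldsymbol{\varepsilon}\) is the residual error. The symbols \(\sigma_g^{2}\) and \(\sigma_e^{2}\) are the corresponding variance components.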

 

In my previous blog on population genetic structure analysis in humans, I described how to use the marker relatedness platform. To summarize, marker relatedness measures the relatedness between individuals by creating a so-called relationship matrix based on IBS (Identity By State), which examines the effects of marker identity on similarity between unrelated individuals.
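To make the idea of an IBS relationship matrix concrete, here is a minimal Python sketch. It assumes genotypes are coded as 0, 1, or 2 copies of one allele and that there are no missing values; the function name `ibs_matrix` and the toy data are hypothetical, and the marker relatedness platform in JMP Pro may scale or handle the data differently.

```python
def ibs_matrix(genotypes):
    """Return the pairwise IBS similarity matrix.

    genotypes: one list per individual, each holding allele counts
    (0, 1, or 2) for the same markers in the same order.
    Assumption: no missing genotypes.
    """
    n = len(genotypes)
    n_markers = len(genotypes[0])
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # At each marker, two individuals share 2 - |a - b| alleles by state.
            shared = sum(2 - abs(a - b)
                         for a, b in zip(genotypes[i], genotypes[j]))
            # Divide by the maximum possible (2 alleles per marker).
            value = shared / (2 * n_markers)
            matrix[i][j] = matrix[j][i] = value
    return matrix


# Toy data: four individuals, four markers (hypothetical values).
toy_genotypes = [
    [0, 0, 2, 1],
    [0, 0, 2, 1],
    [2, 2, 0, 1],
    [2, 1, 0, 0],
]
K = ibs_matrix(toy_genotypes)
for row in K:
    print(row)
```

In this toy example the first two individuals are identical at every marker, so their similarity is 1, while individuals from the two visibly different groups share far fewer alleles.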

Response screening was run first without and then with the population structure (using a mixed model) to determine the difference in terms of the level of signals.

Graph 3 Response Screening results without and with the relationship matrix.jpg

Response screening results without and with the relationship matrix

 

The report shows the significance of the association between a single marker and a single trait, both in a table and in a plot. In the table, a Logworth (-log10 of the p-value) shown in red indicates a strong association between marker and trait. In the effect plot, the red points and the horizontal red line show the raw p-values and their threshold; the blue points and the horizontal blue line show the FDR (false discovery rate) adjusted p-values and their threshold. Tests with FDR p-values that fall below the blue line are significant at the 0.05 level when adjusted for the false discovery rate. Tests with ordinary p-values that fall below the red line are significant at the 0.05 level without adjusting for the false discovery rate. In this way, the plot lets you read significance from either set of p-values. Hence, signals below the threshold are declared to be significant.
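The two quantities in this report can be sketched in a few lines of Python. The example below assumes the FDR adjustment is the Benjamini-Hochberg procedure, which is the usual choice but is not stated in this blog; the function names and the p-values are hypothetical.

```python
import math


def logworth(p_value):
    """Logworth is -log10 of the p-value, so small p-values give large values."""
    return -math.log10(p_value)


def fdr_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five marker-trait tests.
raw = [0.0001, 0.004, 0.035, 0.045, 0.5]
adj = fdr_adjust(raw)

n_raw_significant = sum(p < 0.05 for p in raw)
n_fdr_significant = sum(q < 0.05 for q in adj)
print("Significant without adjustment:", n_raw_significant)
print("Significant after FDR adjustment:", n_fdr_significant)
```

With these toy values, fewer tests remain significant after the adjustment than before it, which is the behavior the report is designed to reveal.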

In both cases, with raw and FDR-adjusted p-values, we can clearly see a difference between the analyses with and without population structure. Many more markers are significantly associated with a trait in the response screening without population structure than with it. Thus, with a mixed model approach, it’s possible to find associations between markers and various traits while simultaneously adjusting for population structure, thereby decreasing the false signals.

 

Manhattan plot

Now, let’s look at the results in another way. By taking the data table results from the response screening and joining them with the annotation (thus adding marker location and chromosome information), a so-called Manhattan plot can be created. In the Manhattan plot below, the x axis shows the markers grouped by chromosome (in color), and the y axis shows the logworth for each of the four responses, with population structure on the left and without population structure on the right. High logworth values are considered significant. For all the responses, the logworth values are lower when population structure is taken into account, which decreases the false signals.

Graph 5 Animated Manhattan Plot.png

Manhattan plot showing logworth for each response colored by chromosome with population structure (left) and without population structure (right)
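The join described above can be sketched in Python as follows. This is only an illustration of the idea, not the JMP workflow: the function name, the dictionary layout, and the toy values are hypothetical, and chromosome lengths are approximated by the largest marker position seen on each chromosome.

```python
def manhattan_coordinates(results, annotation):
    """Join screening results with marker annotation and compute x positions.

    results:    {marker_name: logworth}
    annotation: {marker_name: (chromosome, position)}
    Returns one dict per marker, ordered by chromosome and position.
    """
    # Inner join on marker name: keep markers present in both tables.
    joined = [
        (annotation[name][0], annotation[name][1], name, lw)
        for name, lw in results.items()
        if name in annotation
    ]
    joined.sort()

    # Approximate each chromosome length by its largest marker position.
    chrom_max = {}
    for chrom, pos, _, _ in joined:
        chrom_max[chrom] = max(chrom_max.get(chrom, 0), pos)

    # Offset each chromosome so that chromosomes sit side by side on the x axis.
    offsets, running = {}, 0
    for chrom in sorted(chrom_max):
        offsets[chrom] = running
        running += chrom_max[chrom]

    return [
        {"marker": name, "chromosome": chrom,
         "x": offsets[chrom] + pos, "logworth": lw}
        for chrom, pos, name, lw in joined
    ]


# Toy inputs (hypothetical): m4 has no annotation and is dropped by the join.
results = {"m1": 1.2, "m2": 5.3, "m3": 0.4, "m4": 2.0}
annotation = {"m1": (1, 100), "m2": (1, 500), "m3": (2, 50)}
points = manhattan_coordinates(results, annotation)
for point in points:
    print(point)
```

Plotting `x` against `logworth` and coloring by `chromosome` would then give the familiar Manhattan layout.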

 

Graph 4 bubble plot.gif

Animated Manhattan plot showing logworth for one response (plant height at maturity) colored by chromosome

 

Identifying the true and false signals is simply a matter of subtracting the logworth without population structure from the logworth with population structure. In the graph below, we can see the superimposed logworth with (in green) and without (in red) population structure. The arrows highlight markers that appear to be highly associated with a trait when population structure is ignored. The bottom of the graph shows the difference between the logworth with and the logworth without population structure, allowing the true and false positives to be distinguished. It clearly shows a substantial difference between the two analyses. Look at the number of false positives (negative difference values)!

 

Graph 6 showing the FDE logworth difference with and without population structure.png

Graph showing the logworth difference with and without population structure
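The subtraction described above can be sketched in Python. The flagging rule used here (significant without population structure, but not with it) and the threshold of -log10(0.05) are assumptions added for illustration; the blog itself only describes the difference and notes that negative values point to false positives. All names and values are hypothetical.

```python
import math

# Assumed significance threshold on the logworth scale (p = 0.05).
THRESHOLD = -math.log10(0.05)


def compare_logworth(with_structure, without_structure, threshold=THRESHOLD):
    """Compare logworth values from the two analyses, marker by marker."""
    rows = []
    # Only markers present in both result sets can be compared.
    for marker in sorted(with_structure.keys() & without_structure.keys()):
        lw_with = with_structure[marker]
        lw_without = without_structure[marker]
        rows.append({
            "marker": marker,
            # Negative values mean the signal shrank once structure was modeled.
            "difference": lw_with - lw_without,
            # Assumed rule: the signal disappears when structure is modeled.
            "likely_false_positive": lw_without > threshold >= lw_with,
        })
    return rows


# Toy logworth values (hypothetical).
with_structure = {"m1": 4.0, "m2": 0.8, "m3": 0.5}
without_structure = {"m1": 4.5, "m2": 3.9, "m3": 0.6}

rows = compare_logworth(with_structure, without_structure)
flagged = [row["marker"] for row in rows if row["likely_false_positive"]]
print("Likely false positives:", flagged)
```

In this toy example, marker m2 looks strongly associated when population structure is ignored but falls below the threshold once it is modeled, so it is flagged.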

 

Conclusion

This blog describes how JMP Pro can fit mixed models for GWAS that take population structure into account. False positive associations can occur if population structure is not properly addressed. When population structure plays an important role, mixed models are the right tool to mitigate its effects.

 

 

 

 

Last Modified: Jul 17, 2024 10:00 AM