Feb 27, 2013

Using the Relationships Among Study Procedures to Assess Data Quality

This article appears in JMPer Cable, Issue 30, Summer 2015.

by Richard C. Zink, PhD, Principal Research Statistician Developer, JMP Life Sciences

JMP® Clinical simplifies data discovery, analysis and reporting for clinical trials. With its straightforward user interface and primary reliance on graphical summaries of results, JMP Clinical allows the entire clinical trial team to explore data to identify safety or quality concerns. The upcoming Correlated Findings analysis in JMP Clinical 6.0 enables you to uncover quality issues and misconduct in clinical trials.

In the early 1980s, the National Heart, Lung, and Blood Institute sponsored a multicenter study to assess the reproducibility of animal models for myocardial infarction [1]. The ultimate goal was to determine whether any such models could be used to evaluate the effectiveness of new therapies. During the course of the study, analyses performed at the coordinating center identified many unusual findings at one of four trial sites. Upon further investigation, numerous inconsistent or implausible relationships were identified among several of the variables. After approaching the lab chief at the problematic center, the project officer discovered that the medical fellow conducting the research was guilty of fabricating data for a study conducted the previous year. Armed with this new insight, further analysis uncovered a host of new problems, including noticeable differences in the data produced before and after the date the project officer approached the lab chief at the problematic site.

The previous example highlights an important point: data are highly structured, and human beings don’t effectively fabricate realistic data, particularly in the many dimensions that would be required for it to appear plausible [2]. Even in the absence of misconduct, examining higher-order moments such as the variance, skewness and kurtosis of each variable, or the pairwise correlations among variables within each site, can uncover a problem in data quality. Here, we focus on analyses of pairwise correlations, which form one of the reports available from our Data Quality and Fraud analysis tools. Analyses of correlations have been applied in the past to detect irregularities from questionnaires obtained in clinical trials [3].
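
The per-site summaries described above can be sketched in a few lines of Python. This is only an illustration, not the JMP Clinical implementation: the function names, the record layout (a list of dicts with a `site` key) and the use of population moment formulas are all assumptions made for the example.

```python
import math
from collections import defaultdict


def moments(values):
    """Return (variance, skewness, excess kurtosis) from population formulas.

    Assumes at least two distinct values; a constant variable has zero
    variance and would raise ZeroDivisionError.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m2, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_by_site(records, var_x, var_y):
    """Group records by site and compute the correlation of two variables.

    `records` is a hypothetical layout: an iterable of dicts that each hold
    a 'site' key plus numeric values under `var_x` and `var_y`.
    """
    by_site = defaultdict(lambda: ([], []))
    for rec in records:
        xs, ys = by_site[rec["site"]]
        xs.append(rec[var_x])
        ys.append(rec[var_y])
    return {site: pearson(xs, ys) for site, (xs, ys) in by_site.items()}
```

Calling `correlation_by_site(records, "ALT", "AST")` would return one correlation per site, which can then be compared against the correlation computed from all other sites.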

[Formula image not reproduced: Z-test for comparing correlations]
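
The formula for the test is not reproduced in this text. Because the article refers to a Z-test, distinguishes "untransformed" correlations, and lists Fisher's work among its references, a reasonable guess is the standard test based on Fisher's transformation: z = (atanh(r_site) - atanh(r_ref)) / sqrt(1/(n_site - 3) + 1/(n_ref - 3)). The sketch below implements that textbook form under this assumption; the function name and arguments are hypothetical, and the exact statistic used by JMP Clinical may differ.

```python
import math


def fisher_z_test(r_site, n_site, r_ref, n_ref):
    """Two-sided Z-test for equality of two independent correlations.

    Assumes |r| < 1 and n > 3 for both groups. Returns (z, p_value).
    """
    # Fisher's transformation makes the sampling distribution of a
    # correlation approximately normal with variance 1 / (n - 3).
    z_site = math.atanh(r_site)
    z_ref = math.atanh(r_ref)
    se = math.sqrt(1.0 / (n_site - 3) + 1.0 / (n_ref - 3))
    z = (z_site - z_ref) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

Here `r_site` and `n_site` would be the correlation and number of complete pairs at the suspect site, and `r_ref` and `n_ref` the same quantities pooled over all other sites.
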
Initially, I summarize all tests using a volcano plot to highlight the combinations of greater interest: those tests with large numerical differences and/or those meeting the criteria for statistical significance after applying a suitable multiplicity adjustment.

In Figure 1, the X axis represents the difference in untransformed correlations between the suspect site and its reference (all other sites). The Y axis represents the raw p-value from the Z-test defined above, plotted on the negative log10 scale, so smaller p-values appear higher on the plot. In general, the interesting site-pair combinations approach the upper corners of the plot and appear above the dashed red reference line. Markers shown as significant remain so after accounting for multiple comparisons. In Figure 1, numerous results are identified among pairs of lab tests for Site 28.
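
The coordinates of such a volcano plot are simple to compute. The sketch below is an illustration only: the input layout and function name are hypothetical, and the Benjamini-Hochberg false discovery rate procedure is used as one possible "suitable multiplicity adjustment", since the article does not say which adjustment JMP Clinical applies.

```python
import math

# Floor applied before taking logs so that a p-value of exactly 0
# (floating-point underflow) does not raise an error.
P_FLOOR = 1e-300


def volcano_points(results, alpha=0.05):
    """Compute volcano plot coordinates and a significance reference line.

    `results` is a hypothetical layout: a list of dicts with keys
    'label', 'r_site', 'r_ref' and 'p' (the raw p-value).
    Returns (points, reference_y); reference_y is None when no test
    is significant after adjustment.
    """
    m = len(results)
    # Benjamini-Hochberg: find the largest sorted p-value p_(k)
    # satisfying p_(k) <= alpha * k / m.
    cutoff = None
    for k, p in enumerate(sorted(r["p"] for r in results), start=1):
        if p <= alpha * k / m:
            cutoff = p
    points = []
    for r in results:
        points.append({
            "label": r["label"],
            "x": r["r_site"] - r["r_ref"],            # untransformed difference
            "y": -math.log10(max(r["p"], P_FLOOR)),   # raw p on -log10 scale
            "significant": cutoff is not None and r["p"] <= cutoff,
        })
    reference_y = None if cutoff is None else -math.log10(max(cutoff, P_FLOOR))
    return points, reference_y
```

Points with `y` at or above `reference_y` correspond to markers on or above the reference line; those far from zero on the X axis are the ones that land in the upper corners.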

Figure 1
Correlated Findings Volcano Plot

Figure 2 presents further analysis of the lab tests for Site 28. This graphical representation of the lower triangle of the correlation matrices for Site 28 and its reference highlights the magnitude of pairwise associations. With File > Preferences > Graphs > Filled Areas > Selected Outlined selected, it is possible to outline any selected correlations of interest, such as those that are highly statistically significant. Notice that Site 28 has much stronger positive correlations for the six selected pairs, which comprise the pairwise combinations of four lab tests: alanine aminotransferase (ALT), aspartate aminotransferase (AST), lactate dehydrogenase and prothrombin time. As an aside, some authors note that fabricated data tend to exhibit stronger correlation than non-fabricated data [6].

Figure 2
Correlation Plots for Site 28 laboratory measurements

Analyzing these four variables using Graph > Scatterplot Matrix generates Figure 3. This plot shows numerous bivariate outliers that contribute to the strong correlations of Site 28. Many of the outliers come from a single patient.

Figure 3
Scatterplot Matrix for selected Site 28 laboratory measurements

While extreme values for any single lab test might indicate a potential safety concern, multiple tests that are jointly extreme might indicate something more severe, such as a drug-induced liver injury. Running Hy’s Law Screening in JMP Clinical identifies this patient (among several others) as a potential case of liver toxicity, which is not surprising given the extreme results for the liver tests (ALT/AST) presented in Figure 3.


  1. Bailey, K.R. (1991). Detecting fabrication of data in a multicenter collaborative animal study. Controlled Clinical Trials 12: 741-752.
  2. Venet, D., Doffagne, E., Burzykowski, T., Beckers, F., Tellier, Y., Genevois-Marlin, E., Becker, U., Bee, V., Wilson, V., Legrand, C., Buyse, M. (2012). A statistical approach to central monitoring of data quality in clinical trials. Clinical Trials 9: 705-713.
  3. Taylor, R.N., McEntegart, D.J., Stillman, E.C. (2002). Statistical techniques to detect fraud and other data irregularities in clinical questionnaire data. Drug Information Journal 36: 115-125.
  4. Fisher, R.A. (1921). On the “probable error” of a coefficient of correlation deduced from a small sample. Metron 1: 3-32.
  5. Fisher, R.A. (1970). Statistical Methods for Research Workers, Fourteenth Edition. Darien, CT: Hafner Publishing Company.
  6. Akhtar-Danesh, N., Dehghan-Ko. M. (2003). How does correlation structure differ between real and fabricated data-sets? BMC Medical Research Methodology. Available at: