May 27, 2014

Assessing the similarity of clinical trial subjects within study site

We’ve reached the end of our series of posts on fraud detection in clinical trials (for now, at least). Our final discussion focuses on the similarity of subjects within the clinical site, a topic that I hinted at in my response to a comment to one of my earlier posts. As part of that discussion, we described an upcoming JMP Clinical Analytical Process (AP) that would search for duplicated records within and across study subjects contained within a clinical site.

There are, however, some shortcomings to this particular tactic. First and foremost, if an investigator or other member of the site personnel is going to engage in such activities, it seems unlikely that he or she would be so careless as to copy pages or records verbatim without modifying some of the values. Merely changing a duplicated heart rate by a single BPM would prevent the above check from detecting any wrongdoing. Further, while we may be able to generate frequencies of exact matches between two subjects across several domains, we are not really generating a similarity metric between subjects, since we cannot directly compare records that differ.

Based on our discussion from last time, we would like to use as much of a subject’s data as possible to assess the similarity between any two individuals within a study center. A metric such as Mahalanobis distance or Euclidean distance (with each covariate centered and scaled) can be used to ascertain just how “alike” any two subjects are. Further, we can use hierarchical clustering techniques to assess whether there are subgroups of subjects that appear related to one another.
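To make the distance idea concrete, here is a minimal Python sketch (not JMP Clinical code) of the second metric mentioned above: each covariate is centered and scaled, and then all pairwise Euclidean distances are computed. The function name and the shape of the input are illustrative assumptions.

```python
import numpy as np

def scaled_euclidean(X):
    """Pairwise Euclidean distances after centering and scaling each covariate.

    X: array of shape (n_subjects, n_covariates), one row per subject.
    Returns an (n_subjects, n_subjects) symmetric distance matrix.
    """
    # Standardize each covariate so no single variable dominates the distance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # d_ij = sqrt( sum_k (z_ik - z_jk)^2 ) via broadcasting.
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

A distance of exactly 0 between two different subjects flags identical rows, i.e., a verbatim copy of one subject's data.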

The upcoming JMP Clinical AP Cluster Subjects within Study Site performs the analyses described in the preceding paragraph. Similar to the Multivariate Inliers and Outliers AP, data from CDISC Findings, Events and Interventions domains are used to generate a data set comprising one row per study subject. However, rather than compute the Mahalanobis distance to the centroid of all subjects, a Euclidean distance matrix (with each covariate centered and scaled) is generated for each center to assess the similarity of all subjects within that center. We use Euclidean distance for this application because PROC DISTANCE is capable of computing distance in the presence of missing values for covariates. Users can, however, delete variables with high rates of missing data from the analysis (the default here is also 5%).
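As a rough illustration of the two ideas in this paragraph, the hypothetical helpers below compute a Euclidean distance over only the covariates observed for both subjects, rescaled to the full covariate count (one common convention for missing data, not necessarily the exact behavior of PROC DISTANCE), and drop covariates whose missing rate exceeds the 5% default.

```python
import numpy as np

def euclidean_missing(x, y):
    """Euclidean distance between two subjects with possible missing values.

    Uses only covariates observed in both subjects, then rescales the
    squared distance to the full number of covariates (a common convention;
    PROC DISTANCE's handling may differ in detail).
    """
    mask = ~(np.isnan(x) | np.isnan(y))
    if not mask.any():
        return np.nan  # no covariates in common
    d2 = ((x[mask] - y[mask]) ** 2).sum()
    return np.sqrt(d2 * x.size / mask.sum())

def drop_high_missing(X, threshold=0.05):
    """Drop covariate columns whose missing rate exceeds the threshold
    (mirroring the AP's 5% default)."""
    keep = np.isnan(X).mean(axis=0) <= threshold
    return X[:, keep]
```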

Figure 1 summarizes the pairwise distances at each site using box plots. The plot on the left allows us to compare subject-similarity across the study centers. This is an important plot, since we first need to answer the question “How similar is similar?” Of course, a distance of 0 between any two subjects would indicate an exact copy, but as analysts we need to have some idea of what is reasonable to expect between subjects among the different sites. Recall that fraud is likely occurring at one or a small number of sites, so this figure should help us identify any sites that are “different.” The plot on the right summarizes the distribution of the most similar pair of subjects from each site. While we may be motivated to identify duplicated subjects, such an analysis also helps describe the study population that may be available at the clinical site. Small or large boxes would indicate centers with a more homogeneous or heterogeneous population, respectively.
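The quantities behind both box plots can be sketched in a few lines: for each site's distance matrix, collect the upper-triangle pairwise distances and pick out the most similar pair. The function name and the per-site dictionary are assumptions for illustration.

```python
import numpy as np

def site_summaries(distance_by_site):
    """For each site's (n x n) distance matrix, return all pairwise
    distances (upper triangle) and the most similar pair of subjects."""
    out = {}
    for site, D in distance_by_site.items():
        iu = np.triu_indices(D.shape[0], k=1)  # indices of each subject pair
        d = D[iu]
        best = d.argmin()
        out[site] = {
            "pairwise": d,                        # feeds the left box plot
            "min_pair": (iu[0][best], iu[1][best]),
            "min_distance": d[best],              # feeds the right box plot
        }
    return out
```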

Though an analysis is performed for each site, by default JMP Clinical presents only the analysis for the center with the minimum overall between-subject distance (Figure 2). Results from analyses of other sites are available using drill-downs. The plot on the left summarizes the distribution of all pairwise distances for subjects available at Site 44 from a clinical trial of Nicardipine. The heat map and dendrogram summarize the distance matrix and hierarchical clustering analysis for subjects within the site. From the box plot, we can select the minimum pairwise distance and highlight these individuals within the heat map to identify and further explore subjects with whom they may be clustered. Further, using additional features, we can open the data table to examine the covariates of these two subjects (or a larger cluster to which they belong) side by side.
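For readers who want to reproduce a dendrogram like this outside JMP Clinical, SciPy's hierarchical clustering accepts a precomputed distance matrix. The tiny three-subject matrix below is made up for illustration; a real analysis would pass the within-site matrix described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Hypothetical within-site distance matrix (symmetric, zero diagonal):
# subjects 0 and 1 are very alike, subject 2 is not.
D = np.array([[0.0, 0.4, 2.1],
              [0.4, 0.0, 2.3],
              [2.1, 2.3, 0.0]])

# linkage() expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method="average")

# dendrogram(Z, labels=["Subj A", "Subj B", "Subj C"]) would draw the tree;
# the earliest (lowest) merge is the most similar pair of subjects.
```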

These last five weeks, we have explored several ways to identify potentially fraudulent data in a clinical trial. Is clinical trial fraud preventable? Two often-cited and effective approaches are to streamline study entry criteria and to reduce the amount of data collected. Regular clinical trial monitoring may also prevent or limit the occurrence of such activities. However, such extensive on-site review is time-consuming, expensive, and, as is true for any manual effort, limited in scope and prone to error. Recent literature suggests risk-based source data verification as a means to reduce costs and enhance efficiency. Particularly in these cases, extensive computerized logic and validation checks early in the clinical trial will not only help ensure data quality, but can also be used to identify potentially fraudulent activities.