Truly efficient clinical reviews – it’s all about the keys

In last week’s post, we discussed some of the upcoming features of JMP Clinical 4.1 that identify new and modified records when clinical trial data is updated. These tools can greatly accelerate clinical reviews, allowing the clinician, statistician or data manager to focus exclusively on unreviewed records. Here we discuss the details of how JMP Clinical makes this possible.

Understanding the discussion below will go a long way toward ensuring that your JMP Clinical review experience is as efficient and pleasant as possible. The presentation may seem a bit technical, but in practice the team responsible for generating the SDTM or ADaM data sets needs to add only a single processing step (if they aren’t performing it already). The rest of us can remain blissfully unaware of the additional steps.

In short, it’s all about the keys.

Keys typically give us access to something, whether it’s a locked box or a room. When discussing data sets, keys give us insight into the uniqueness of a record or data set row. The CDISC SDTM Implementation Guide defines the following terms:

  • Natural Key  is one or more variables whose contents uniquely distinguish every record (row) in the data set. For example, each row of the DM domain should represent a different subject. The natural keys in this instance could be Study Identifier (STUDYID) and Unique Subject Identifier (USUBJID).
  • Surrogate Key is an artificially established single-variable identifier that uniquely identifies rows. This could include any of the xxSEQ variables. For example, if the vital signs data set contained 200 records, the VSSEQ variable could be numbered 1 to 200 to uniquely identify the rows.
  • Alternatively, xxSEQ can be made part of a natural key so that xxSEQ counts from 1 to n_i, where n_i is the total number of records for subject i. Here the keys would be STUDYID, USUBJID and xxSEQ.
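
To make these definitions concrete, here is a minimal SAS sketch that checks whether a candidate set of natural keys identifies every row uniquely. It assumes a DM data set in the WORK library; the output table name dm_key_check is illustrative only.

```sas
/* List any key combinations that occur on more than one row.   */
/* Assumes WORK.DM exists; dm_key_check is a hypothetical name.  */
proc sql;
  create table dm_key_check as
  select STUDYID, USUBJID, count(*) as N_ROWS
  from DM
  group by STUDYID, USUBJID
  having count(*) > 1;
quit;
```

An empty dm_key_check table suggests that STUDYID and USUBJID work as natural keys for this domain.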

    So why is this important? Well, in order to examine a record (row) for differences between two snapshots, there needs to be a way to link these two versions of the record together. This is where the keys come in. Otherwise, JMP Clinical has no way to know which records to match together. Further, in order to save or access notes for a particular record, there needs to be a way to file the note away so that it is accessible later when returning to the record. Again, this is where the keys come in.
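
As a rough illustration of why keys matter for matching, the sketch below links two snapshots of the same domain by their keys using ordinary SAS steps. This is not JMP Clinical's internal logic. The librefs snap1 and snap2 and the work data set names are assumptions made for the example, and the keys are assumed to be unique within each snapshot.

```sas
/* Illustration only. SNAP1.DM and SNAP2.DM are hypothetical      */
/* snapshots of one domain, each with unique STUDYID/USUBJID keys. */
proc sort data=snap1.DM out=dm_old; by STUDYID USUBJID; run;
proc sort data=snap2.DM out=dm_new; by STUDYID USUBJID; run;

/* Classify each key by whether it appears in one or both snapshots. */
data key_status;
  merge dm_old(in=in_old keep=STUDYID USUBJID)
        dm_new(in=in_new keep=STUDYID USUBJID);
  by STUDYID USUBJID;
  length STATUS $8;
  if in_new and not in_old then STATUS = 'New';
  else if in_old and not in_new then STATUS = 'Dropped';
  else STATUS = 'Matched';
run;

/* For keys present in both, PROC COMPARE reports value differences. */
proc compare base=dm_old compare=dm_new;
  id STUDYID USUBJID;
run;
```

Without the BY and ID variables, neither step would have any way to decide which row in the newer snapshot corresponds to which row in the older one.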

    So how can a user provide JMP Clinical with the keys to all of the data domains for a study? This is actually quite simple. If you’ve ever used PROC CONTENTS, you know that the output header contains various information about the data set. One of these pieces of information is “SORTED:          YES/NO”. If the data set is sorted (i.e., YES), then additional information is provided in the PROC CONTENTS output after the description of the data set variables. For example, when I use PROC CONTENTS on the DM domain for Nicardipine, I get the following row in the output:  “Sortedby       STUDYID USUBJID”.  This metadata is stored in the SAS-formatted data set; the variables used to sort the data set are what JMP Clinical uses to define the keys for a study.
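
A minimal way to inspect this metadata is sketched below. The libref out is borrowed from the sort example in this post and is an assumption about your setup. The second step relies on the SORTEDBY column of DICTIONARY.COLUMNS, which, as I understand it, holds each variable's position in the sort order; treat that query as a sketch to verify on your own installation.

```sas
/* Print the header and sort information for the data set. */
proc contents data=out.DM;
run;

/* List the sort variables in key order (libref OUT is assumed). */
proc sql;
  select name, sortedby
  from dictionary.columns
  where libname = 'OUT' and memname = 'DM' and sortedby > 0
  order by sortedby;
quit;
```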

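    So how can these values be saved to the metadata of a data set? Try either:

    PROC SORT data = DM out = out.DM;
    by STUDYID USUBJID;
    run;

    or

    data out.DM(sortedby = STUDYID USUBJID);
    set DM;
    run;

    The first form sorts the data and records the BY variables as metadata. The second only asserts the sort order, so the data must already be in that order.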

    If the study domains do not have the SORTEDBY metadata associated with the data sets, JMP Clinical attempts to derive the keys based on suggestions provided in the SDTM Implementation Guide. However, the keys generated may not be the optimal set for a given domain.

    So what happens if the supplied keys do not define the records (rows) uniquely? When the study is first added to JMP Clinical, a duplicate report is provided for each affected domain that details the records (rows) that cannot be uniquely determined. These records (rows) will still be labeled as New in JMP Clinical. However, any record-level notes that are system- or user-generated would be associated with two or more records. This may be OK if there are few duplicates to contend with, but any duplicates should be reviewed as potential data errors (data that was mistakenly entered twice). When the study data is updated and redundancies remain, JMP Clinical has no way to match these records. In other words, it cannot assess whether any changes were made to the records or not. Again, if there are few duplicates, these records (rows) can be reviewed at multiple snapshots for correctness.

    Some other important tips:

    (1)  When you first add a study, examine the duplicate report. Identify the keys for each domain and make sure any duplicates are kept to a minimum (ideally, not present). Otherwise, the reviewing functionality will not be as useful as it ultimately could be. For example, if the vital signs (VS) domain was sorted only by STUDYID and USUBJID using the PROC SORT code above, all records for the subject would be considered duplications. This would include multiple tests (such as heart rate, systolic and diastolic blood pressures) or records belonging to different visits.
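
For the vital signs case, a more useful sort might look like the sketch below. The BY variables are an assumption for illustration (test code and visit number); confirm the appropriate keys for your VS data against the SDTM Implementation Guide, and add timing variables if several measurements are taken per visit. The data set names vs_unique and vs_dups are hypothetical.

```sas
/* Sort VS with keys that separate tests and visits (assumed keys). */
proc sort data=VS out=out.VS;
  by STUDYID USUBJID VSTESTCD VISITNUM;
run;

/* Check the candidate keys without altering out.VS:              */
/* rows that repeat an earlier row's key are written to vs_dups.  */
proc sort data=out.VS out=vs_unique nodupkey dupout=vs_dups;
  by STUDYID USUBJID VSTESTCD VISITNUM;
run;
```

If vs_dups is empty, the chosen keys identify each VS record uniquely and the duplicate report for this domain should be clean.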

    (2)  If you perform (1) and there are numerous duplications for all domains, remove the study from JMP Clinical and re-add it once more appropriately defined keys have been applied to the data sets. It’s important to get this step correct before the study is updated to new snapshots or record-level notes are generated.

    (3)  Try to choose the smallest number of variables possible to define the keys, and choose variables that are not likely to change values. If a record has a change in one of the variables that makes up the keys, there would be no way to match the record to previous versions of the record. However, since all records have Unique Subject Identifier (USUBJID), it is possible to view all notes at the subject level. Use the CDISC Implementation Guide for recommendations.

    (4)  Given (3), I would not use terms that rely on medical coding as part of the keys (e.g., AEDECOD based on MedDRA or CMDECOD based on WHODRUG). There are two reasons for this. First, medical coding may not be immediately available. This provides an opportunity for a missing value of AEDECOD to change to a non-missing coded term later on. Second, over the course of a study, coded terms may change based on new insights from the clinical team. I would recommend using verbatim terms such as AETERM or CMTRT.

    (5)  The xxSEQ variable or STUDYID, USUBJID and xxSEQ set may be good keys to use since these values are unlikely to change. HOWEVER, the xxSEQ variable must be carefully maintained so that the number never changes for a particular record. For example, suppose a CM data set contains two records:

    CMSEQ    CMTRT        CMSTDTC
    1        ASPIRIN      03-20-1974
    2        IBUPROFEN    03-27-1974

    and is updated through query with a new med that actually falls between the first two based on date:

    CMSEQ    CMTRT        CMSTDTC
    1        ASPIRIN      03-20-1974
    2        IBUPROFEN    03-27-1974
    3        VITAMIN C    03-24-1974

    It is important that any new records are tacked on at the end and continue the CMSEQ sequence. Alternatively, if a record is deleted:

    CMSEQ    CMTRT        CMSTDTC
    2        IBUPROFEN    03-27-1974
    3        VITAMIN C    03-24-1974

    The sequence number must be kept consistent (i.e., 1 can never be used again). If your company tends to define xxSEQ as a straight 1 to N for all records or 1 to n_i for each subject, without regard to which record each row represents, using xxSEQ as a key is not a good choice.
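
One way to keep CMSEQ stable when records are added is sketched below. All data set names (cm_prev, cm_new, max_seq, cm_new_numbered, cm_updated) are hypothetical, and the approach is only one possibility. Note the caveat in the comments: taking the maximum from the prior snapshot is not enough if the highest-numbered record has been deleted.

```sas
/* Hypothetical inputs:                                              */
/*   cm_prev : prior snapshot, CMSEQ already assigned                */
/*   cm_new  : records added since then, CMSEQ not yet assigned      */
/* Caveat: this uses the highest CMSEQ present in cm_prev. If the    */
/* highest-numbered record was deleted, that number would be reused, */
/* so a real process should track the highest value ever assigned.   */
proc sql;
  create table max_seq as
  select STUDYID, USUBJID, max(CMSEQ) as MAXSEQ
  from cm_prev
  group by STUDYID, USUBJID
  order by STUDYID, USUBJID;
quit;

proc sort data=cm_new; by STUDYID USUBJID; run;

data cm_new_numbered;
  merge cm_new(in=in_new) max_seq;
  by STUDYID USUBJID;
  if in_new;                        /* keep only the new records */
  if first.USUBJID then _next = coalesce(MAXSEQ, 0);
  _next + 1;                        /* sum statement: _next is retained */
  CMSEQ = _next;
  drop MAXSEQ _next;
run;

/* Existing rows keep their CMSEQ; new rows are appended at the end. */
data cm_updated;
  set cm_prev cm_new_numbered;
run;
```

Applied to the example above, ASPIRIN and IBUPROFEN keep CMSEQ 1 and 2, and VITAMIN C receives 3 regardless of its start date.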

    (6)  Alternatively, a single non-CDISC variable can be included in each domain and added to the SORTEDBY metadata. A good choice may include a record-identifier variable output from any data management system.

    Next week, we examine an example in detail.
