cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
rcarver
Level III
What shape is a data table?

Reshaping data is one invaluable stage in the process of preparing a data table for further analysis and modeling. For most analyses, we organize data into a two-dimensional tabular format in which each row is an individual observation or case, and each column contains a measurement or attribute pertaining to that individual. There are times, though, when this is not the only option or even the preferred option.

Consider a simple pilot study for a controlled experiment. The experiment investigates the effect of a training activity intended to increase a subject’s score in performing a task. Suppose that the experiment is designed so that each subject in the study completes a pre- and post-test, and the scores are recorded. There are eight pilot subjects identified as individuals A through H, four of whom are randomly assigned to use the activity. The other four are the control group.

After data collection, we’ll have two scores for each person, and we’ll know which individuals were in each group. What are our choices for placing the data into a single matrix or data table?

Wide Versus Long Format

If we follow the convention of allocating one row per experimental subject, our data table might look like this:

Table 1: Wide Array of Artificial Experimental Data

Subject

Group

Pre

Post

A

Exp

51

48

B

Ctrl

53

42

C

Ctrl

48

56

D

Exp

44

49

E

Exp

47

59

F

Ctrl

42

43

G

Ctrl

52

39

H

Exp

48

49

 

This format is often referred to as “wide”: All of the attributes and measurements of a given individual are spread across several columns. In this example, we have a matrix of eight rows and four columns with minimal redundancy.

An alternative to the wide format is to rearrange the same data in a “narrow” or “tall” arrangement. This introduces some redundancy in the Subject, Group, and Test columns in exchange for treating the test (pre- or post-) as a categorical variable and treating all of the scores as a single numeric variable.

 

Table 2: Long Arrangement of the Same Data

Subject

Group

Test

Score

A

Exp

Pre

51

A

Exp

Post

48

B

Ctrl

Pre

53

B

Ctrl

Post

42

C

Ctrl

Pre

48

C

Ctrl

Post

56

D

Exp

Pre

44

D

Exp

Post

49

E

Exp

Pre

47

E

Exp

Post

59

F

Ctrl

Pre

42

F

Ctrl

Post

43

G

Ctrl

Pre

52

G

Ctrl

Post

39

H

Exp

Pre

48

H

Exp

Post

49

 

Ultimately, the decision about the preferred shape of a data table should be driven by the logic of your analytical goals. In designing an experiment or data collection from scratch, you should consider the analytical goals from the outset. Frequently, though, the layout of a data table was chosen at an earlier time by others, and the job of the analyst is to render the data into a format that suits the current purpose.

Reasons for Wide and Long Formats

Perhaps the most common reason for arranging data into a wide format is that we are working with longitudinal or repeated measures data. Some analysis platforms might require that each repeated measurement be treated as a variable. (See, for example, Grace-Martin 2015, or SAS Institute 2016a.) For some procedures, JMP can model repeated measures in either wide or long format, as discussed in SAS Institute 2016b. Because repeated measures of the same subject or observational unit are likely to be correlated with one another, they are better treated as multiple variables rather than a single variable.

When the analytical goal involves explicitly viewing time as a factor or predictor, as in time series modeling or visualization, then a long format might be more suitable. For example, a data table should be long to use the Time dimension on the JMP Bubble Plot platform.

The next two sections illustrate the JMP method for converting wide to narrow format and vice versa. The initial illustration uses this artificial data set only to present the method simply before moving on to more complex illustrations.

Stacking Wide Data

In this introductory example, the initial state of the data is wide with one row per subject and the two measurements occupying separate columns. In other words, the two measurements are being treated as two variables. Figure 1 shows the initial state of the data table. Because the design of this hypothetical study uses matched pairs, we might move directly to analysis with the data table.

For other analytical purposes, it could be helpful to reshape the table to the narrow format. We use the Stack command in the Tables menu.


Figure 1: Example Data Table in Wide FormatFigure 1: Example Data Table in Wide Format

  1. Select Tables > Stack to open the dialog box shown in Figure 2.
  2. The goal is to create a new column that represents the test administration as a categorical variable (pre- and post-) and stacks all of the scores into a single column. To do so, we select the two current columns (Pre and Post), and click Stack Columns.

By default, JMP will name the new quantitative column Data and the categorical identifier (that is, the original column names) as Label. It’s probably better to select more meaningful variable titles.

  1. In the lower right of the dialog, specify a descriptive column title for the scores and for the experimental phase.
  2. Complete the dialog as shown in Figure 2 and click OK.


Figure 2: The Stack DialogFigure 2: The Stack Dialog

A new data table opens with the 16 measurements represented in narrow (long) format. Your data table should now look like Figure 3.

 

Figure 3: The Experimental Data in Narrow, or Stacked, FormatFigure 3: The Experimental Data in Narrow, or Stacked, Format

Unstacking Narrow Data

Suppose that the data were initially been recorded in the narrow format (similar to Table 2 and Figure 3), but we want to reorganize it in the wide format. To continue with the example and to reverse the process, we’ll use the Split platform in the Tables menu. The dialog appears in Figure 4. Here we need to specify the column that identifies the basis for unstacking (in this case, the Test column) as well as the column containing the data that will populate the multiple new columns.

By default, JMP will drop all columns except those chosen in the split specification. For this example, it will make sense to retain the information about the subject and the experimental group.

  1. So click the button next to Keep All under Remaining columns as shown in Figure 4. The dialog also provides the option to rename the new wide table.
  2. Complete the dialog as shown and click Create.

Figure 4: Split Table dialogFigure 4: Split Table dialogThe results of this command appear in Figure 5. This nearly brings us full circle, with the one exception that JMP alphabetizes the order of the split columns. So, in contrast to Figure 1, the post-test data is now the third rather than the fourth column. You would need to note this in using the Matched Pairs platform for a pair-sample t-test or confidence interval.  Since JMP 13, the Matched Pairs platform is among the Analyze > Special Modeling options. Veteran JMP users have previously found it in the Analyze menu.

 

 

Figure 5: The Experimental Data after Split to Wide FormatFigure 5: The Experimental Data after Split to Wide Format

 

The content of this post is an excerpt of Chapter 7 of my book, Preparing Data for Analysis with JMP, where you will find more techniques for working with tables, and many more data cleaning tips.

References

Last Modified: Jun 26, 2018 9:59 AM