cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
Analyse Columns
jthi
Super User

 

Description

Analyse columns is a tool which will perform fairly quick pre-determined summary statistics (explained below) for your discrete (nominal/ordinal) and continuous data. It also allows you to perform some quick launch tasks directly from the Analyse Column's window such as delete columns, create scatterplot matrix and add largest nines to missing value codes. 

 

This add-in uses Summary table to calculate most of discrete summary statistics and Distribution platform to calculate continuous summary statistics. These were chosen for their ease of use while scripting and speed.

 

Examples below are from using Analyse Columns on all columns of JMP's sample data table Probe.

 

User Interface

jthi_0-1664298063775.png

Launch window which allows user to select columns and sampling rate.

 

jthi_1-1664298087961.png

Progress bar with functional Cancel button

 

jthi_6-1664299256461.png

Main window with Side Panel open.

 

UI Main window explanation:

1. Same side panel that can be seen in data tables, this can be closed from the triangle on top left corner.

2. Short information about the analyzed data table (rows, column, analyze duration). Clicking on the table name will bring it to front

3. Summary statistics table for Nominal / Ordinal columns

4. Summary statistics for Continuous columns. This also has horizontal scroll bars so remember to scroll to left and right

5. Action buttons

6. Tab pages to change between analysis window and view of the original data table.

 

 

Summary Statistics - Discrete (Nominal/Ordinal)

See Using JMP > Summarize Your Data > Explanation of Summary Statistics for most of the summary statistics.

  • Column - Name of the analysed column
  • Type - Type of the column. First character indicates data type (C - Character, N - Numeric) and second modeling type (N - Nominal, O - Ordinal)
  • N Prop - Count of column properties
  • N - Count of non-missing values
  • N Distinct - Count of distinct values (N categories)
  • N Unique - Count of values which have only one value
  • % Unique - Percentage of columns which are unique
  • N Missing - Count of missing values
  • % Missing - Percentage of missing values
  • First - First value in alphanumeric order
  • N First - Count of first values
  • Last - Last value in alphanumeric order
  • N Last - Count of last values
  • Mode - Most common value
  • N Mode - Count of most common values

 

 

Summary Statistics - Continuous

See Basic Analysis > Distributions > The Distribution Report > The Summary Statistics Report for more detailed explanation of most of the statistics.

  • Column - Name of the analysed column
  • Type - Type of the column. First character indicates data type (N - Numeric) and second modeling type (C - Continuous)
  • N Prop - Count of column properties
  • % Int - Percentage of integer values
  • N - Count of non-missing values
  • N Distinct - Count of distinct values (N categories)
  • N Missing - Count of missing values
  • % Missing - Percentage of missing values
  • N Zero - Count of zero (0) values
  • % Zero - Percentage of zero values
  • Nines - Numeric value of highest nines
  • N Nines - Count of highest nines
  • Mode - Most common value
  • N Mode - Count of most common value
  • Min - Minimum value
  • N Min - Count of minimum values
  • Max - Maximum value
  • N Max - Count of maximum values
  • Median - Median value
  • Mean - Mean value
  • Std Dev - Standard deviation
  • N Pat - Count of robust outliers. Calculated as 6sigma outliers using median as mean and IQR/1.35 as sigma (AEC - Q001 Rev-D)
  • MAD - Median Absolute Deviation
  • IQR - Interquartile range
  • Kurtosis - Measures peakedness or heaviness of tails.
  • Skewness - Measures sidedness or symmetry.
  • Nonparam Skew - Non-parametric skewness. (Mean - Median) / Std Dev. More info Nonparemetric Skew (wikipedia)
  • Autocorrelation - First Order autocorrelation that tests if the residuals are correlated across the rows.
  • Best Fit - By default this is empty. When Best Fit button is pressed, this will show the best fit of data from Normal, Normal 2 Mixture, Gamma, Weibull and Exponential distributions.

 

Action Buttons

Actions are performed mostly on selected columns. If platform is launched, it will be pre-filled with selected columns.

 

  1. Prev - Go to previous selected column
  2. Next - Go to next selected column
  3. Distribution - Launch distribution with 5 graphs for each row
  4. Best Fit - Fits distributions to the columns and selects best fit. Be careful this and use it only for quick EDA, as the fit this provides is most likely not the correct choice.
  5. Explore Outliers - Launch Explore Outliers platform
  6. Missing Data Pattern - Launch Missing Data Pattern platform
  7. Explore Missing - Launch Explore Missing Data Patterns platform
  8. Scatterplot Matrix - Launch Multivariate and Correlation platform
  9. MDMvCC - Launch Model DrivenMultivariate Control Chart
  10. Time Series - Creates Time Series plots for selected columns. Uses Row() ordering
  11. Set Nines Missing - Sets highest nines to Missing Value Codes (can be repeated)
  12. Cast Role - Casts selected columns to a role
  13. Standardize Attributes - Launches Standardize Attributes
  14. Group Columns - Groups selected columns
  15. Subset Columns - Creates a subset with only selected columns
  16. Hide&Exclude Cols - Hides and Excludes selected columns
  17. Clear Selection - Clears column selections
  18. Delete Columns - Deletes selected columns
  19. Select Selected - Selects same columns from Analyse Columns as have been selected from the original data table.
  20. Refresh Selected - Re-calculates summary statistics for selected columns
  21. Refresh All - Re-calculates summary statistics for same columns as chosen by user during launch
  22. Help - Opens this page
  23. Exit - Closes application

                                                                                                                                                                

 jthi_0-1664633376817.png

 


 

 

When Cast Role is pressed following window will open:

jthi_5-1664298636760.png

 

Inspiration

This add-in was inspired by the need to quickly get overview of new large data sets (Pandas Profiling is one such existing library).

 

 

Example of exploratory analysis with analyse columns using Probe.jmp

View more...

Run Analyse columns for all columns.

Quickly check if there column which have most values missing

jthi_0-1664693755910.png

or if values are mostly the same (you can re-order by clicking on header)

jthi_1-1664693774881.png

Select some of the columns which have most of the values same and use distribution and subset to check what they look like

jthi_2-1664693841563.png

jthi_3-1664693867380.png

As values are mostly same, they might not be that useful in further analysis. For demo purposes, we will use Hide&Exclude to remove these from analysis.

Next quick check could be to see, if there are some Continuous columns which should be possibly recoded as Nominal or Ordinal. Again Distribution and Subsets are good quick tools for this (looking for example for version numbers, id numbers and such). These values don't seem to be such values

jthi_4-1664693988554.png

Next we will check if there are possibly nines used instead of missing values and these seems to be quite a few columns like that

jthi_6-1664694139481.png

Analyse Columns will look for highest absolute nines and use those as Nines, it won't drop then based on quantiles or such. Some of those seem to have quite interesting situation where there are values larger than Nines. Again, we use distribution and subset to explore them in more detail

Distributions seem quite quite fine:

jthi_7-1664694238912.png

Next we create subset with those columns and take a closer look. These 9999 rows and missing values seem quite suspicion to me.

jthi_8-1664694324427.png

For example column 30N1_4X20_HFEPEAK*VA10U has 55 9999 values, doesn't feel completely normal to me. Let's create summary table of that column and order by N Rows.

jthi_9-1664694428168.png

This would require more knowledge of the process, but if I had to make a guess these are failed measurements / missing values, even though they are not even close to the largest values in the column

jthi_10-1664694531173.png

For demo purposes we conclude that those are missing values and use Set Nines Missing to exclude them WITHOUT losing data. After we have used Set Nines Missing, we should use Refresh Selected to refresh summary statistics calculations for those columns

Before:

jthi_12-1664694596898.png

After:

jthi_13-1664694616706.png

Next we could take a look at first order autocorrelation to see if the data isn't random and has some "row based" dependencies. There are quite a few columns with high autocorrelation, Start Time being obvious. We select some of the high autocorrelation columns and use Time Series to see what is going on

jthi_14-1664694723458.png

Seems like that there could be some dependencies which is caused by time.

 

There are still quite a few checks we could do, such as looking for correlations (we should clean outliers first for example with explore outliers) or use Model Driven Multivariate Control Charts to look for interesting patterns but I think we have enough to demonstrate what can be done with Analyse columns for now.

 

 

Possible future work

  • Add function to create report similar to Pandas Profiling
  • Allow user to choose which summary statistics to calculate
  • Allow user to choose which summary statistics to show and save these settings
  • Improve speed of calculations
  • Allow calculation of capability statistics if continuous column has specification limits
  • Save calculated statistics to data table as table variable / table script, to allow quick re-run of the tool. Multiple of these should be saved, so user can keep "change" log of summary statistics and/or demonstrate effect of sampling on summary statistics
  • Add some sort of an option to "tag" columns with notes. This will require most likely some sort of new window to write the notes and to retrieve then as custom column property is best for this (Notes would be best, but user cannot see if those have been set to column).

 

Change Log

24.12.2022 - Removed company logo from UI

Comments
jthi

Just noticed there is a bug with Continuous Table Box and the values are being shifted by one when using JMP17. Most likely the platforms I'm using for summary statistics have changed from JMP16 -> JMP17. When I have time I'll take a look and try to fix the issue.