Accessing & managing data for statistical discovery
Nov 2, 2011 9:30 AM
In a previous blog article, a high-level process for statistical discovery was presented. Now we go much deeper into the first two steps: accessing and manipulating data with JMP.
The standard JMP data format is the JMP data table. It is a high-performance in-memory data structure that can handle data, metadata and even programming code in just one file.
In order to start the analysis of our data, you first need to get your data into a JMP table. Your source data can be present in a number of different systems or sources, and JMP provides you with the capabilities to import external data.
Getting Excel data into JMP
One of the most typical source files is Microsoft Excel file format. Typically, Excel files are in a standard data format:
• Row 1 - for column names
• Rows from 2 to the last - for data
If your Excel files are in this format, importing into JMP is as easy as dragging and dropping the file from an Explorer folder to the JMP home window. (If you are a more traditional user, you can still do File -> Open from your JMP main menu or JMP Starter). If your Excel file contains many folders, depending on JMP Preferences you will import them all together or you will be prompted to choose one of them. In any case, JMP will automatically recognise data and will put numbers into numbers, characters into characters and Excel date/time into JMP date/time variables.
If your data is not shaped as above, you can get it in the required shape within Excel or simply copy the required cells and paste them into a new JMP table.
Accessing other types of source files
Another common source file is a text file, as CSV or TXT file. My favourite way to import these is with File -> Open and Open as Data with Preview: You will launch an easy two-step wizard in which you will be able to select column names, type, formats, delimiters and so on.
There are JMP users that are SAS users too, and have a lot of datasets to analyse. The quickest way is to File -> Open them (however, I like to drag and drop them as well). If your company has a SAS server too, you can still do as above, but the best way is to use File -> SAS menu: if you haven’t previously connected to your SAS server you will be prompted to identify your SAS server, then via the SAS server connection you can select a dataset, get a preview of the data and select which rows and columns to import. Moreover, everything the SAS administrator has defined for the table is inherited, including automatic conversion of SAS formats into JMP value labels.
With File -> Open, other files can be accessed too, such as SHP file for maps or DBF. The available list depends on your installation.
If your data is in a relational database format, such as Microsoft Access, MySQL, Oracle, SQL Server, or DB2, by creating an ODBC data source, you can access relational databases by using File -> ODBC. As with SAS, you will be asked to select the table, rows and columns you need (If you are familiar with SQL, you can write your SQL Pass-Thru code too, or you can have someone else write SQL queries for you to access).
A method in JMP for accessing data from the Web is File -> Internet Open -> Open as Data. If the Web page you reference contains one or more HTML tables, JMP will recognise them and import into JMP with correctly formatted columns: You can try this technique, for example, with the Wikipedia page listing US states and territories by area.
How about when you do not have data at all and you have to manually insert them? If they are casual data, you click on File -> New and you can create data from scratch by using JMP as a “spreadsheet.” Or, if your new data are coming from an experiment designed with one of the several JMP design of experiments platforms, you will get a template with chosen factors and you will insert only the experimental results.
Next step: managing data
Once you get the data in JMP, it may not be the case that you can immediately start to analyse it: Remember that you need a flat table with observations as rows and all needed variables as columns. You may need to make some steps in terms of rearranging, integrating, selecting, deriving data to reach that goal. That’s what all platforms in Tables menu, and some in Rows and Columns menus, do.
The Tables menu offers a very intuitive toolbox of things you can do with a JMP table. Clear statements and nice accompanying icons remind you of the purpose of each function. Summary computes statistical summaries of one or more columns with the option of grouping by one or more different columns. Subset allows you to select part of the rows based on some conditions, and you can Sort observations according to ascending or descending values of one or more columns.
Favourite platforms for data preparation
For our topic of preparing data for the analysis, my favourite platforms in this menu are Stack and Split. They put your data in a right orientation for analysis –that is, a row represents a single observational unit with all related measures (variables) on it. You can Stack values from several columns into several rows in one column, and you can Split a column, mapping several rows on one column to one row in several columns.
The Transpose platform helps you to correctly shape your data if raw file variables and observations are switched. It is a one-click knob to switch them back to the correct order.
What if my data is in many places?
Many times the information you need is scattered among many tables. If the observations are stored in different tables with the same format produced at different times, you can easily Concatenate all of them into a single table.
If the additional information you need is stored in another table, and in this case you need to add columns to your first table, you can Join the two tables by matching the values of one or more ID fields, selecting then the columns and rows you need from both. The Update platform is just a special case of the latter and again does what its name suggests.
Skipping the Tabulate platform, that for its pivot-table nature is rather a reporting than a data management tool, the Missing Data Pattern gives you a nice diagnostic of how thick or sparse your data is in terms of missing values. The tree map graph automatically defined in the resulting data table is self-explanatory.
Other data management tools
Other powerful data management tools are available from the Rows and especially the Columns menu. Just to mention two of the most commonly used, you can check your data for wrongly entered values using Validation or Recode from the Columns menu.
Last but not least, there is the Formula Editor available from Column Properties. With the Formula editor, you derive information by creating a new column based on existing columns values linked by a Formula. You can have simple formulas or very complex ones, based on the wide range of functions (numeric, character, statistical, etc.) present in the JMP Scripting Language. But do not be scared: The beauty of Formula Editor is that it is powerful yet easy to use, since as always in JMP you have an interactive drag-and-drop intuitive interface to create all kind of formulas.
That concludes the data access and management story. Now you are ready to do your analysis with JMP!