Subscribe Bookmark



May 27, 2014

Accessing data at scale from databases

tease_tabsel_resizedMany JMP users get their data from databases.

A few releases ago, we introduced an interactive wizard import dialog to make it easier to import from text files. In a subsequent release, we created a feature that lets you import Web page tables into JMP data tables. In JMP 11, we introduced an interactive wizard to simplify importing data from Excel spreadsheets, covering such functionality as going across tabs and stacking groups of columns. But there was one area that we had yet to improve: making it easier to import data from databases – until now.

So what are the problems to solve? Data from databases usually means an abundance of data from many different tables. To get at the data you want, you need to join and filter, something that requires too much knowledge about the data and too much expertise with the SQL language.

The amazing thing about how a database stores data is how regularized it is, so that the same data is not duplicated across multiple tables. There are three huge advantages to this:

1. Since data is not duplicated, there is less wasted storage space.

2. Updating a field is a matter of updating one place instead of many.

3. Data is naturally more consistent.

Databases are organized for efficient storage and efficient transactions, but not usually for efficient analysis access.

Often the data you need to analyze is scattered across multiple tables. You have to join them to obtain what you need, which can make for significant work.

You could just import all the data from each table, and then join them inside JMP, but there are two reasons why this isn’t the best method. First, you usually do not need all the data from each table, so you waste time and space when you import more than you need. Second, it is a burden to specify how to join the tables together.

Alternatively, you could use SQL to specify a join and then select the rows and columns to import. While this makes for more efficient use of the hardware, it isn’t always easy since you have to learn SQL, become familiar with the details of each table, and write and debug the SQL code to send to the database.

But JMP can help you do all of these things with ease.

Databases are organized to help with access. Each table usually has a primary key to uniquely identify each row of the table. The table also has “foreign keys,” which are columns made to match primary keys in other tables — this relates that table to other tables. The typical join is then an “outer join,” matching the foreign key in one table to the primary key in another table. This is a one-to-many (outer) join, since a foreign key may not be unique but a primary key is unique to its table.

In JMP, the new Query Builder dialog is organized to make this process easy. First, you specify the primary table, the one whose rows make up the unit of analysis. This primary table will usually have columns designated as foreign keys. Simply specify some secondary tables, and suddenly, the join becomes obvious and automatic.

Next, you need to select the variables and rows to keep. Selecting the variables is easy in Query Builder — select them from a list that shows the table from which each variable comes, along with the data type. Selecting rows involves specifying filters, but this is easy, too. Query Builder retrieves all the categories for categorical variables and the ranges for continuous variables, and then presents controls so you can pick just the categories and ranges you want.

But it gets even better, since Query Builder goes several steps further. You can preview any table or prospective join output right in the dialog window. You can also examine the resulting SQL code, plus you can save the query to a file so you can use it again to get the data. When you save the query, you can specify that some of the filtering conditions be set to prompt you, so that the next time you run the query, you can change the categories or ranges you specify to filter the data.

You can read about Query Builder in this blog post from Eric Hill, the lead developer of the facility.

We live in the age of big data, and most of this data is stored in databases. Making it easy to get this data into an analysis helps turn “big data” into “big statistics.”

Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.