JMPer Cable

Mark_Bailey · Jul 24, 2018 09:40 AM

Text data presents special challenges to the data analyst. A new training course helps to guide text exploration and analysis.

Text is a common form of data, but it presents unusual challenges to the data analyst.

First, text comes in many file formats, and the characters in a text file may use any one of many different encodings. The individual files must be imported into a single JMP data table for processing and analysis.
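To make the encoding problem concrete, here is a minimal sketch in Python (not JSL, and not how JMP imports files). It assumes each document arrives as raw bytes and that the likely encodings are UTF-8 and Windows-1252; the function names decode_document and build_corpus and the sample documents are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def decode_document(
    raw: bytes, encodings: Sequence[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it cannot fail,
    # although the resulting text may be wrong.
    return raw.decode("latin-1"), "latin-1"


def build_corpus(raw_documents: Iterable[bytes]) -> List[str]:
    """Collect all decoded documents into one list, one document per entry."""
    return [decode_document(raw)[0] for raw in raw_documents]


# Hypothetical inputs: the same kind of text saved with two different encodings.
raw_documents = [
    "Café service was slow".encode("utf-8"),
    "Café service was quick".encode("cp1252"),
]
corpus = build_corpus(raw_documents)
```

The order of the candidate encodings matters: UTF-8 is strict enough to reject most text saved in another encoding, so it is tried first. Guessing is never fully reliable, so knowing the source encoding is always better.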

Second, text is unstructured data. It does not arrive in the form of measurements (continuous variables), counts, or categories. A single value of text data is called a document. This value might be a single sentence, a few paragraphs, or an entire book. The value might be a text or email message, a blog post, a restaurant review, or a field warranty report, to name just a few of the possibilities. A sample of documents is called a corpus. JMP stores the entire corpus in a single data column with the character data type and the unstructured text modeling type. Each document occupies one row in the data column.

Third, text is heterogeneous! One document might contain many kinds of information such as the names of people, places, products, failures, or features; telephone numbers; dates and times; ages; gender; currency; social security numbers; product numbers; or anything else. The information content might not be consistent from one document to the next. Much of the content might not be informative or useful at all. How is the information found? How do we decide whether something is information at all?

Fourth, text is messy! Structured data, by comparison, may be easily recoded, transformed, or binned. How is a document recoded? How are acronyms, abbreviations, synonyms, inconsistent punctuation and capitalization, and spelling errors to be handled?
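As a rough illustration of what recoding a document can mean, here is a minimal sketch in Python. It is not how Text Explorer works internally. It assumes a hand-built recode map for abbreviations, synonyms, and misspellings; the map entries and the names RECODES, tokenize, and recode are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical recode map: variant spelling or abbreviation -> preferred term.
RECODES: Dict[str, str] = {
    "a/c": "ac",
    "aircon": "ac",
    "tech": "technician",
    "recieved": "received",
}


def tokenize(document: str) -> List[str]:
    """Lowercase the text and split it into terms, dropping punctuation.

    A slash or apostrophe inside a term is kept, so "A/C" stays one term.
    """
    return re.findall(r"[a-z0-9]+(?:[/'][a-z0-9]+)*", document.lower())


def recode(document: str, recodes: Dict[str, str] = RECODES) -> List[str]:
    """Replace each term that has an entry in the recode map."""
    return [recodes.get(term, term) for term in tokenize(document)]


terms = recode("The A/C stopped; Tech recieved the unit.")
```

Lowercasing and stripping punctuation handle inconsistent capitalization and punctuation automatically. The recode map handles everything else, and building it requires someone who knows the subject matter.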

We have developed new training about exploring and analyzing text data with Text Explorer in JMP Pro 14. (Note that the first half of the course requires only JMP 14.)

Much of the work in the analysis of text data is about curating the list of terms. Accordingly, much of the training time is focused on a workflow that begins with the raw corpus and ends with the curated term list. This list becomes the heart of subsequent analyses, so the quality of the list is important. When the list is finished, it may be exported to the original data table or to a new data table in the form of new, structured variables. These new variables may be explored on their own or in combination with other structured variables with any of the other JMP platforms.
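The idea of turning a curated term list into structured variables can be sketched in a few lines of Python. This is an illustration only, not JMP's export. It assumes terms are single words separated by white space and that each new variable is a simple count; the corpus, the term list, and the name document_term_table are hypothetical.

```python
from typing import Dict, List, Sequence


def document_term_table(
    corpus: Sequence[str], terms: Sequence[str]
) -> List[Dict[str, int]]:
    """Return one row per document and one count column per curated term."""
    rows = []
    for document in corpus:
        tokens = document.lower().split()
        rows.append({term: tokens.count(term) for term in terms})
    return rows


# Hypothetical corpus and curated term list.
corpus = [
    "pump leak at seal",
    "seal worn and pump noisy",
    "noisy fan",
]
curated_terms = ["pump", "seal", "noisy"]

table = document_term_table(corpus, curated_terms)
for row in table:
    print(row)
```

Each row is now an ordinary record of numeric variables, so it can be combined with any other structured variables collected for the same documents. Other weightings, such as a 0/1 indicator for presence, would work the same way.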

Text Explorer in JMP Pro provides three powerful, multivariate methods for extracting more information from the list of terms. Latent Class Analysis discovers clusters of documents that share a common pattern of terms. Latent Semantic Analysis discovers implicit topics in the corpus. Discriminant Analysis classifies documents into the categories defined by another variable.

Do you have text data waiting to be analyzed? Do you want help applying best practices with Text Explorer? The new course will premiere as preconference training at the next JMP Discovery Summit on Oct. 22-23, 2018, in Cary, NC, at SAS headquarters. See you there!