cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Choose Language Hide Translation Bar
markschahl
Level V

How to efficiently parse structured text strings of varying number of elements

Happy New Year Everyone!

 

I started exploring this dataset from Kaggle: https://www.kaggle.com/chicago/chi-restaurant-inspections. Each row is a restaurant inspection, with a single column for list of violations. Thankfully, the violations are delimited by "|". Sample below. The number of violations varies.

 

40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: STEM THERMOMETER NEEDED TO MONITOR FOOD TEMPERATURES. INSTRUCTED TO PROVIDE AND MAINTAIN. | 41. PREMISES MAINTAINED FREE OF LITTER, UNNECESSARY ARTICLES, CLEANING  EQUIPMENT PROPERLY STORED - Comments: OBSERVED A FEW ITEMS NOT ELEVATED 6INCHES OFF FLOOR IN REAR PREP AREA. MUST ELEVATE AND MAINTAIN. | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: EXPOSED HAND HASHING SINK IN REAR KITCHEN FOOD PREP AREA DRAINING SLOW. MUST REPAIR AND MAINTAIN. | 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: OBSERVED RUSTY FOOD STORAGE SHELVING ON BASEMENT. MUST REPAINT AND MAINTAIN. | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED DIRT AND DEBRIS ON FOOR UNDERNEATH PALLETS ON BASEMENT. MUST CLEAN AND MAINTAIN.

 

Looking at the structure: ViolationID#. Violation_Type - Comments. I can think of two ways to parse these strings into ID#, Type, Comments:

 

1) Cols > Utilities > Text to Columns, then Tables > Stack, then delete blank rows, then parse the new column with Word(text,delimiter)

2) Use Words(Violations, "|") to create a list like the one below then work on each of the elements in the list

 

{"40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: STEM THERMOMETER NEEDED TO MONITOR FOOD TEMPERATURES. INSTRUCTED TO PROVIDE AND MAINTAIN. ", " 41. PREMISES MAINTAINED FREE OF LITTER, UNNECESSARY ARTICLES, CLEANING  EQUIPMENT PROPERLY STORED - Comments: OBSERVED A FEW ITEMS NOT ELEVATED 6INCHES OFF FLOOR IN REAR PREP AREA. MUST ELEVATE AND MAINTAIN. ", " 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: EXPOSED HAND HASHING SINK IN REAR KITCHEN FOOD PREP AREA DRAINING SLOW. MUST REPAIR AND MAINTAIN. ", " 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: OBSERVED RUSTY FOOD STORAGE SHELVING ON BASEMENT. MUST REPAINT AND MAINTAIN. ", " 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED DIRT AND DEBRIS ON FOOR UNDERNEATH PALLETS ON BASEMENT. MUST CLEAN AND MAINTAIN."}

 

I don't have a lot of experience using lists.

 

Would it be more efficient to work with the list {}, use the Text to Columns approach or is there some better way? If the list approach is the most efficient, what is the most efficient approach to do the equivalent to Tables > Stack for each row, element in the list?

 

Alternatively, there is a column with InspectionID#. Would it be more efficient to break this into two tables using the InspectionID# as the key instead of creating a massive stacked table?

1 ACCEPTED SOLUTION

Accepted Solutions
Craige_Hales
Super User

Re: How to efficiently parse structured text strings of varying number of elements

Just playing with the first 100 rows, here's a map:

LocationsLocations

Here's a formula column to get just the comment fields:

a = :Violations;
b = "";
Pat Match( a, // full violation text
    Pat Repeat(
        Pat Arb() + "Comments:" 
         + ( Pat Break( "|" ) | Pat Rem()) >> c // the interesting bit
         + Pat Test( // not really a test, just some JSL to build text
            b = b || "|" || c; // build text here
            1; // always succeed the test
        )
    ),
    NULL, // no replacement
    FULLSCAN // get the final one
);
b;

Text explorer can use the new column to form clustersclustersclusters

 

Plotting the cluster number for the comments against the actual inspection result looks promising; cluster 2 comments did not get failing grades and cluster 4 comments failed. Typically something like "OBSERVED GREASE BUILD UP ON VENTILATION HOOD FILTERS. MUST CLEAN VENTILATION HOOD FILTERS" (pass) vs "NO HAND WASHING NOR REMOVAL OF FOOD GLOVES" (fail).

clustered by comment textclustered by comment text

 

 

Craige

View solution in original post

3 REPLIES 3
Craige_Hales
Super User

Re: How to efficiently parse structured text strings of varying number of elements

I don't think the 145K rows will be too slow for any of the parsing ideas. I think you will probably want indicator columns for each kind of violation, 0 or 1. If you have JMP 13, you can use text explorer on the comments; I'd suggest running all the comments, without the constant descriptions, together into a single character field.

 

Craige
Craige_Hales
Super User

Re: How to efficiently parse structured text strings of varying number of elements

Just playing with the first 100 rows, here's a map:

LocationsLocations

Here's a formula column to get just the comment fields:

a = :Violations;
b = "";
Pat Match( a, // full violation text
    Pat Repeat(
        Pat Arb() + "Comments:" 
         + ( Pat Break( "|" ) | Pat Rem()) >> c // the interesting bit
         + Pat Test( // not really a test, just some JSL to build text
            b = b || "|" || c; // build text here
            1; // always succeed the test
        )
    ),
    NULL, // no replacement
    FULLSCAN // get the final one
);
b;

Text explorer can use the new column to form clustersclustersclusters

 

Plotting the cluster number for the comments against the actual inspection result looks promising; cluster 2 comments did not get failing grades and cluster 4 comments failed. Typically something like "OBSERVED GREASE BUILD UP ON VENTILATION HOOD FILTERS. MUST CLEAN VENTILATION HOOD FILTERS" (pass) vs "NO HAND WASHING NOR REMOVAL OF FOOD GLOVES" (fail).

clustered by comment textclustered by comment text

 

 

Craige
markschahl
Level V

Re: How to efficiently parse structured text strings of varying number of elements

Craige:

 

Thanks! Being a text analytics newbie, I have never used the Pattern Matching functions before. Nice solution to the problem. I will definitely spend some time learning about regex and pattern matching. 

 

BTW - text analytics would make a great Mastering JMP Webinar topic!