topic Re: How to efficiently parse structured text strings of varying number of elements in Discussions

How to efficiently parse structured text strings of varying number of elements

markschahl — Thu, 04 Jan 2018 16:28:34 GMT

Happy New Year Everyone!

I started exploring this dataset from Kaggle: https://www.kaggle.com/chicago/chi-restaurant-inspections. Each row is a restaurant inspection, with a single column for list of violations. Thankfully, the violations are delimited by "|". Sample below. The number of violations varies.

40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: STEM THERMOMETER NEEDED TO MONITOR FOOD TEMPERATURES. INSTRUCTED TO PROVIDE AND MAINTAIN. | 41. PREMISES MAINTAINED FREE OF LITTER, UNNECESSARY ARTICLES, CLEANING EQUIPMENT PROPERLY STORED - Comments: OBSERVED A FEW ITEMS NOT ELEVATED 6INCHES OFF FLOOR IN REAR PREP AREA. MUST ELEVATE AND MAINTAIN. | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: EXPOSED HAND HASHING SINK IN REAR KITCHEN FOOD PREP AREA DRAINING SLOW. MUST REPAIR AND MAINTAIN. | 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: OBSERVED RUSTY FOOD STORAGE SHELVING ON BASEMENT. MUST REPAINT AND MAINTAIN. | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED DIRT AND DEBRIS ON FOOR UNDERNEATH PALLETS ON BASEMENT. MUST CLEAN AND MAINTAIN.

Looking at the structure: ViolationID#. Violation_Type - Comments. I can think of two ways to parse these strings into ID#, Type, Comments:

1) Cols > Utilities > Text to Columns, then Tables > Stack, then delete blank rows, then parse the new column with Word(text,delimiter)

2) Use Words(Violations, "|") to create a list like the one below then work on each of the elements in the list

{"40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: STEM THERMOMETER NEEDED TO MONITOR FOOD TEMPERATURES. INSTRUCTED TO PROVIDE AND MAINTAIN. ", " 41. PREMISES MAINTAINED FREE OF LITTER, UNNECESSARY ARTICLES, CLEANING EQUIPMENT PROPERLY STORED - Comments: OBSERVED A FEW ITEMS NOT ELEVATED 6INCHES OFF FLOOR IN REAR PREP AREA. MUST ELEVATE AND MAINTAIN. ", " 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: EXPOSED HAND HASHING SINK IN REAR KITCHEN FOOD PREP AREA DRAINING SLOW. MUST REPAIR AND MAINTAIN. ", " 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: OBSERVED RUSTY FOOD STORAGE SHELVING ON BASEMENT. MUST REPAINT AND MAINTAIN. ", " 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED DIRT AND DEBRIS ON FOOR UNDERNEATH PALLETS ON BASEMENT. MUST CLEAN AND MAINTAIN."}

I don't have a lot of experience using lists.

Would it be more efficient to work with the list {}, use the Text to Columns approach or is there some better way? If the list approach is the most efficient, what is the most efficient approach to do the equivalent to Tables > Stack for each row, element in the list?

Alternatively, there is a column with InspectionID#. Would it be more efficient to break this into two tables using the InspectionID# as the key instead of creating a massive stacked table?

Re: How to efficiently parse structured text strings of varying number of elements

Craige_Hales — Thu, 04 Jan 2018 20:30:28 GMT

I don't think the 145K rows will be too slow for any of the parsing ideas. I think you will probably want indicator columns for each kind of violation, 0 or 1. If you have JMP 13, you can use text explorer on the comments; I'd suggest running all the comments, without the constant descriptions, together into a single character field.

Re: How to efficiently parse structured text strings of varying number of elements

Craige_Hales — Fri, 05 Jan 2018 20:02:43 GMT

Just playing with the first 100 rows, here's a map:

Locations

Here's a formula column to get just the comment fields:

a = :Violations;
b = "";
Pat Match( a, // full violation text
    Pat Repeat(
        Pat Arb() + "Comments:" 
         + ( Pat Break( "|" ) | Pat Rem()) >> c // the interesting bit
         + Pat Test( // not really a test, just some JSL to build text
            b = b || "|" || c; // build text here
            1; // always succeed the test
        )
    ),
    NULL, // no replacement
    FULLSCAN // get the final one
);
b;

Text explorer can use the new column to form clustersclusters

Plotting the cluster number for the comments against the actual inspection result looks promising; cluster 2 comments did not get failing grades and cluster 4 comments failed. Typically something like "OBSERVED GREASE BUILD UP ON VENTILATION HOOD FILTERS. MUST CLEAN VENTILATION HOOD FILTERS" (pass) vs "NO HAND WASHING NOR REMOVAL OF FOOD GLOVES" (fail).

clustered by comment text

Re: How to efficiently parse structured text strings of varying number of elements

markschahl — Mon, 08 Jan 2018 18:28:04 GMT

Craige:

Thanks! Being a text analytics newbie, I have never used the Pattern Matching functions before. Nice solution to the problem. I will definitely spend some time learning about regex and pattern matching.

BTW - text analytics would make a great Mastering JMP Webinar topic!