Discussions

DMR · Aug 15, 2018 04:49 AM

Hi,

I have some text files downloaded from the internet, and intend to run some text analytics on them. They're littered with typos, however, and I want to tidy them up as much as I can before feeding them into the Text Explorer platform. The most obvious first task will be to automatically correct as many of those typos as possible.

To do that, I need a relatively comprehensive list of common words against which to compare my files. MS Word's spelling checker must have access to such a list. I don't know whether that list is integral to MS Word or is just a standard component of Windows - but either way, does anyone know if I can access that list directly from JMP?

On a related point, how might I set about calculating a "best match" between any incorrectly-spelt word and a list of correctly-spelt English words? I've no doubt there are any number of ways to do this, but there must be some fairly standard ones. (I'm not familiar enough with Regex to know if this is how to tackle such a problem, but if I have to learn it then I will.)

Many thanks.

Mark_Bailey · Aug 15, 2018 11:53 AM

I do not know how to access an external list, such as the one that might be used by the spelling checker in MS Word, from JMP. Moreover, the list is only the data that the checker uses; the checker itself would be difficult to duplicate in a JMP script.

You might look into creating a Visual Basic macro for MS Word to check and correct spelling in the original text files before importing them to a JMP data table for analysis.

Regular expressions are a powerful way to match patterns, but that approach would become cumbersome if applied to the task of finding all the misspelled words and returning the correct word for each.

MikeD_Anderson · Aug 15, 2018 12:44 PM

Two things:

1. The LaTeX community has a number of word lists that are used with the embedded spell checkers in their IDEs. You might be able to use one of them.

2. Rather than trying to hack your own spell checker, you might try hooking into GNU Aspell directly. It's pretty solid (and I believe is smart enough to ignore code). Not sure how you would do that, but you might be able to accomplish it through a Python call or something.

You can get directed to word lists and the Aspell app here: http://aspell.net.
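For anyone wanting to try the Aspell route from Python, here is a rough sketch rather than a tested integration. It assumes the `aspell` executable is installed and on the PATH, that it is run in its ispell-compatible pipe mode (`-a`), and that result lines follow the ispell convention (`*` for a correct word, `& word count offset: suggestions` for a misspelling with suggestions, `# word offset` for one without). The function names `parse_aspell_line` and `check_line` are made up for illustration, and how the result gets back into JMP is left open.

```python
import subprocess


def parse_aspell_line(line):
    """Parse one result line of ispell-style pipe output.

    Returns (word, suggestions) for a misspelt word, or None for any other
    line (the version banner, '*' for a correct word, blank lines).
    """
    if line.startswith("&"):
        # Assumed format: "& word count offset: suggestion, suggestion, ..."
        head, _, tail = line.partition(":")
        word = head.split()[1]
        return word, [s.strip() for s in tail.split(",")]
    if line.startswith("#"):
        # Assumed format: "# word offset" (misspelt, no suggestions)
        return line.split()[1], []
    return None


def check_line(text, lang="en"):
    """Send a single line of text to aspell and collect the misspellings.

    Assumes 'aspell' is on the PATH. The leading '^' asks the checker to
    treat the rest of the line as text rather than as a pipe-mode command.
    """
    result = subprocess.run(
        ["aspell", "-a", f"--lang={lang}"],
        input="^" + text + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    parsed = (parse_aspell_line(line) for line in result.stdout.splitlines())
    return [item for item in parsed if item is not None]
```

Calling `check_line` on one line of a file at a time would return a list of (word, suggestions) pairs; choosing which suggestion to apply is still a separate decision.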

M

Craige_Hales · Aug 15, 2018 01:37 PM

There is some word list info in this blog: https://community.jmp.com/t5/Uncharted/WordNet/ba-p/28984

A long time ago an algorithm like this was used:

Look up the word in an associative array of known good words. If the lookup fails, try some rules for removing s, ing, ed suffixes and look it up again. If it still fails, make suggestions by looking up similar words formed by switching pairs of letters, inserting letters, deleting letters.
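The steps above can be sketched roughly as follows, in Python rather than JSL. This is a minimal illustration, not the historical algorithm itself: `KNOWN_WORDS` is a tiny made-up stand-in for a real word list, the suffix rule is a naive strip with no spelling adjustments, and the names `is_known`, `candidates` and `suggest` are hypothetical.

```python
import string

# Tiny hypothetical word list; a real one would be loaded from a file.
KNOWN_WORDS = {"maintain", "walk", "letter", "form", "from", "the"}
SUFFIXES = ("ing", "ed", "s")


def is_known(word, known=KNOWN_WORDS):
    """True if the word, or the word minus a simple suffix, is in the list."""
    word = word.lower()
    if word in known:
        return True
    return any(
        word.endswith(suffix) and word[: -len(suffix)] in known
        for suffix in SUFFIXES
    )


def candidates(word):
    """All strings one adjacent swap, insertion, or deletion away from word."""
    word = word.lower()
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    inserts = {a + c + b for a, b in splits for c in string.ascii_lowercase}
    deletes = {a + b[1:] for a, b in splits if b}
    return swaps | inserts | deletes


def suggest(word, known=KNOWN_WORDS):
    """Return [] for an accepted word, otherwise the known near-misses."""
    if is_known(word, known):
        return []
    return sorted(c for c in candidates(word) if is_known(c, known))
```

With this word list, `suggest("teh")` offers "the" and `suggest("walked")` returns an empty list because "walk" plus "ed" is accepted. A Python set plays the role of the associative array here.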

I think the suggestions to use an external program are appropriate, and I think regex is mostly not the answer. You might be able to use Recode within Text Explorer, though it will be a manual process.

Here's an interesting function in JMP which might be useful if you decide to make your own solution:

shortest edit script("maintainence","maintenance")

{{"Common", "maint"}, {"Insert", "en"}, {"Common", "a"}, {"Remove", "i"}, {"Common",
"n"}, {"Remove", "en"}, {"Common", "ce"}}

The edit script is a description of where the two words differ. You could create a metric from that using the number of characters inserted and deleted.
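As a small illustration of that metric, sketched in Python rather than JSL: the edit script below is copied from the output shown above, and the helper names `rebuild` and `edit_cost` are made up for this example. The cost is simply the total length of the "Insert" and "Remove" pieces; other weightings are equally possible.

```python
# Edit script copied from the Shortest Edit Script() output in the post above.
EDIT_SCRIPT = [
    ("Common", "maint"), ("Insert", "en"), ("Common", "a"), ("Remove", "i"),
    ("Common", "n"), ("Remove", "en"), ("Common", "ce"),
]


def rebuild(script):
    """Reassemble both words: the first keeps Common + Remove pieces,
    the second keeps Common + Insert pieces."""
    first = "".join(text for op, text in script if op in ("Common", "Remove"))
    second = "".join(text for op, text in script if op in ("Common", "Insert"))
    return first, second


def edit_cost(script):
    """Number of characters inserted plus number of characters removed."""
    return sum(len(text) for op, text in script if op in ("Insert", "Remove"))


print(rebuild(EDIT_SCRIPT))    # ('maintainence', 'maintenance')
print(edit_cost(EDIT_SCRIPT))  # 5 (2 inserted + 3 removed)
```

A lower cost means a closer match, so the same calculation could be run against each word in a list to rank candidate corrections.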

Craige

pmroz · Aug 15, 2018 02:19 PM

For word comparison you could use the Levenshtein distance. https://en.wikipedia.org/wiki/Levenshtein_distance
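A minimal sketch of how that could give a "best match", written in Python rather than JSL and not taken from any script attached to this thread. It uses the usual two-row dynamic programming form of the Levenshtein distance; the name `best_match` and the short word list in the usage note are assumptions for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def best_match(word, word_list):
    """Return the entry of word_list closest to word.

    Ties go to whichever entry comes first in word_list.
    """
    return min(word_list, key=lambda candidate: levenshtein(word, candidate))
```

For example, `best_match("maintainence", ["maintain", "maintenance", "mountain"])` picks "maintenance". Comparing every typo against every entry of a large word list this way is slow, so in practice it may be worth restricting the comparison to words of similar length or with the same first letter.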

gzmorgan0 · Aug 15, 2018 03:17 PM

Just an FYI. Attached is a script that defines a function called computeLevenshteinDistance() and includes multiple examples and a description of the Shortest Edit Script() function. It also includes examples of finding the longest common sequence.

This script was written as an extra example for cleaning up text, documented in Chapter 9 of JSL Companion, Applications of the JMP Scripting Language, 2nd Edition, and because we found it to be an interesting function.

When cleaning up survey or comment columns (not an entire document), column Recode can be useful.

DMR · Aug 16, 2018 10:55 AM

Many thanks to everyone who has replied to this thread: every suggestion has been really helpful. I shall be allocating some time to look into the Levenshtein Distance to see what can be done with it (for which the script is going to be invaluable); also the Shortest Edit Script() function, which looks as if it could be very useful. In accordance with the advice received I shan't spend time on Regex, but I may well look into the possibility of writing a Word macro - we'll see how it goes. It also occurs to me that somebody may well have tackled this in R, in which case there might be something out there on the internet that could be imported via JMP's R-related functions. Plenty of avenues to explore, but I'll mark the question as solved.

Once again, many thanks to one and all.

Creating a spelling checker
