DMR
Level V

Creating a spelling checker

Hi,

 

I have some text files downloaded from the internet, and intend to run some text analytics on them. They're littered with typos however, and I want to tidy them up as much as I can before feeding them into the Text Explorer platform. The most obvious first task will be to correct automatically as many of those typos as possible.

 

To do that, I need a relatively comprehensive list of common words against which to compare my files. MS Word's spelling checker must have access to such a list. I don't know whether that list is integral to MS Word or is just a standard component of Windows - but either way, does anyone know if I can access that list directly from JMP?

 

On a related point, how might I set about calculating a "best match" between any incorrectly-spelt word and a list of correctly-spelt English words? I've no doubt there are any number of ways to do this, but there must be some fairly standard ones. (I'm not familiar enough with Regex to know if this is how to tackle such a problem, but if I have to learn it then I will.)

 

Many thanks.

1 ACCEPTED SOLUTION

gzmorgan0
Super User (Alumni)

Re: Creating a spelling checker

Just an FYI. Attached is a script that defines a function called computeLevenshteinDistance() and includes multiple examples and a description of the Shortest Edit Script() function. It also includes examples of finding the longest common sequence.

 

This script was written as an extra example for cleaning up text, documented in Chapter 9 of JSL Companion, Applications of the JMP Scripting Language, 2nd Edition, and because we found it to be an interesting function. 

 

When cleaning up survey or comment columns (not an entire document), column Recode can be useful. 


6 REPLIES

Re: Creating a spelling checker

I do not know how to access an external list, such as the one that might be used by the spelling checker in MS Word, from JMP. Moreover, the list is only one part of the checker, and the checker itself would be difficult to duplicate in a JMP script.

 

You might look into creating a Visual Basic macro for MS Word to check and correct spelling in the original text files before importing them to a JMP data table for analysis.

 

Regular expressions are a powerful way to match patterns but that approach would become cumbersome if applied to the task of finding all the misspelled words and returning the correct word.

Re: Creating a spelling checker

Two things:

1.  The LaTeX community has a number of word lists that are used with the embedded spell checkers in their IDEs.  You might be able to use one of them.

2.  Rather than trying to hack your own spell checker, you might try hooking into GNU Aspell directly.  It's pretty solid (and I believe is smart enough to ignore code).  Not sure how you would do that, but you might be able to accomplish it through a Python call or something.  

 

You can get directed to word lists and the Aspell app here: http://aspell.net.

 

M

Craige_Hales
Super User

Re: Creating a spelling checker

There is some word list info in this blog post: https://community.jmp.com/t5/Uncharted/WordNet/ba-p/28984

A long time ago an algorithm like this was used:

Look up the word in an associative array of known good words. If the lookup fails, try some rules for removing the suffixes s, ing and ed, and look it up again. If it still fails, make suggestions by looking up similar words formed by switching pairs of letters, inserting letters, or deleting letters.
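
For anyone who wants to see the shape of that algorithm, here is a rough sketch. It is written in Python rather than JSL, purely as an illustration. The names (`strip_suffixes`, `is_known`, `variants`, `suggest`) are made up for this example, a plain Python set stands in for the associative array, and the suffix rules are deliberately naive (for instance, "making" would not be reduced to "make").

```python
import string

def strip_suffixes(word):
    """Yield candidate stems by removing a few common suffixes (naive rules)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: -len(suffix)]

def is_known(word, known):
    """True if the word, or one of its suffix-stripped stems, is in the known set."""
    word = word.lower()
    return word in known or any(stem in known for stem in strip_suffixes(word))

def variants(word):
    """Words formed by swapping adjacent letters, deleting a letter, or inserting a letter."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in string.ascii_lowercase}
    return swaps | deletes | inserts

def suggest(word, known):
    """Return [] for a recognised word, else the known words one edit away."""
    word = word.lower()
    if is_known(word, known):
        return []
    return sorted(v for v in variants(word) if v in known)

# Hypothetical miniature word list, for illustration only.
known_words = {"walk", "receive", "spelling", "the"}
print(suggest("walked", known_words))   # recognised via the "ed" rule
print(suggest("recieve", known_words))  # found by swapping "i" and "e"
```

In practice the known-word set would be loaded from one of the word lists mentioned elsewhere in this thread.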

I think the suggestions to use an external program are appropriate, and I think regex is mostly not the answer. You might be able to use Recode within Text Explorer, though it will be a manual process.

Here's an interesting function in JMP which might be useful if you decide to make your own solution:

shortest edit script("maintainence","maintenance")

{{"Common", "maint"}, {"Insert", "en"}, {"Common", "a"}, {"Remove", "i"}, {"Common",
"n"}, {"Remove", "en"}, {"Common", "ce"}}

The edit script is a description of where the two words differ. You could create a metric from that using number of characters inserted and deleted.
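
As a small illustration of that metric, the sketch below scores an edit script by adding up the characters inserted and removed. It is Python, not JSL, and `edit_cost` is a made-up name. The list of pairs is just the JMP output above transcribed by hand into Python tuples.

```python
def edit_cost(edit_script):
    """Sum the lengths of all inserted and removed pieces; common pieces cost nothing."""
    return sum(len(text) for op, text in edit_script if op in ("Insert", "Remove"))

# The edit script shown above for "maintainence" -> "maintenance".
script = [
    ("Common", "maint"), ("Insert", "en"), ("Common", "a"), ("Remove", "i"),
    ("Common", "n"), ("Remove", "en"), ("Common", "ce"),
]
print(edit_cost(script))  # 2 inserted + 1 removed + 2 removed = 5
```

A lower score means the two words are closer, so the candidate with the smallest score would be the suggested correction.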

 

Craige
pmroz
Super User

Re: Creating a spelling checker

For word comparison you could use the Levenshtein distance.  https://en.wikipedia.org/wiki/Levenshtein_distance
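
A minimal sketch of how that could be used for a "best match", in Python rather than JSL. `levenshtein` follows the standard dynamic-programming definition, while `best_match` and the sample word list are hypothetical helpers for illustration only.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if the letters match
            ))
        previous = current
    return previous[-1]

def best_match(word, word_list):
    """Return (closest word, distance); ties go to the earliest entry in word_list."""
    return min(((w, levenshtein(word, w)) for w in word_list), key=lambda pair: pair[1])

print(levenshtein("kitten", "sitting"))  # 3
print(best_match("speling", ["spelling", "speaking", "selling"]))  # ('spelling', 1)
```

Comparing every typo against a full dictionary this way is slow for large lists, so it may be worth restricting the candidates first, for example to words of similar length.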

 

DMR
Level V

Re: Creating a spelling checker

Many thanks to everyone who has replied to this thread: every suggestion has been really helpful. I shall be allocating some time to look into the Levenshtein Distance to see what can be done with it (for which the script is going to be invaluable); also the Shortest Edit Script() function, which looks as if it could be very useful. In accordance with the advice received I shan't spend time on Regex, but I may well look into the possibility of writing a Word macro - we'll see how it goes. It also occurs to me that somebody may well have tackled this in R, in which case there might be something out there on the internet that could be imported via JMP's R-related functions. Plenty of avenues to explore, but I'll mark the question as solved.

 

Once again, many thanks to one and all.