caseylott
Level III

Can I stop Text Explore from re-running after each add phrase/add stopword/or add recode action?

Hi all,

 

I am curating a large corpus that needs quite a bit of work. The process gets bogged down because Text Explorer automatically re-runs every time I complete an action. For example, if I recode a single term, the standard Text Explorer script runs (e.g., tokenizing, phrasing, stemming...), which can take 20-40 seconds each time. I have to recode hundreds of terms that picked up OCR errors during text file creation. Is there a way to turn off this behavior so that I can do, for example, 100 recodes and then have the script re-run once?

1 ACCEPTED SOLUTION

caseylott
Level III

Re: Can I stop Text Explore from re-running after each add phrase/add stopword/or add recode action?

I think I may have (mostly) solved this problem, at least in a way that seems like it will work for me. The key was saving the term list, phrase list, and stemming report as data tables. I could then work in these tables, adding an indicator column that records whether or not I want to add each term, phrase, or stem to the "Manage stop words" or "Manage phrases" interfaces. Once I finished flagging terms, phrases, or stems in these tables, I could subset each table to the "yeses" and paste those rows directly into the column property box of the "Manage x" interface. This way, I don't lose time waiting for the refresh after I make each decision.
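The flag-then-subset step described above can also be done outside JMP once the term list has been saved as a data table and exported. The following is only a minimal sketch in Python, not part of the original workflow: the CSV layout, the column names "Term", "Count", and "Stop", and the sample rows are all hypothetical, and whether the "Manage x" dialog accepts a one-term-per-line paste is an assumption.

```python
import csv
import io

# Hypothetical export of the Text Explorer term list with a hand-added
# "Stop" indicator column. Column names and rows are illustrative
# assumptions, not JMP defaults.
TERM_TABLE_CSV = """Term,Count,Stop
habitat,412,no
tlie,87,yes
et,350,yes
al,349,yes
nesting,120,no
"""


def flagged_terms(csv_text, term_col="Term", flag_col="Stop", yes_value="yes"):
    """Return the terms whose indicator column equals yes_value, in table order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[term_col].strip()
        for row in reader
        if row[flag_col].strip().lower() == yes_value
    ]


# Subset the table to the "yeses".
stop_words = flagged_terms(TERM_TABLE_CSV)

# Join one term per line so the whole batch can be copied in one go
# (assumed paste format, see the note above).
paste_block = "\n".join(stop_words)
print(paste_block)
```

The same function would work for a flagged phrase list or stemming report by passing different column names, so all decisions are made in the tables first and the slow refresh is triggered only once per pasted batch.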

 

Also, I took a random sample of 1000 records from my larger corpus before doing any of this work. I did the curation on this sample and then copied the final column properties to the full data set. This way, when the refresh did run, it didn't take as long. Still, it would be nice to have a platform preference that turns off the automatic refresh so it runs only when desired, perhaps through a simple refresh button.
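For anyone who wants to reproduce the sampling step outside JMP, here is a small hedged sketch. It assumes records can be addressed by a zero-based row index; the function name, the seed, and the corpus size of 5,000 (taken from the rough figure mentioned elsewhere in this thread) are illustrative only.

```python
import random


def sample_record_ids(n_records, sample_size, seed=0):
    """Pick sample_size distinct row indices from range(n_records).

    A fixed seed makes the draw repeatable, so the same curation
    sample can be rebuilt later. If the corpus is smaller than the
    requested sample, every row is returned.
    """
    rng = random.Random(seed)
    k = min(sample_size, n_records)
    return sorted(rng.sample(range(n_records), k))


# Roughly 5,000 documents, curate on 1,000 of them.
ids = sample_record_ids(5000, 1000, seed=42)
print(len(ids))
```

Sampling without replacement keeps every sampled document unique, and sorting the indices preserves the original row order in the subset.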


4 REPLIES 4
Craige_Hales
Staff (Retired)

Re: Can I stop Text Explore from re-running after each add phrase/add stopword/or add recode action?

Use the red triangle->Term Options->Manage... option.

[Image: Dialog to manage stop words]

Craige
caseylott
Level III

Re: Can I stop Text Explore from re-running after each add phrase/add stopword/or add recode action?

Thanks for the reply. Perhaps I’m doing something wrong here, but my current workflow is as follows:

1) open my table and run my most recently saved Text Explorer script.

2) scroll through the list of phrases, select phrases I’d like to look into further, and use “Add Phrase” to add them to the term list.

3) scroll through the term list, select terms or recently added phrases that I want to use as stop words and use “add stop word” to add them to the stop words list.

4) scroll through the term list, select terms I’d like to recode and use the recode dialog to do this.

Each time that I complete one of the actions in numbers 2, 3, or 4, the text analysis script runs and takes time.

When I finish a session of corpus curation, I go to the Manage stop words, manage phrases, or manage recodes dialog, take all the terms or phrases that have been stored in the “local” column, and send them over to the column property column. Each time I do this, the text analysis churns some more. If I don’t do this, all of my work is lost the next time I open JMP.

When I’ve done this, I save Text Explorer as a data table script, and save my data table.

The painful part here is that I have to wait for Text Explorer to re-run, over and over, any time I recode, add stop words, add phrases, or send my temporary results from the “local” bin of the “Manage x” dialogs to the column property bin.

My corpus has around 5,000 full text journal articles and I have many hours of curation ahead of me due to many text import errors, relatively long documents (6-20 pages of text) and the resulting long term and phrase lists.

I’m away from my office for the day, but tomorrow, I’ll subset my data table to meet the maximum size limit and drop it on this post for anyone to try it out and see what I mean.

It would save me a ton of time if I could keep Text Explorer from re-running after each curation action, and just run it periodically when I want to update my lists.

I’m still sort of a newbie to Text Explorer, so maybe my process for corpus curation is just flat-out wrong. I’d very much welcome any suggestions!
Craige_Hales
Staff (Retired)

Re: Can I stop Text Explore from re-running after each add phrase/add stopword/or add recode action?

@ErnestPasour 

Craige