Current events spur a thought experiment
I saw a headline in the news the other day about a professor who failed half the students in a class for using an AI to write their final essays. It turns out that the headline was much more exciting than the actual story: in an ironic twist, the professor misused AI in the effort to detect the use of AI. Anyway, that got me thinking: can JMP determine who wrote a body of text, and can it do this for someone who doesn't know much about text analysis? It turns out JMP Pro can, as shown in the graph below:
Getting text to analyze
Since I don't have access to a bunch of student papers, I used Project Gutenberg, which hosts books whose copyright has expired. I chose six books by five authors as candidates:
- "The Great Gatsby" by F. Scott Fitzgerald
- "Moby Dick" by Herman Melville
- "Frankenstein" by Mary Shelley
- "The Call of the Wild" by Jack London
- "The Picture of Dorian Gray" by Oscar Wilde
- "The Importance of Being Earnest" by Oscar Wilde
Each of these books was originally written in English by an American, British, or Irish author. I made this choice because I didn't want to ruin things by accidentally choosing translated books that shared the same translator.
Importing and cleanup
Importing the text is easy: just put all the books as individual text files in a single folder and use the Multiple File Import tool as pictured below. I made sure to select the options to add a column for the file name, to add one line per row, and to stack similar files. The end result is one data table with about 53,000 rows.
Data cleanup involved removing blank lines and the header information at the beginning of each book, as well as creating a column with the name of the author of each book. I leaned on the built-in formula columns heavily for this, especially the First Word formula column. I attached a copy of the final data file as Cleaned Text.jmp, so anyone reading should be able to follow the analysis.
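For readers who want to see the shape of that cleanup outside JMP, here is a minimal Python sketch of the same idea: one row per line, a column for the file name, blank lines dropped, and an author column added. This is not what JMP does internally. The function name `stack_books`, the `authors` lookup, the file names, and the sample text are all placeholders I made up, and the `start_marker` used to skip the header is an assumption about how the files are laid out, so check it against your own downloads.

```python
def stack_books(books, authors, start_marker="*** START OF"):
    """Turn {file_name: full_text} into a list of row dicts, one per non-blank line."""
    rows = []
    for file_name, text in books.items():
        lines = text.splitlines()
        # Assumption: everything up to and including the first line that
        # starts with the marker is header. If no marker is found, keep it all.
        start = 0
        for i, line in enumerate(lines):
            if line.startswith(start_marker):
                start = i + 1
                break
        for line in lines[start:]:
            if line.strip():  # drop blank lines
                rows.append({
                    "File Name": file_name,
                    "Author": authors[file_name],
                    "Text": line.strip(),
                })
    return rows


# Hypothetical file names and made-up text, just to show the output shape.
books = {
    "gatsby.txt": "Title: The Great Gatsby\n*** START OF THE BOOK ***\n\nFirst line.\n\nSecond line.\n",
    "earnest.txt": "No marker here.\n\nStill kept.\n",
}
authors = {"gatsby.txt": "F. Scott Fitzgerald", "earnest.txt": "Oscar Wilde"}

rows = stack_books(books, authors)
for row in rows:
    print(row)
```

The result is a single stacked table, which is the same layout the Multiple File Import options produce before the author column is added.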
Running the analysis
To do the analysis, select Analyze...Text Explorer, then place the Text and File Name columns in the dialog as shown in the picture below. Leave all other options at their defaults; select OK to continue.
Next, from the Red Triangle menu for the Text Explorer, choose Discriminant Analysis and, in the first dialog box that shows up, select the Author column to tell JMP Pro that you want to determine which author wrote each line. Note that Text Discriminant Analysis is only available in JMP Pro.
Click OK in the second dialog box (pictured below) to accept the defaults and start the analysis. This step might take about a minute or so to complete, as JMP does math in the background.
Initial results to final results
In the Classification Summary section, things don't look very promising at first – the model misclassifies 64.5% of the lines. But keep in mind that each of the lines is, at most, 80 characters of text, and that each of the books is thousands of lines. We need a way to look at the big picture, which involves spending time with the Predicted Rate table below the Misclassification Rate section.
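To see why a high line-level error rate doesn't have to sink the book-level answer, here is a small Python sketch. It takes per-line predictions and picks the most common predicted author for each book. The numbers are invented for illustration (they are not from my JMP output), and `vote_by_book` is just a name I made up; it is a stand-in for the idea, not a step JMP performs.

```python
from collections import Counter, defaultdict


def vote_by_book(line_predictions):
    """line_predictions: iterable of (book, predicted_author) pairs, one per line."""
    counts = defaultdict(Counter)
    for book, predicted in line_predictions:
        counts[book][predicted] += 1
    # most_common(1) returns a one-item list holding (author, count) for the top author
    return {book: c.most_common(1)[0][0] for book, c in counts.items()}


# Invented example: only 4 of 10 lines are right (60% misclassified),
# yet the correct author still collects the most votes.
predictions = (
    [("Moby Dick", "Melville")] * 4
    + [("Moby Dick", "Shelley")] * 3
    + [("Moby Dick", "London")] * 3
)

winners = vote_by_book(predictions)
print(winners)
```

With five candidate authors, the wrong guesses are spread across four alternatives, so the right author can lead even when most individual lines are misclassified.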
Right-clicking on the Predicted Rate table and selecting Make into Data Table, followed by a quick stacking of the data table, creates the table pictured below (and attached as Classification Summary - Stacked.jmp).
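If "stacking" is a new term: it turns a wide table (one column per predicted author) into a tall one (one row per actual/predicted pair). A rough Python equivalent, using made-up rates and a hypothetical `stack` helper, looks like this:

```python
def stack(wide):
    """wide: {actual_author: {predicted_author: rate}} -> list of (actual, predicted, rate) rows."""
    return [
        (actual, predicted, rate)
        for actual, row in wide.items()
        for predicted, rate in row.items()
    ]


# Made-up rates for two authors; these are not the values from the attached table.
wide = {
    "Melville": {"Melville": 0.6, "Shelley": 0.4},
    "Shelley": {"Melville": 0.3, "Shelley": 0.7},
}

tall = stack(wide)
for row in tall:
    print(row)
```

The tall layout is what makes it easy to put the actual author on one axis, the predicted author on the other, and the rate in the cell color.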
Using Graph Builder, a heat map of Predicted Author vs Author shows that the highest prediction for each author is correct, with the exception of Oscar Wilde in the upper-right corner. This is good, but not perfect.
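The heat map check (is the strongest cell for each author the one that matches?) can also be written as a few lines of Python. This is a sketch on invented rates, and `top_prediction` is a hypothetical helper rather than a JMP feature.

```python
def top_prediction(tall):
    """tall: iterable of (actual, predicted, rate). Returns {actual: predicted with highest rate}."""
    best = {}
    for actual, predicted, rate in tall:
        # Keep the highest-rate prediction seen so far for each actual author
        if actual not in best or rate > best[actual][1]:
            best[actual] = (predicted, rate)
    return {actual: predicted for actual, (predicted, _) in best.items()}


# Invented rates in stacked form; not the values from my analysis.
tall = [
    ("Melville", "Melville", 0.6), ("Melville", "Shelley", 0.4),
    ("Shelley", "Melville", 0.3), ("Shelley", "Shelley", 0.7),
]

top = top_prediction(tall)
for actual, predicted in top.items():
    print(actual, "->", predicted, "(match)" if actual == predicted else "(mismatch)")
```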
Another platform that can be helpful here is Analyze...Multivariate Methods...Multiple Correspondence Analysis. Running it as shown below creates a very helpful graph, which shows that the Predicted Author is very close to the Actual Author. Oscar Wilde is, again, not as close a prediction as the others, but it's close enough to be able to make a good decision.
Conclusion and further thoughts
While the Discriminant Analysis tool didn't appear at first to have predicted cleanly which author wrote each book, a little further exploration of the output showed much more promising results: without needing to understand options in the Text Explorer or the Discriminant Analysis, JMP Pro was able to create a good prediction of which author wrote each book. The uncertainty around Oscar Wilde's prediction is a bit greater than the others, which might be due to:
- Having two of his books in the data set,
- Having simply accepted the default settings (perhaps the model could be tuned), or
- The fact that "The Importance of Being Earnest" was a play and not a novel, a fact I only discovered after doing the analysis.
My guess is that the third option is the reason. It could be fun to repeat this experiment with other authors and other sample books. I'm interested in any results you have if you re-run the experiment. Please post your results in the comments!