Predicting book authors in JMP® Pro

Current events spur a thought experiment

I saw a headline in the news the other day about a professor who failed half the students in a class for using an AI to write their final essays. The headline was much more exciting than the actual story: in an ironic twist, the professor had misused AI in the effort to detect the use of AI. Anyway, that got me thinking: can JMP determine who wrote a body of text – and can it do this for someone who doesn't know much about text analysis? It turns out JMP Pro can, as shown in the graph below:

Jed_Campbell_0-1684954173762.png


Getting text to analyze

Since I don't have access to a bunch of student papers, I used Project Gutenberg, which hosts books whose copyright has expired. I chose six books by five authors as candidates:

  • "The Great Gatsby" by F. Scott Fitzgerald
  • "Moby Dick" by Herman Melville
  • "Frankenstein" by Mary Shelley
  • "The Call of the Wild" by Jack London
  • "The Picture of Dorian Gray" by Oscar Wilde
  • "The Importance of Being Earnest" by Oscar Wilde

An American, British, or Irish author wrote each of these books, and their original language was English. I made this choice because I didn't want to ruin things by accidentally choosing translated books whose translators were the same person.

Importing and cleanup

Importing the text is easy: just put all the books as individual text files in a single folder and use the Multiple File Import tool as pictured below. I made sure to select the options to add a column for the file name, to add one line per row, and to stack similar files. The end result is one data table with about 53,000 rows.

MFI.png
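For readers who want to see the same layout outside JMP, here is a minimal Python sketch of the "one line per row, stacked, with a file name column" import. It assumes UTF-8 `.txt` files in a single folder; the function name `stack_text_files` and the column names are my own illustration, not anything JMP produces.

```python
from pathlib import Path
import tempfile

def stack_text_files(folder):
    """Return one row per line of every .txt file in `folder`.

    Each row is a dict holding the source file name and the line text,
    mimicking the "add file name column" and "stack similar files" options.
    """
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                rows.append({"File Name": path.name, "Text": line.rstrip("\n")})
    return rows

# Tiny demonstration with two made-up files (not the Gutenberg books).
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "book_a.txt").write_text("first line\n\nsecond line\n", encoding="utf-8")
    Path(demo_dir, "book_b.txt").write_text("only line\n", encoding="utf-8")
    demo_rows = stack_text_files(demo_dir)

print(len(demo_rows))  # 4 rows: blank lines are still present at this stage
```

Blank lines survive the import on purpose; they are removed in the cleanup step.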

Data cleanup involved removing blank lines and the header information at the beginning of each book, as well as creating a column for the author of each book. I leaned on the built-in formula columns heavily for this, especially the First Word formula column. I attached a copy of the final data file as Cleaned Text.jmp, so anyone reading should be able to follow the analysis.
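The cleanup itself was done with JMP formula columns, but the logic can be sketched in Python. This is only an illustration under assumptions: it treats everything up to and including a line starting with `*** START OF` as header (a common Project Gutenberg marker, which may not match every file), and the `author_by_file` lookup and file name are hypothetical.

```python
def clean_rows(rows, author_by_file, start_marker="*** START OF"):
    """Drop header lines and blank lines, then tag each row with its author."""
    cleaned = []
    in_body = set()  # files whose start marker has already been seen
    for row in rows:
        name, text = row["File Name"], row["Text"]
        if name not in in_body:
            if text.startswith(start_marker):
                in_body.add(name)
            continue  # still in the header, or on the marker line itself
        if not text.strip():
            continue  # skip blank lines
        cleaned.append({"Author": author_by_file[name], **row})
    return cleaned

# Made-up rows standing in for the imported table.
demo_rows = [
    {"File Name": "gatsby.txt", "Text": "Title: The Great Gatsby"},
    {"File Name": "gatsby.txt", "Text": "*** START OF THE PROJECT GUTENBERG EBOOK ***"},
    {"File Name": "gatsby.txt", "Text": ""},
    {"File Name": "gatsby.txt", "Text": "A line of body text"},
]
cleaned = clean_rows(demo_rows, {"gatsby.txt": "F. Scott Fitzgerald"})
print(cleaned)
```

Note that this sketch only strips the header, as described above; it does not remove any license text at the end of a file.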

Running the analysis

To do the analysis, select Analyze...Text Explorer, then place the Text and File Name columns in the dialog as shown in the picture below. Leave all other options at their defaults; select OK to continue.

Text Explorer.png
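To give a feel for what a text platform works with, here is a rough Python stand-in that splits lines into lowercase word tokens and counts them. It is not JMP's tokenizer: Text Explorer has its own rules for terms, phrases, stop words, and stemming, and the regular expression below is simply my assumption of a reasonable word pattern.

```python
import re
from collections import Counter

def term_counts(lines):
    """Count lowercase word tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Letters and apostrophes only; punctuation and digits are dropped.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

# Made-up lines, not taken from the books.
demo_counts = term_counts(["The whale, the sea", "The sea was calm"])
print(demo_counts.most_common(2))  # [('the', 3), ('sea', 2)]
```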

Next, from the Red Triangle menu for the Text Explorer, choose Discriminant Analysis and, in the first dialog box that shows up, select the Author column to tell JMP Pro that you want to determine which author wrote each line. Note that Text Discriminant Analysis is only available in JMP Pro.

Disc 1.png

Click OK in the second dialog box (pictured below) to accept the defaults and start the analysis. This step might take about a minute or so to complete, as JMP does math in the background.

disc 2.png


Initial results to final results

In the Classification Summary section, things don't look very promising at first – the model misclassifies 64.5% of the lines. But keep in mind that each of the lines is, at most, 80 characters of text, and that each of the books is thousands of lines. We need a way to look at the big picture, which involves spending time with the Predicted Rate table below the Misclassification Rate section.

classification summary.png
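The gap between a poor per-line score and a useful book-level answer can be shown with a small Python sketch. It uses made-up labels, not the results above, and assumes each row of the rate table is normalized by the actual author's line count; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def predicted_rates(actual, predicted):
    """Return {actual_author: {predicted_author: share of that author's lines}}."""
    counts = defaultdict(Counter)
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return {
        a: {p: n / sum(row.values()) for p, n in row.items()}
        for a, row in counts.items()
    }

def book_level_guess(rates):
    """Pick, for each actual author, the most frequently predicted author."""
    return {a: max(row, key=row.get) for a, row in rates.items()}

# Made-up labels: most lines are wrong, yet each author's top prediction is right.
actual = ["A"] * 5 + ["B"] * 5
predicted = ["A", "A", "B", "C", "D"] + ["B", "B", "A", "C", "D"]

rates = predicted_rates(actual, predicted)
misclassified = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(misclassified)            # 0.6
print(book_level_guess(rates))  # {'A': 'A', 'B': 'B'}
```

Even with 60% of the individual lines wrong, the largest share in each row still points at the right author, which is the same idea as reading the Predicted Rate table row by row.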

Right-clicking on the Predicted Rate table and selecting Make into Data Table, followed by a quick stacking of the data table, creates the table pictured below (and attached as Classification Summary - Stacked.jmp).

Jed_Campbell_1-1684955352612.png
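Stacking simply turns one wide row per author into one long row per author and predicted author pair. A minimal Python sketch, with made-up rates and column names (`Predicted Author` and `Rate` are my assumptions, not necessarily the names in the attached table):

```python
def stack_rates(wide_rows, id_column="Author"):
    """Turn wide rows (one column per predicted author) into long rows."""
    stacked = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is carried along, not stacked
            stacked.append(
                {id_column: row[id_column], "Predicted Author": column, "Rate": value}
            )
    return stacked

# Made-up wide table with two authors.
wide = [
    {"Author": "A", "A": 0.7, "B": 0.3},
    {"Author": "B", "A": 0.4, "B": 0.6},
]
stacked = stack_rates(wide)
for row in stacked:
    print(row)
```

The long layout is what makes it easy to put the actual author on one axis, the predicted author on the other, and color by the rate.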

Using Graph Builder, a heat map of Predicted Author vs Author shows that the highest prediction for each author is correct, with the exception of Oscar Wilde in the upper-right corner. This is good, but not perfect. 

Jed_Campbell_2-1684955612395.png

Another platform that can be helpful here is Multiple Correspondence Analysis (Analyze...Multivariate Methods...Multiple Correspondence Analysis). Running it as shown below creates a very helpful graph, which shows that the Predicted Author is very close to the Actual Author. Oscar Wilde is, again, not as close a prediction as the others, but it's close enough to be able to make a good decision.

Jed_Campbell_0-1684957058841.png

Jed_Campbell_1-1684957073710.png


Conclusion and further thoughts

While the Discriminant Analysis tool didn't appear at first to have predicted cleanly which author wrote each book, a little further exploration of the output showed much more promising results: without needing to understand options in the Text Explorer or the Discriminant Analysis, JMP Pro was able to create a good prediction of which author wrote each book. The uncertainty around Oscar Wilde's prediction is a bit greater than for the others, which might be due to:

  1. Having two of his books in the data set,
  2. Having simply accepted the default settings – perhaps the model could be tuned, or
  3. The fact that "The Importance of Being Earnest" was a play and not a novel, a fact I only discovered after doing the analysis.

My guess is that Option 3 is the reason. It could be fun to repeat this experiment with other authors and other sample books. I'm interested in any results you have if you re-run the experiment. Please post your results in the comments!

Last Modified: Aug 9, 2023 10:40 AM
Comments
chuck_boiler
Staff

This is fascinating, Jed.  Since the size of the 'Wilde' file was smaller than the others, I wonder if that might explain the slightly less accurate prediction.  Bigger sample, better prediction?  Again, this is very cool.  Using this method, you might unmask the ghost writer!

Jed_Campbell
Staff

@chuck_boiler That's definitely a possibility. I think the bigger factor is that "The Importance of Being Earnest" is a play, which means its writing style is very different from all the other works. Actually, as I think more about it, the matching is even more impressive, as it spanned genres.