Word Storm

Craige_Hales · May 5, 2019 08:24 PM

Video below, files attached. This project was done in about three stages: data collection, a time-consuming pre-processing step, and a video frame creation step.

The data was collected by wikipediaTextParser.jsl, which downloaded a lot of Wikipedia web pages (done in several batches, over several days). There is a random button on the Wikipedia home page that links to a random article. JMP can open that link, in a loop, and write the HTML to a table. (If you do this, you should really consider donating the next time they ask.) The download loop also did some processing to remove a bunch of HTML and try to identify the interesting part of the document; you'll still see a few words in the video that should have been filtered. This step also searched for dates in a specific format to attempt to assign a date to each article; the oldest date seemed best since most articles get touched pretty often. When the download is done, the table is sorted by date.
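Here is a rough sketch of the "oldest date wins" idea, in Python rather than the JSL in the attached file. The date format ("12 March 2004") and the name oldest_date are assumptions for illustration; the actual script may look for a different format.

```python
import re
from datetime import date

# Assumption: dates look like "12 March 2004". The real script's format may differ.
MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")


def oldest_date(text):
    """Return the earliest valid date found in text, or None if there is none."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTHS[month], int(day)))
        except ValueError:
            # Skip impossible dates such as "31 February 2010".
            continue
    return min(found) if found else None
```

Taking the minimum is what makes a recently edited article still land near its original date.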

Generator2.jsl is the pre-processing step; it uses JMP's Text Explorer platform to look at subsets of the documents. I started with all the documents and made a list of stop words that are removed. Each subset is about 1500 documents and overlaps the previous subset by a lot. A table of terms is saved from each subset and copied into a bigger table. The bigger table is 300 rows by 3600 columns: I'm keeping the 300 most popular terms, and I need 1800 frames to make 60 seconds of 30 frames/second video. Each pair of columns is a term and a frequency, so 1800 frames take 3600 columns. Yes, if I did it over, I might turn it 90 degrees and have 600 columns by 1800 rows. This ran overnight.
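The overlapping-subset idea can be sketched like this in Python. This is not the Text Explorer code: the whitespace tokenizer, the evenly spaced window starts, and the names window_starts, top_terms and frames are all assumptions made for illustration.

```python
from collections import Counter


def window_starts(n_docs, window, n_frames):
    """Evenly spaced start rows so n_frames windows span all the documents."""
    if n_docs < window or n_frames < 2:
        raise ValueError("need at least one full window and two frames")
    last = n_docs - window
    return [round(i * last / (n_frames - 1)) for i in range(n_frames)]


def top_terms(docs, stop_words, keep):
    """Count words in docs (naive whitespace split) and keep the most common."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in doc.lower().split() if w not in stop_words)
    return counts.most_common(keep)


def frames(docs, window=1500, n_frames=1800, keep=300, stop_words=frozenset()):
    """Yield one (term, frequency) list per video frame from date-sorted docs."""
    for start in window_starts(len(docs), window, n_frames):
        yield top_terms(docs[start:start + window], stop_words, keep)
```

With the defaults, each frame is a list of up to 300 (term, frequency) pairs, and 60 seconds at 30 frames/second gives the 1800 frames. Because consecutive windows share most of their documents, the frequencies change slowly from frame to frame.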

ProcessMC4.jsl, the final step, is mostly JSL to build the data to feed Graph Builder. I used "Use for Marker" to turn the term into a word in the graph and let Graph Builder figure out how big to make it. There is a fair amount of JSL that runs a force-directed algorithm to help the words find a nice place to hang out. At the end of the video you can see the residual jittering after the data has stopped flowing but the force-directed code is still running. For all that, this still makes several frames a second. (I did a lot of tuning to find some force-directed parameters that worked pretty well. One of the surprises was discovering less-is-more when the words are looking at their nearest neighbors: 1 or 2 neighbors gave nice clearance around the words, but 5 or more did not do so well. Perhaps I messed up the distance calculation when summing up the force vectors.)
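For readers who have not seen a force-directed layout, here is a minimal Python sketch of one relaxation step where each point is pushed away from only its k nearest neighbors. It is not the JSL from ProcessMC4.jsl; the inverse-square falloff, the parameter values and the name relax are assumptions.

```python
import math


def relax(points, k=2, repel=0.01, step=1.0):
    """One iteration: push each (x, y) point away from its k nearest neighbors."""
    moved = []
    for i, (x, y) in enumerate(points):
        # Distance to every other point, nearest first.
        others = sorted(
            (math.hypot(x - ox, y - oy), ox, oy)
            for j, (ox, oy) in enumerate(points) if j != i
        )
        fx = fy = 0.0
        for d, ox, oy in others[:k]:
            d = max(d, 1e-6)  # avoid dividing by zero for coincident points
            strength = repel / (d * d)  # assumed inverse-square repulsion
            # (x - ox) / d is the unit vector pointing away from the neighbor.
            fx += (x - ox) / d * strength
            fy += (y - oy) / d * strength
        moved.append((x + step * fx, y + step * fy))
    return moved
```

Calling relax repeatedly, once per frame, spreads the points out; k is the "how many nearest neighbors" knob described above.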

Video (wants to be maximized, there are a lot of small words)
