Hi again,
After some experimentation, I found a way to accomplish my journal article text parsing goal using the simple formula provided by @Thierry_S . The challenge in my case was that my "articles" come from a huge number of different journals, all of which have different formats and subheading labels. Consequently, no standard pattern-oriented approach would work. What I ended up doing required some manual editing of PDFs, but it wasn't that bad and it seems to be 100% effective.
First, I opened each PDF and edited the text so that each of the standard content blocks had its own delimiter. In my case, I wanted the standard categories of abstract, introduction, methods, results, discussion, and literature cited. So, within each paper, I found the first heading that matched my concept of "methods". In some cases, this was actually the word "methods", but in other cases it was something like "materials and methods" or "study area". Either way, at the end of that subheading I added a unique delimiter (e.g., CCCCC). Then, I found the first subheading that matched my concept of "results", which was often the word "results", but every journal is different and sometimes other words were used. Regardless of the subheading, I added a unique delimiter (e.g., DDDDD). I did the same for each of the six standard content blocks I wanted across my full corpus.
Next, I converted all of my edited PDF files to .txt files (using different software, outside of JMP).
Then, I used the AWESOME "Import multiple files" tool in JMP to create a data table that had one row per article with each article's full text in a single column.
At this point, all I had to do was create 6 new columns, one for each heading, and use a variation of the formula provided by Thierry_S to collect my desired content chunks. For example, to create a "methods" column, I used the formula:
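Munger(
    :TEXT,
    // offset: position of the "methods" delimiter
    Contains( Uppercase( :TEXT ), "CCCCC" ),
    // length: distance from the "methods" delimiter to the "results" delimiter
    Contains( Uppercase( :TEXT ), "DDDDD" ) - Contains( Uppercase( :TEXT ), "CCCCC" )
)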
Then, to create a "results" column, I used the formula:
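Munger(
    :TEXT,
    // offset: position of the "results" delimiter
    Contains( Uppercase( :TEXT ), "DDDDD" ),
    // length: distance from the "results" delimiter to the next delimiter
    Contains( Uppercase( :TEXT ), "EEEEE" ) - Contains( Uppercase( :TEXT ), "DDDDD" )
)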
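For anyone adapting this, here is a rough variant of the "methods" column formula, offered only as a sketch. Because the formulas above start at the delimiter's position, the returned chunk begins with the delimiter text itself, and Contains() returns 0 when a delimiter is not found. The version below skips past the delimiter and returns an empty string when either delimiter is missing or out of order. It assumes the delimiters appear in the text exactly as typed (uppercase, 5 characters); the local names startPos, endPos, and tagLen are just illustrative.

```jsl
// Sketch of a column formula for "methods"; not the formula I actually used.
Local( {startPos, endPos, tagLen = 5},
	// Position of each delimiter, or 0 if it is not found
	startPos = Contains( :TEXT, "CCCCC" );
	endPos = Contains( :TEXT, "DDDDD" );
	If( startPos > 0 & endPos > startPos,
		// Take the text after CCCCC and before DDDDD, without surrounding blanks
		Trim( Substr( :TEXT, startPos + tagLen, endPos - startPos - tagLen ) ),
		// Leave the cell empty when the delimiters are missing or out of order
		""
	)
)
```

An empty cell in the resulting column then points to an article whose delimiters need another look.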
In the end, I have a data table for 800 articles where each article has information in the standard columns of abstract, introduction, methods, results, discussion, and literature cited. BINGO, Hallelujah!
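If you would rather script the six columns than build each formula by hand, something along these lines might work. It is only a sketch under some assumptions: the full text is in a column named TEXT, the delimiters follow the same 5-letter pattern (only CCCCC, DDDDD, and EEEEE are named above, so AAAAA, BBBBB, and FFFFF are placeholders), and the last block runs to the end of the text. It writes static values rather than formula columns.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();

// Delimiters and column names, in the order the blocks appear in each article.
// AAAAA, BBBBB and FFFFF are assumed placeholders.
tags = {"AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFF"};
colNames = {"abstract", "introduction", "methods", "results", "discussion", "literature cited"};

For( i = 1, i <= N Items( tags ), i++,
	col = dt << New Column( colNames[i], Character, Nominal );
	For( r = 1, r <= N Rows( dt ), r++,
		txt = dt:TEXT[r];
		startPos = Contains( txt, tags[i] );
		// The last block ends at the end of the text; the others end at the next delimiter
		endPos = If( i < N Items( tags ),
			Contains( txt, tags[i + 1] ),
			Length( txt ) + 1
		);
		// Rows with a missing or out-of-order delimiter are left empty
		If( startPos > 0 & endPos > startPos,
			col[r] = Trim( Substr( txt, startPos + Length( tags[i] ), endPos - startPos - Length( tags[i] ) ) )
		);
	);
);
```

Since these are plain values, the script would need to be rerun if the TEXT column changes.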
While this approach did require the manual step of inserting my delimiters into PDF files, that process only took a day. Given the enormous variation in article formatting, and the absence of consistent "tag" metadata across articles, this was a pretty foolproof way to go. Maybe there is a way to insert these tags automatically, by searching for a list of words that might meet the section criteria (e.g., "methods", "results"), but as mentioned elsewhere in this thread, those are very common words in journal articles that are likely to occur in just about any section. The manual approach allowed me to capture text from all my PDFs, whether they originated as PDF files from journals or as poorly scanned reports that have been subjected to OCR over the past 30 years.
Once again, thank you @Thierry_S for your post. The flexibility of JMP strikes again!
Casey