Solved: Can I subset only text from the Methods section of a batch of reports when each report's full text is already in a JMP column?

caseylott · Jun 10, 2023 4:29 PM

Hi everyone,

I have a JMP table that includes the full text of a bunch of journal articles. Each record has the article's title in one column (as an ID) and the article's full text in another. I am interested in creating a new column that only has the text from the Methods section of each paper. I started to think that this might be possible after watching the @brady_brady "Recode Review" tutorial https://community.jmp.com/t5/JMP-On-Air/Recode-Review/ta-p/261025 . My idea would be to search the full text field of each record for the word "methods" and replace it with an arbitrary and uncommon delimiter (e.g., $%^). Next, I would search the full text field of each document for the word "results" and replace it with a different arbitrary delimiter (e.g., !@#). Then, I could somehow use either Text to Columns or RegEx to subset only the text between these two delimiters. This seems theoretically possible. However, I'm guessing it would be complicated (if not rendered impractical) when the words Methods or Results are present more than once in the document (which is often the case). Any suggestions/words of caution?

Thierry_S · Apr 29, 2021 01:21 AM

Hi,

Here is the most basic formula you could use to directly extract the text between the first occurrence of the word "Methods" and the first occurrence of the word "Results".

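Munger(
	:TEXT,
	// offset: position of the first "METHODS" (Uppercase makes the search case-insensitive)
	Contains( Uppercase( :TEXT ), "METHODS" ),
	// length: distance from the first "METHODS" to the first "RESULTS"
	Contains( Uppercase( :TEXT ), "RESULTS" )
	- Contains( Uppercase( :TEXT ), "METHODS" )
)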

I'm pretty sure that this will only work for a subset of the full text data because these two words are extremely common in scientific papers, but it gives you an idea of how to use JMP formulas without having to insert special characters and use Regex.
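One failure mode of the basic formula is worth guarding against: if "RESULTS" first appears before "METHODS" (in an abstract, say), the length argument goes negative, and if either word is missing, Contains() returns 0. Below is a minimal sketch of a more defensive column formula, assuming the full text lives in :TEXT as above. The local names up, start, and stop are just for illustration. It still takes the first "METHODS" it finds, so it does not solve the problem of the word appearing in running text.

```jsl
// Column formula sketch; :TEXT is the full-text column used above.
Local( {up, start, stop},
	up = Uppercase( :TEXT );
	// Position of the first "METHODS", or 0 if it never appears.
	start = Contains( up, "METHODS" );
	// Look for "RESULTS" only after the "METHODS" match.
	stop = If( start > 0, Contains( up, "RESULTS", start + 1 ), 0 );
	If( start > 0 & stop > 0,
		Substr( :TEXT, start, stop - start ),
		"" // leave the cell empty when either word is missing
	)
)
```

Rows that come back empty are then easy to filter and inspect by hand.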

Best,

TS

Thierry R. Sornasse

brady_brady · Apr 29, 2021 10:20 AM

As in many (almost all?) of these cases, "it depends". If the data is well-structured, there may be other features you can use to account for this. For example, if the papers are all from the same journal, you might be able to take advantage of a structural consistency among them. Sometimes you really do have to be creative, and sometimes it is almost impossible to come up with a way to break out the text you want with an omnibus parsing rule. More tools enable more creativity and power, of course.

Be sure to look at Recode's "extract segment" feature and, as you've mentioned, Regex.

Sorry I can't be of more help here--

Cheers,

Brady

caseylott · Apr 30, 2021 12:41 PM

Hi again,

After some experimentation, I found a way to accomplish my journal article text parsing goal using the simple formula provided by @Thierry_S . The challenge in my case was that my "articles" are from a huge number of different journals, all of which have different formats and subheading labels. Consequently, no standard pattern-oriented approach would work. What I ended up doing required some manual editing of PDFs, but it wasn't that bad and it seems to be 100% effective.

First, I opened each PDF and edited the text so that the different standard content blocks had their own delimiters. In my case, I wanted standard categories of abstract, introduction, methods, results, discussion, and literature cited. So, within each paper, I found the first heading that matched my concept of "methods". In some cases, this was actually the word "methods", but in other cases it was something like "materials and methods" or "study area". Either way, at the end of the desired subheading I added a unique delimiter (e.g., CCCCC). Then, I found the first subheading that matched my concept of "results", which was often the word "results", but every journal is different, and sometimes other words were used. Regardless of the subheading, I added a unique delimiter (e.g., DDDDD). I did the same for each of the six standard content blocks I wanted across my full corpus.

Next, I converted all of my edited PDF files to .txt files (using different software, outside of JMP).

Then, I used the AWESOME "Import multiple files" tool in JMP to create a data table that had one row per article with each article's full text in a single column.

At this point, all I had to do was create 6 new columns, one for each heading, and use a variation of the formula provided by Thierry_S to collect my desired content chunks. For example, to create a "methods" column, I used the formula

Munger(
	:TEXT,
	Contains( Uppercase( :TEXT ), "CCCCC" ),
	Contains( Uppercase( :TEXT ), "DDDDD" )
	-Contains( Uppercase( :TEXT ), "CCCCC" )
)

Then, to create a "results" column, I used the formula

Munger(
	:TEXT,
	Contains( Uppercase( :TEXT ), "DDDDD" ),
	Contains( Uppercase( :TEXT ), "EEEEE" )
	-Contains( Uppercase( :TEXT ), "DDDDD" )
)
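A small note on these formulas: because Munger() starts at the position of the opening delimiter, each extracted chunk begins with the delimiter itself (e.g., "CCCCC"). Here is a minimal sketch of a variant that skips past the opening tag and trims surrounding whitespace. It assumes the tags were typed in uppercase exactly as shown, so it searches :TEXT directly. The local names tagStart, tagEnd, s, and e are illustrative only.

```jsl
// Column formula sketch for the "methods" column.
Local( {tagStart = "CCCCC", tagEnd = "DDDDD", s, e},
	// Position of the opening tag, or 0 if the tag is absent.
	s = Contains( :TEXT, tagStart );
	// Search for the closing tag only after the opening tag.
	e = If( s > 0, Contains( :TEXT, tagEnd, s + Length( tagStart ) ), 0 );
	If( s > 0 & e > 0,
		// Start just after the opening tag and stop just before the closing tag.
		Trim( Substr( :TEXT, s + Length( tagStart ), e - s - Length( tagStart ) ) ),
		"" // empty when a tag is missing, which flags the row for review
	)
)
```

For the "results" column, the same sketch would use "DDDDD" and "EEEEE".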

In the end, I have a data table for 800 articles where each article has information in the standard columns of abstract, introduction, methods, results, discussion, and literature cited. BINGO, Hallelujah!
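For anyone repeating this on another corpus, the six columns could also be built with a short script instead of six hand-written formulas. This is only a sketch under several assumptions: the current data table has a character column named TEXT, the tags appear in document order, and the last section runs to the end of the text. Only CCCCC, DDDDD, and EEEEE appear in this post; AAAAA, BBBBB, and FFFFF are placeholder tags, and the helper name between is made up for this example.

```jsl
// Sketch only: builds one static column per section from the tagged full text.
dt = Current Data Table();
sections = {"abstract", "introduction", "methods", "results", "discussion", "literature cited"};
// Placeholder tags except CCCCC, DDDDD, and EEEEE, which come from the post.
tags = {"AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFF"};

// Return the text between startTag and endTag.
// An empty endTag means "everything up to the end of the text".
between = Function( {txt, startTag, endTag},
	{s, e, out},
	out = "";
	s = Contains( txt, startTag );
	If( s > 0,
		s = s + Length( startTag ); // skip past the opening tag
		e = If( endTag == "", Length( txt ) + 1, Contains( txt, endTag, s ) );
		If( e > 0, out = Trim( Substr( txt, s, e - s ) ) );
	);
	out;
);

For( i = 1, i <= N Items( sections ), i++,
	col = dt << New Column( sections[i], Character );
	// The next section's tag closes the current section; the last one has no closing tag.
	endTag = If( i < N Items( tags ), tags[i + 1], "" );
	For Each Row( dt, col[Row()] = between( :TEXT, tags[i], endTag ) );
);
```

Unlike the formula columns, these columns hold static values, so the script would need to be rerun if the TEXT column changes.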

While this approach did require the manual step of inserting my delimiters into PDF files, that process only took a day. Given the enormous variation in article formatting, and the absence of consistent "tag" metadata across articles, this was a pretty foolproof way to go. Maybe there is a way to insert these tags automatically, by searching for a list of words that might meet the section criteria (e.g., "methods", "results"), but as mentioned elsewhere in this thread, those are very common words in journal articles that are likely to occur in just about any section. The manual approach allowed me to capture text from all my PDFs, whether they originated as PDF files from journals or as poorly scanned reports that have been subjected to OCR over the past 30 years.

Once again, thank you @Thierry_S for your post. The flexibility of JMP strikes again!

Casey
