caseylott
Level III

Can I use JMP to parse full-text journal articles into separate columns for methods, results, discussion?

Hello all,

 

I am interested in doing some text analyses on how keywords/phrases vary among thematic sections in scientific journal articles. I am wondering if it is possible to start with a pdf and then generate separate columns for the text chunks contained in each of the identifiable sections of the article's structure. For example, could I create 8 different columns with text chunks from the title, abstract, keywords, introduction, methods, results, discussion, and literature cited sections? 

 

My starting point for this could be either: a) a folder of pdfs or b) a JMP table that I've already created, where each row is an article and one column contains all of the text from that article. Any suggestions? Note: I have only JMP, not JMP Pro, so I need to figure out how to do this within that constraint.

 

Thanks to anyone who can provide insight.   

1 ACCEPTED SOLUTION


Re: Can I use JMP to parse full-text journal articles into separate columns for methods, results, discussion? JMP TABLE ATTACHED

I do not think manually processing so many documents is feasible!

 

Don't give up so soon. The Community loves a challenge!

Learn it once, use it forever!


6 REPLIES

Re: Can I use JMP to parse full-text journal articles into separate columns for methods, results, discussion?

Yes, with the Multiple File Import command and the Text Explorer.

 

The formation of the term list is based on Basic Words (easy but simple) or Regular Expressions (more difficult but more specific). The regex method would be better for finding specific information from the document. You can customize the regex that is applied to the documents. You can select phrases and add them to the term list. You can then save indicator or frequency columns back to the original data table.

 

This data table could also be made with table commands and column formulas that use regex, JMP text patterns, and other character string functions.
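To illustrate the regex idea, here is a minimal sketch written in Python rather than JSL, so treat it as a picture of the pattern and not as JMP code. The helper name `text_between` and the sample text are made up for the example. It captures whatever lies between two heading words with a lazy match, which is the kind of expression a column formula would need for each section.

```python
import re

def text_between(text, start, end):
    """Return the text after the first `start` marker up to the next `end` marker, or ""."""
    # (.*?) is lazy, so the capture stops at the first `end` that follows `start`.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else ""

# Made-up sample document, one long string like a single table cell.
doc = "Title here. Abstract We studied birds. Introduction Birds are common. Methods We counted."

abstract = text_between(doc, "Abstract", "Introduction")
intro = text_between(doc, "Introduction", "Methods")
print(abstract)
print(intro)
```

A missing heading simply yields an empty string, so every article would still get the same set of columns.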

Learn it once, use it forever!
caseylott
Level III

Re: Can I use JMP to parse full-text journal articles into separate columns for methods, results, discussion?

Hi @markbailey and thanks for the reply. I understood part of your response, but maybe not all of it. Here is what I know how to do already: 1) import a bunch of pdfs into a JMP table so that each article is one row and all of the article's text is in a single column; 2) explore this corpus using standard Text Explorer functions (add phrases to term lists, create word clouds, create document term matrices, etc.).

 

What I don't know how to do is to separate each document's text from this one column into multiple columns, called, for example, Introduction (with just the text from an article's introduction), Abstract (with just the text from the article's abstract), and so on. I was just experimenting with a potential way to do this using Text to Columns, where I find the word "Introduction" and replace it with "Introduction" followed by a wacky character that doesn't occur anywhere in the corpus. Then, I split text to columns using the wacky character. This would work great if the word "introduction" or "result" only occurred once in each document. When it occurs more than once, I get misaligned columns across articles. The only way I can think of to fix this is to evaluate each case of the word "Introduction" during the find and replace operation, inserting the delimiter character only when the word is being used as a true section break. Since I am an ecologist and pretty new to text analysis, this is pretty typical of the sorts of inefficient approaches I've been jury-rigging for problems like this.
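One way to picture "only when the word is a true section break" is to accept a heading only when it sits alone on its own line. The sketch below is in Python, not JSL, and it assumes the imported text still has its line breaks, which may not be true after pdf import. The heading list and the function name `split_on_heading_lines` are hypothetical.

```python
import re

# Hypothetical heading names; real articles will need more variants.
HEADINGS = ["Abstract", "Introduction", "Methods", "Results", "Discussion"]

# Match a heading only when it is the sole content of a line.
HEADING_RE = re.compile(
    r"^[ \t]*(" + "|".join(HEADINGS) + r")[ \t]*$",
    flags=re.MULTILINE | re.IGNORECASE,
)

def split_on_heading_lines(text):
    """Map each heading found on its own line to the text that follows it."""
    sections = {}
    matches = list(HEADING_RE.finditer(text))
    for i, m in enumerate(matches):
        name = m.group(1).capitalize()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Keep the first occurrence of each heading; ignore repeats.
        sections.setdefault(name, text[m.end():end].strip())
    return sections

# Made-up sample: "introduction" and "result" also appear inside sentences.
sample = (
    "Abstract\nWe tested a method.\n"
    "Introduction\nThe introduction of new species matters.\n"
    "Results\nEach result is shown in Table 1.\n"
)
sections = split_on_heading_lines(sample)
print(sections)
```

The words inside sentences are skipped because they share a line with other text, so only the three real headings act as delimiters.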

 

Would you mind giving an example of how one might use any other table command or regex to parse the single column with all text into multiple columns with text for different content sections? Note: I am a true beginner with regex, so assume I have next to no knowledge on this topic and I'll start reading up.

 

Thanks again. Casey  

caseylott
Level III

Re: Can I use JMP to parse full-text journal articles into separate columns for methods, results, discussion? JMP TABLE ATTACHED

Hi all,

 

I've attached a JMP table to this post that illustrates where I am stuck at this point. I'd like to use the text in the "Lowercase and collapsed whitespace Text" column to create the text chunks in the empty columns.

 

I understand that this process will probably be challenging and that I'll have to deal with a lot of special cases (e.g., papers where section headings differ, such as "introduction" versus "background"). I started looking at character and character pattern formulas, which seem pretty promising, and got overwhelmed pretty quickly. My background is just with analyzing numbers, not text, so I'm still catching up. If anyone could point me towards a few character formulas that might be particularly useful here, I'd be grateful.
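For the lowercase, collapsed-whitespace case, one possible heuristic is to look for the sections in their expected order, accepting any of several alias headings and starting each search where the previous heading ended. This is a rough Python sketch, not JSL, and the alias lists and the name `split_in_order` are assumptions for illustration. It will still be fooled when a heading word is used as an ordinary word before the real heading appears.

```python
import re

# Hypothetical alias lists: column name -> heading words that may open that section.
SECTION_ALIASES = [
    ("introduction", ["introduction", "background"]),
    ("methods", ["materials and methods", "methods", "study area"]),
    ("results", ["results"]),
    ("discussion", ["discussion", "conclusions"]),
]

def split_in_order(text):
    """Find each section's first alias after the previous section's heading."""
    starts = []  # (column name, heading start, body start)
    pos = 0
    for name, aliases in SECTION_ALIASES:
        pattern = r"\b(?:" + "|".join(re.escape(a) for a in aliases) + r")\b"
        m = re.compile(pattern).search(text, pos)
        if m:
            starts.append((name, m.start(), m.end()))
            pos = m.end()
    # Every column exists even when a section is missing, so rows stay aligned.
    columns = {name: "" for name, _ in SECTION_ALIASES}
    for i, (name, _, body_start) in enumerate(starts):
        body_end = starts[i + 1][1] if i + 1 < len(starts) else len(text)
        columns[name] = text[body_start:body_end].strip()
    return columns

# Made-up sample in the same lowercase, single-line form as the attached column.
doc = (
    "a title abstract we surveyed plots. background plots vary widely. "
    "methods we sampled plots. results counts rose. "
    "discussion the results suggest growth."
)
cols = split_in_order(doc)
print(cols)
```

Here the second "results" falls inside the discussion and is not treated as a break, because the search for each heading only moves forward.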


Re: Can I use JMP to parse full-text journal articles into separate columns for methods, results, discussion? JMP TABLE ATTACHED

I think that this project is not going to be easy. It might not be appropriate for a discussion here due to the length and complexity of the problem. But let's see what we can do.

 

My experience with text analysis has always been with a corpus from a single population. For example, I used narratives about patient admission to emergency departments in 100 hospitals to write a course based on FDE two years ago. That case presented myriad challenges, but at least there was enough consistency across the documents that a set of regexes could accomplish the goals of that text analysis. You do not really want text analysis (i.e., a document term matrix) at this point. You want to parse the documents into their sections.

 

I think that a solution will require a great deal of domain expertise. I randomly picked several documents and could not find anything in just a handful of documents that I might use as the basis for parsing the sections that you want. There seem to be many conventions at work and many cases that don't seem to follow any convention at all.

 

Regex might not be the right level of processing in this case. JMP Patterns are very powerful. I understand them at a basic level but I am not an expert. Perhaps others in the Community have the experience necessary to take it apart.

Learn it once, use it forever!
caseylott
Level III

Re: Can I use JMP to parse full-text journal articles into separate columns for methods, results, discussion? JMP TABLE ATTACHED

Thanks, @markbailey. I think you are right. It would probably be a difficult project even with a bunch of domain experience, which I don't have. In my case, I have about 5,700 pdfs. Monkeying my way through these in a brute-force, mostly manual way is probably the most feasible alternative for me right now. Thank you for taking the time to respond.
