cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Check out the JMP® Marketplace featured Capability Explorer add-in
Choose Language Hide Translation Bar

How do I identify numbers in a paragraph and extract as column in new tabel?

I have a text paragraph containing several numbers in different sentences and I would like to extract those numbers and get them stacked in a column in a new tabel.

The sentence looks like this: 

 

The bags are weighed on a mobile scale at storage location. Each 5. bag is spear sampled. The primary sample of A: 10 kg, B: 20 kg, C: 30 kg, 40 kg and E: 50 kg was dried for 24 hours at 105 C. The sample was subject to repeated splitting, leaving some 800-1000 grams which was further milled to pass a 100-mesh test sieve. The crushed sample is blended in a V-blender for 20 minutes and transferred to 100 ml waterproof plastic bottle and sealed.

 

In addition the number of different primary samples can vary from 1 up to 20 or 30.

Very thankful if somebody could point me in the right direction with a script. Tried to write it with the help of Copilot, but did not get the result I wanted.

Thanks a heap!

 

15 REPLIES 15
txnelson
Super User

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

Here is one way to do this:

names default to here(1);
paragraph="The bags are weighed on a mobile scale at storage location. Each 5. bag is spear sampled. 
The primary sample of A: 10 kg, B: 20 kg, C: 30 kg, 40 kg and E: 50 kg was dried for 24 hours at 105 C. 
The sample was subject to repeated splitting, leaving some 800-1000 grams which was further milled to 
pass a 100-mesh test sieve. The crushed sample is blended in a V-blender for 20 minutes and transferred 
to 100 ml waterproof plastic bottle and sealed.";

theNumbers = {};
i = 1;
While( Word( i, paragraph, " ." ) != "",
	theWord = Word( i, paragraph, " ." );
	If( Is Missing( Num( theWord ) ) == 0,
		Insert Into( theNumbers, Num( theWord ) )
	);
	i++;
);
Show( theNumbers );

I am sure there is someone who will also provide a RegEx() solution.

Jim
Ressel
Level VI

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

I tried expanding on your suggestion, but couldn't get it to work. Unfortunately, my findings are confusing me, so I hoped it was possible to get some more input, please.

 

The below snippet will find "10 kg". So far so good.

Names Default To Here( 1 );
text = "The bags are weighed on a mobile scale at storage location. Each 5. bag is spear sampled. 
The primary sample of A: 10 kg, B: 20 kg, C: 30 kg, 40 kg and E: 50 kg was dried for 24 hours at 105 C.
The sample was subject to repeated splitting, leaving some 800-1000 grams which was further milled to
pass a 100-mesh test sieve. The crushed sample is blended in a V-blender for 20 minutes and transferred
to 100 ml waterproof plastic bottle and sealed.";
// Use a regular expression to find all the numbers, including decimals matches = Regex( text, "\d+([.,]\d+)?\s*kg" ); Show( matches );

Let's try scripting this into a loop to find all weights associated with "kg" in the string. I can find all individual words, and I can combine them into pairs. Within these pairs, I thought it should be smooth sailing to find the pairs like "10 kg", "20 kg", "30 kg" and so on, but the code is stubbornly refusing to find any matches and put them into the matches={} list. 

I've tried moving the regexPattern inside the If function, and I've tried wrapping "combined" as well as the "regexPattern" variable in Eval(). 

Names Default To Here( 1 );
// the text text = "The bags are weighed on a mobile scale at storage location. Each 5. bag is spear sampled.
The primary sample of A: 10 kg, B: 20 kg, C: 30 kg, 40 kg and E: 50 kg was dried for 24 hours at 105 C.
The sample was subject to repeated splitting, leaving some 800-1000 grams which was further milled to
pass a 100-mesh test sieve. The crushed sample is blended in a V-blender for 20 minutes and transferred
to 100 ml waterproof plastic bottle and sealed.";
// break text into words and print each of them. works as expected words = Words( text ); For( i = 1, i <= N Items( words ), i++, Print( words[i] ) ); // regex pattern variable regexPattern = "\d+([.,]\d+)?\s*kg"; // empty list to store matches matches = {}; // loop over all words in text to create and print pairs of words (we're looking for "10 kg", "20 kg" and so on). works fine For( i = 1, i <= N Items( words ) - 1, i++, combined = words[i] || " " || words[i + 1]; Print( "Checking: " || combined ); );


// THIS PART DOESN'T WORK // loop once more over all words from text to create pairs of words and look for the regexPattern For( i = 1, i <= N Items( words ) - 1, i++, combined = words[i] || " " || words[i + 1]; // check if the current combined string matches the pattern (e.g., "10 kg") If( Regex Match( combined, regexPattern ), Insert Into( matches, combined ); Print( "Match found: " || combined ); , // else Print( "No match for: " || combined ); ); );

What am I overlooking?

mmarchandFSLR
Level IV

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

I haven't gone over everything, but just from a cursory look at the script, "." is a wildcard, so you'd need to escape it.

Ressel
Level VI

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

Thanks, but that's what I wanted to demonstrate with the first snippet: It does correctly identify the first match to "a number + kg" (i.e., "10 kg") in the example. If it works once, shouldn't it work for all pairs of tokens/words?

mmarchandFSLR
Level IV

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

Oh, yes, sorry.  I should wake up and have my coffee before posting here.  I ran into this type of thing before and put up an answer (while I was at my last job) here.

Ressel
Level VI

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

Haha, that's nothing to worry about! Always good to have friendly input

mmarchandFSLR
Level IV

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

Good case for supporting hogi's wishlist item.

 

Regex add option for all flags

txnelson
Super User

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

I don't know if what you are trying to do is to exercise RegEx, or if you are trying to actually find a solution of finding all references to kg values.  If the latter is the case, the below JSL is a very efficient solution

Names Default To Here( 1 );// the text
text = "The bags are weighed on a mobile scale at storage location. Each 5. bag is spear sampled. The primary sample of A: 10 kg, B: 20 kg, C: 30 kg, 40 kg and E: 50 kg was dried for 24 hours at 105 C. The sample was subject to repeated splitting, leaving some 800-1000 grams which was further milled to pass a 100-mesh test sieve. The crushed sample is blended in a V-blender for 20 minutes and transferred to 100 ml waterproof plastic bottle and sealed.";
// break text into words and print each of them. works as expected
words = Words( text );

// regex pattern variable
regexPattern = "\d+([.,]\d+)?\s*kg"; 

// empty list to store matches
matches = {};

potential = loc(words,"kg");

If( length(potential) > 0,
	for each( {kg}, potential,
		if( isMissing(words[kg-1])==0,
			insert into(matches, words[kg -1] || " kg")
		)
	)
);
Jim
Ressel
Level VI

Re: How do I identify numbers in a paragraph and extract as column in new tabel?

Actually, both. Thank you for your help and patience. I've never used "loc()", so that's good learning.

 

Copy pasting your entire code block and appending "Print( matches)" gives {"40 kg", "50 kg"}. This is consistent with "potential = Loc( words, "kg" );" only finding "[32, 36]". Interesting.

 

  1. Why is your solution not finding "10 kg", "20 kg" and "30 kg"?
  2. Why is my solution not finding anything?