Challenge 10

One thing I forgot to mention in my previous post is that the matrix lookup method adapts poorly to character data. The code in Set Operations using matrices.JSL contains the function mapString, which converts a character string composed of lower-case letters to a numeric value. This mapping slows things down considerably. In a related post, @Adi-B asked about performing a find and replace on a data table column. Again, we are dealing with a mapping from one string to something else (in this case, another string).

This leads to this month’s challenge: What is the fastest way to map one set of values to a new set of values?

  1. For continuous data, it often makes sense to group the values into quantiles, with deciles (i.e., 10 groups) perhaps the most common choice. Write a function, quantileGroup(inVec,quant), that takes a vector and a positive integer greater than one and returns the quantile group to which each input value belongs. For example, if

inVec = [1,2,3,4,5,6,7,8,9] and quant = 3

the function would return [1,1,1,2,2,2,3,3,3].

You will have to decide how to deal with values that straddle a quantile group boundary, particularly if there are duplicates.
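As a starting point, here is one possible reading of the quantileGroup specification, sketched in Python rather than JSL purely to illustrate the logic. The name quantile_group and the tie rule are assumptions, not part of the challenge: each value is grouped by the number of values strictly below it, so duplicates always land in the same group (the lowest one their run touches).

```python
from bisect import bisect_left


def quantile_group(in_vec, quant):
    """Return the 1-based quantile group of each value in in_vec.

    Assumed tie rule: a value's group depends only on how many values
    are strictly smaller, so duplicates never straddle a boundary.
    """
    if not isinstance(quant, int) or quant < 2:
        raise ValueError("quant must be an integer greater than one")

    ordered = sorted(in_vec)
    n = len(ordered)
    groups = []
    for value in in_vec:
        # Count of values strictly below this one (same for all duplicates).
        below = bisect_left(ordered, value)
        # Integer arithmetic avoids floating-point trouble at group boundaries.
        groups.append(below * quant // n + 1)
    return groups


print(quantile_group([1, 2, 3, 4, 5, 6, 7, 8, 9], 3))
```

With this rule the example above yields [1,1,1,2,2,2,3,3,3]. Other tie rules (for instance, assigning duplicates to the highest group they touch) are equally defensible and will produce different group sizes when the data contain many repeated values.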

  2. When dealing with many-leveled categorical data, it frequently makes sense to aggregate levels with low counts into a single level. Create a function, aggregateLowCounts(inList), that takes an input list of categorical values and returns a list that maps the values with high counts to themselves and the values with low counts into a single new group. How to determine the cutoff value for low counts is part of the challenge.
  3. Referring to the post mentioned above, create a function, findAndReplace(inList,aaMap), that takes a list of strings and, for each item, maps one substring to a new substring using an associative array supplied as the second argument. This mapping may be one-to-one or many-to-one.
  4. (Extra credit) Can searching a list of character strings be made faster than Loc? The purpose of mapString was to convert a string to a number so that Loc Sorted, which uses a binary search, could be used. Loc Sorted is considerably faster than Loc. Try to create a function, locSortedString(stringList,inString), that finds the position of the string inString in the sorted list of strings stringList. It’s hard to imagine creating something faster than Loc without a priori knowledge of the possible string values that may occur. With this in mind, assume all strings will come from the word list we used in Challenge 1. It is attached below.
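To make the three mapping tasks above concrete, here is a minimal sketch in Python (not JSL, and not tuned for speed). Several details are assumptions for illustration only: the min_count and other_label parameters, the single-pass replacement rule, and the convention that a missing string returns 0 with positions counted from 1.

```python
import re
from bisect import bisect_left
from collections import Counter


def aggregate_low_counts(in_list, min_count=2, other_label="Other"):
    """Map levels seen fewer than min_count times to other_label.

    min_count and other_label are hypothetical parameters; choosing the
    cutoff is left open by the challenge.
    """
    counts = Counter(in_list)
    return [v if counts[v] >= min_count else other_label for v in in_list]


def find_and_replace(in_list, aa_map):
    """Replace every substring that is a key of aa_map with its value.

    All keys are matched in a single pass, so a replacement is never
    itself replaced (a -> b, b -> a swaps rather than chains).
    """
    keys = [k for k in aa_map if k]  # ignore empty keys
    if not keys:
        return list(in_list)
    # Longest keys first so that overlapping keys prefer the longer match.
    keys.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return [pattern.sub(lambda m: aa_map[m.group(0)], s) for s in in_list]


def loc_sorted_string(string_list, in_string):
    """Binary search: 1-based position of in_string, or 0 if absent.

    string_list must already be sorted in Python's default string order.
    """
    i = bisect_left(string_list, in_string)
    if i < len(string_list) and string_list[i] == in_string:
        return i + 1
    return 0


print(aggregate_low_counts(["a", "a", "b", "c", "c", "c"]))
print(find_and_replace(["cat", "concat", "dog"], {"cat": "dog", "dog": "cat"}))
print(loc_sorted_string(["apple", "banana", "cherry"], "banana"))
```

The binary search relies on the list being sorted under the same comparison used during the search; a list sorted with a different collation (for example, case-insensitive) would need a matching comparison.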

Good luck!

Last Modified: Dec 21, 2023 1:32 PM