Challenge 10 Results

Item 1

The challenging part of Item 1 of Challenge 10 is not the code but the intricacies around the definition of quantile. First, let's keep things distribution independent and work with the rank order of the values rather than the values directly. That is, we sort the values in ascending order and number them according to where they appear in that order. We'll need to decide what to do with ties. We could continue ranking them sequentially or give them the same rank. The drawback to the former is that the same value may appear in more than one group. The drawback to the latter is that the number of items per group may be less balanced. These two approaches correspond to the Ranking and Ranking Tie JSL functions. We haven't yet said what rank to give the ties in the second approach. We could give them all the next sequential rank, or we could average the sequentially ranked values. In either case, we probably want to give the next non-tied value the rank it would have received had there been no ties. Ranking Tie averages the ranks, and for the sake of simplicity that is what we will use.
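As a minimal illustration (the vector and values here are just an example), consider what the two functions return for a vector containing one tie:

vals = [10, 20, 20, 30];
Show( Ranking( vals ) );     // sequential: ties get distinct ranks, [1, 2, 3, 4]
Show( Ranking Tie( vals ) ); // tied ranks are averaged: [1, 2.5, 2.5, 4]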

The next question is how to divide the groups. We could try to divide them up as equally as possible. A simple way to do this would be to rank ties sequentially, putting at least Floor(N/g) items in each group (where N is the total number of items and g is the number of quantile groups). We'll still have to decide where to put the N - g*Floor(N/g) (< g) leftover values. The code below takes care of all of this simply and neatly:

Ceiling(Ranking(inVec)/(N Row(inVec)/quant))

where inVec is the input (column) vector and quant the number of quantile groups.

The code above may lead to tied values being split across groups. To avoid this, we can replace Ranking with Ranking Tie. An example of the full function is below:

quantileGroup = Function( {inVec, quant = 10, method = 1}, {outVec},
	// Flatten the input into a single column vector.
	inVec = Shape( inVec, N Row( inVec ) * N Col( inVec ), 1 );
	If( method == 1,
		// Sequential ranking: group sizes are as even as possible, but ties can split across groups.
		outVec = Ceiling( Ranking( inVec ) / (N Row( inVec ) / quant) ),
		// Tied ranking: ties stay together, at the cost of less balanced group sizes.
		outVec = Ceiling( Ranking Tie( inVec ) / (N Row( inVec ) / quant) )
	);
	outVec;
);
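For example, assigning ten values to quartile groups (quant = 4) using the default sequential ranking:

vals      = [5, 12, 7, 3, 9, 15, 1, 8, 11, 4];
grpAssign = quantileGroup( vals, 4 );  // a 10 x 1 vector of group numbers 1 through 4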

Item 2

There are a number of different ways we can aggregate low-count groups: by absolute number of observations, by relative number of observations (e.g., groups with counts less than 10% of the largest group), by fixing the total number of groups, etc. Each has its merits and shortcomings. To provide a flexible solution, I chose an approach that was both interactive and visual, managed with a bar chart that is adjusted using a slider. This way, I could move the slider back and forth until I found a distribution of groups that looked good. The difficulty was coming up with code that was fast when the number of groups was large. To keep things simple, I wanted to work with Data Table operations and platforms (with Automatic Recalc turned on). Below is an example of what this might look like before and after:

Before

[Image: bar chart of group counts before aggregation (DonMcCormack_0-1630422398451.png)]

After

[Image: bar chart of group counts after aggregation (DonMcCormack_1-1630422398461.png)]

Generally speaking, the code works by building a summary table from the groups and using the group counts (the N Rows column) as a Freq variable. This circumvents the need to update values in the original table, which may be considerably larger. The code can be found in Aggregate Groups.JSL. A list of the original group values from the summary table is stored in the global variable groups, along with a vector of the rank order of the counts (ranks). Moving the slider runs the aggregateLowCounts function, which is passed the requested number of groups (nGroups) based on the position of the slider. The groups to keep can easily be selected using the ranks vector. For example, if 10 total groups are desired, the value for any original group with a rank of 10 or greater is set to the aggregated group value (e.g., "Other"), with all other groups remaining the same. Because the table is sorted, this merely requires changing values from row 10 onward, as in the sketch below.
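The actual implementation is in Aggregate Groups.JSL; the fragment below is only an illustrative sketch of that one step, and the table reference and column name (sumTblRef, "Group") are assumptions, not necessarily the names used in the script.

// Hypothetical sketch: sumTblRef is assumed to be the summary table, sorted by
// descending count, with a character column "Group" feeding the bar chart.
aggregateLowCounts = Function( {nGroups},
	// Rows above the cutoff keep their original group values.
	For( i = 1, i < nGroups, i++,
		Column( sumTblRef, "Group" )[i] = groups[i]
	);
	// Everything from row nGroups onward is lumped into the aggregate group.
	For( i = nGroups, i <= N Rows( sumTblRef ), i++,
		Column( sumTblRef, "Group" )[i] = "Other"
	);
);

With Automatic Recalc turned on, the platform redraws the chart as soon as the column values change.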

Item 3 / Item 4

My motivation for Item 3 was as a precursor to Item 4. Finding something in an Associative Array (AA) is worlds faster than using Loc. Building an AA with the 209,405 words from Word List.JMP took a bit under 12 seconds on my computer. The table below shows the time needed to find a list of random words from the table using either Loc or an AA.

N Words    Loc (sec)    Associative Array (sec)
      1         0.19                    0.000045
      5         0.91                    0.000127
     10         1.78                    0.000244
     25         4.51                    0.000592
     50         9.11                    0.001019
     70        12.47                    0.001454
    100        17.82                    0.002155

For lists longer than 70 words, it pays to build the AA first. The code I used to test these two approaches is in Loc vs AA.JSL; a rough sketch of the idea follows.
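The benchmark itself is in Loc vs AA.JSL; the sketch below only shows the shape of the comparison, with the column name ("Word") and the search list as assumptions:

dt          = Open( "Word List.jmp" );
words       = Column( dt, "Word" ) << Get Values;     // assumes the words live in a "Word" column
searchWords = {"quantile", "pattern", "dictionary"};  // hypothetical search list

// Loc approach: scan the entire list once per search word.
start   = HP Time();
For Each( {w}, searchWords, Loc( words, w ) );
locSecs = (HP Time() - start) / 1e6;

// AA approach: pay the build cost once, then each lookup is nearly free.
aa      = Associative Array( words );
start   = HP Time();
For Each( {w}, searchWords, aa << Contains( w ) );
aaSecs  = (HP Time() - start) / 1e6;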

There are advantages and disadvantages to using AAs as searchable data structures. On the plus side, you can associate any value with an AA key, including other AAs. The drawbacks are that keys must be unique (no big deal in my opinion) and, more importantly, that the search process must be augmented if search values are not in the list of keys. That is to say, implementation is relatively easy when all possible search values are known a priori, but a bit trickier when they aren't. This doesn't mean we can't use them; we just need to be a bit craftier. Something like the code below would work for the case mentioned in the post tagged in Item 3. It is a bit more cumbersome and likely slower in some cases, but extremely flexible: it can easily be adapted to situations where the lookup dictionary changes, simply by modifying the contents of the Associative Array.

aaMap = Associative Array(
	{"1","2","3","4","5","6","7","8","9","0","M","K"},
	{"1","2","3","4","5","6","7","8","9","0","e6","e3"}
);

findAndReplace = Function( {inList, aaMap},
	outNums = Transform Each( {nextString}, inList, Output( "Matrix" ),
		// Split the string into individual characters ...
		nextChars = Items( nextString, "" );
		// ... map each character through the lookup, reassemble, and convert to a number.
		Num( Concat Items( Transform Each( {nextChar}, nextChars, aaMap[nextChar] ), "" ) );
	);
	outNums;
);
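For example, using the aaMap above:

findAndReplace( {"3M", "25K", "417"}, aaMap );  // -> [3000000, 25000, 417]

If a character outside the key list might appear, one option is to give the AA a default value (e.g., aaMap << Set Default Value( "" )) so unknown characters are dropped rather than raising an error.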

What if I wanted to tackle something a bit more general and bigger, like building a dictionary of all the words that appear in the English translation of Les Misérables? Let's start by assuming a word is made up of only alphabetical characters, possibly including letters outside the standard ASCII character set (e.g., those with diacritics). Everything else is a delimiter. The challenge here is that we're not sure what delimiting characters we might encounter. Since we know what characters make up words, we can assume anything not in that list is a delimiter.

Pattern matching has the perfect tools for finding the words we'll use to build the dictionary: Pat Span and Pat Break. We could use Words, but then we would have to check each item for embedded delimiters. Using pattern matching lets us do everything in one step. The code is below. The entire process took between 8 and 9 seconds. To apply the code to a different text, some fine-tuning may be in order, such as including additional characters or saving the file with a different encoding (I used UTF-16 to capture extended ASCII characters). I am indebted to @Craige_Hales for helping me with this after I struggled to get it to work. The text file for Les Misérables was downloaded from Project Gutenberg.

bigText  = Load Text File( "Les Misérables.txt" );
aaLookup = Associative Array();
letters  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZé´É";

Pat Match( bigText,
	Pat Repeat(
		(
			// Either match a whole word ...
			Pat Pos() >> nextLoc                // remember where the word starts
			+ Pat Span( letters ) >> nextWord   // capture the run of word characters
			+ Pat Test(
				// Append the position if the word is already a key;
				// otherwise insert it as a new key with a one-item list.
				Try(
					Insert Into( aaLookup[Lowercase( nextWord )], nextLoc ),
					aaLookup << Insert( Lowercase( nextWord ), Eval List( {nextLoc} ) )
				)
			)
			// ... or skip over a run of delimiter characters.
			| Pat Break( letters )
		)
		+ Pat Fence()  // commit to what has matched so far; no backtracking
	)
);
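Once the match finishes, each key of aaLookup is a lowercase word and its value is the list of character positions where that word occurs, so lookups are immediate. For example:

N Items( aaLookup );   // number of distinct words found
aaLookup["monsieur"];  // every position at which "monsieur" appears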

 
