Challenge 5

DonMcCormack · Oct 5, 2020 05:15 PM

My apologies for the delay on this write up. I have been very busy with, among other things, preparations for a talk on Functional Data Explorer at JMP Discovery Summit . It appears the regular contributors were equally preoccupied, as I only received two non-staff submissions for this challenge.

Here are point totals after the 5th Challenge:

Name	Points	Entries
@David_Burnham	65	5
@leelaim	50.5	5
@landon	38	3
@MathStatChem	28	2
@P_Anderson	22.5	2

Speaking of Discovery Summit, my colleague Stan Koprowski is organizing the JMP Scripting Unsession. Follow the link for more details; we’d love to see you there.

In Challenge 5, I asked that you write metafunctions, functions of functions. My motivation was to take functions that can only be applied to a single number or string and extend them so they can be applied to their composite values, vectors and lists. I have found this type of operation occurs somewhat regularly and can often be implemented with a Data Table Formula when a JSL function doesn't already exist. I wanted to have this functionality available outside of the Data Table.

Items 1 and 2 should have been relatively straightforward. When working with vectors, it is helpful to rely on what JSL has already implemented through preexisting JSL functions or matrix operations. The simplest approach to calculating the median absolute deviation (MAD) is the one line of code:

Median(Abs(inVec - Median(inVec) ) )

where inVec is the input vector. This solution depends on knowing that the median can be correctly generated from a built-in JSL function. The code follows directly from the definition of MAD.

Calculating the trimmed mean is a bit trickier and offers insight into the importance of providing a precise function definition (more of this later). A common approach was to calculate the average of the values bounded between the quantiles associated with pctLow and pctHigh, where pctLow and pctHi are the arguments as defined in the Challenge 5 post. For example:

loQ = Quantile(pctLow,inVec);

hiQ = Quantile(1-pctHi,inVec);

Mean(inVec[Loc(loQ < inVec < hiQ)]);

This solution, however, trims no more than pctLow from the low values and no more than pctHi from the high value. Because of this, the total percent trimmed may be less than pcLow + pctHi but never more. This approach is inconsistent with the one found in the Distribution platform which not only trims equally from both tails but, more importantly, guarantees the total percent trimmed is at least the amount requested (and may be more). The following code produces results consistent with this definition:

loRank = Ceiling(pctLo*N Row(inVec));

hiRank = Ceiling((1-pctHi)*N Row(inVec));

goodVals = (loRank < Ranking(inVec) <= hiRank);

goodVals`*inVec/Sum(goodVals);

For item 3, I was looking for a function that would find all the occurrences of a given string in every item in a list of strings. If the given string were “a” and the list of strings

{“calcaneoastragalar”,”fox”,”baby”}

the value returned would be

{[2, 5, 9, 13, 15, 17],.,[2]}

A correctly worded example would have helped in defining the function more precisely. My solution with looping:

listContains = Function({listIn,valueString},{i,outList,nextMat,start},
	outList  = Repeat({.},N Items(listIn));
	For(i=1,i<=N Items(listIn),i++,
		nextMat = [];
		start   = 1;
		While(start = Contains(listIn[i],valueString,start),
			nextMat = nextMat |/ start;
			start+=Length(valueString);
		);
		If(N Row(nextMat) > 0,outList[i] = nextMat,outList[i]=.);
	);
	outList;
);

Which takes between 2 and 3 seconds to process a list of 1 million randomly selected words from the dictionary in Challenge 1.

My impetus for applyFunction was to replicate functionality I had used in other programming languages, where I could apply (or map) a function onto an aggregate data structure (typically a list of strings or a matrix). I wanted to be able to use something like:

Length(listOfStrings)

to get the lengths of each item in the list rather than using a loop or column formula.

After posting the challenge, I realized that there were subtleties I did not consider, such as situations where the function could be applied to the individual items in the input list or the entire input list. Additionally, I said nothing about whether different function arguments could be applied to each list element.

With this in mind, and staying consistent with the comments in the Challenge 5 entry, the implementation of applyFunction I've provided takes four arguments:

listIn – a list of one or more numbers, strings, vectors, or lists of strings.

fn – the quoted name of the function to apply to listIn. The function can be built-in or user supplied.

argList – a list of arguments to fn that appear after listIn or a list of lists of arguments. If a list of lists is supplied, it must have the same cardinality as listIn.

flag - 1 if fn is applied to each element of the list, 0 if it is applied to the entire list.

A simple but general approach would be to build the functions using expressions and loops. In pseudocode:

If flag == 1 then
   If argList is a list of lists then
      For each item in listIn/argList, build an expression for fn
   Else
      For each item in listIn, build an expression for fn using the same items from argList
Else flag == 0
   Build one expression for listIn and argList

The code for this approach can be found in applyFunction (loops).JSL. @David_Burnham offers a recursive approach in challenge_5_burnham.JSL. While elegant (who doesn’t like recursion?) it encounters problems if the recursion goes too deep.

Trying to make applyFunction a loopless function turned out to be quite a challenge. The main problem is not knowing the number of items in argList a priori (or, even more challenging, dealing with situations where the number changes). At first, I was hoping I could accomplish this through crafty use of expression evaluation. For example, if the function only needed one argument, the following would work:

applyFunctionSimple = Function({listIn,fn,argList},{i,fnExpr,outList},
     i = 1;
     outList = {};
     outList[1::N Items(listIn)] = Substitute(
          Expr(func(in,arg)),
          Expr(func),Parse(fn),
          Expr(in),  Expr(listIn[i++]),
          Expr(arg), argList[1]
     );
     Eval List(outList)
);

The magic of Substitute starts by generating a list of length N Items(listIn) with each element of the list being the expression fn(listIn[i++],arg1), where fn and arg1 are given in the function calls and listIn[i++] is literal. When Eval List is applied to outList, however, each i is given a separate sequential value and the list item is evaluated, producing the desired results. I learned this approach to using expressions from @joseph_morgan who, along with @drewfoglia/@EvanMcCorkle have given excellent Discovery tutorials covering this topic. The tutorials can be found here and here.

Unfortunately, I wasn’t able to figure out how to use this approach entirely and had to rely on a technique that I try to avoid because of its inefficiency and inelegance, building a general function via string manipulation then using expression evaluation to get the output. My code is in applyFunction (no loops).JSL. While this code retains much of the flavor of applyFunctionSimple, Regex is used to strip the curly brackets from the expression of argList. Fortunately, @Craige_Hales provides a purely expression-based solution in Craige 5.JSL. Since it was sent to me before later Challenge 5 clarifications, it applies fn to each element in listIn only. In addition, the same set of arguments are used for every function call. Both this solution and mine share the same problem of being slower than our loop based solutions (between 2-5X).

Hopefully, this all makes sense. If not, or if you have additional insights, please post your comments below.

landon · ‎10-08-2020

Thanks @DonMcCormack for the detailed write-up! This challenge reminded me of the great things in MFR: a JSL Functional Programming Library for List/Associative Arrays Manipulation by @nascif_jmp. Hope some of these constructs make their way into base functions in JSL someday.

nascif_jmp · ‎10-09-2020

Thanks @DonMcCormack for the link to MFR. I don't want to spoil any (near) future announcements but yes, it is coming...

ErraticAttack · ‎08-06-2021

Just adding my two cents here as I'm building some functional programming type helper functions.

In my experience it's too easy to get a bit carried away and build an overly complicated solution that is both a bit slow and a bit cumbersome to understand.

Thus, here is a quick and easy implementation for map / filter / reduce that is nearly as fast as builtin methods (either loops or direct builtin calls) and has the added benefit of not requiring the use of the Expr(...) function to utilize them.

map = Function( {inputs/* list, function*/},
	/* uses a single underscore _ as the wild-card */
	{__i__, __result__, __list__, _},
	__result__ = {};
	__list__ = Arg( inputs, 1 );
	Eval(
		Substitute(
			Expr(
				For( __i__ = N Items( __list__ ), __i__ > 0, __i__--,
					Insert Into( __result__, _ = __list__[__i__]; __function__, 1 )
				)
			)
		,
			Expr( __function__ ), Arg( inputs, 2 )
		);
	);
	__result__
);

reduce = Function( {inputs/*list, function, seed = 0*/},
	/* uses a single underscore for current item, 'acc' for current value */
	{__i__, __list__, _, acc = 0},
	If( N Items( inputs ) == 3,
		acc = Arg( inputs, 3 )
	);
	__list__ = Arg( inputs, 1 );
	Eval(
		Substitute(
			Expr(
				For( __i__ = N Items( __list__ ), __i__ > 0, __i__--,
					_ = __list__[__i__];
					acc = __function__
				)
			)
		,
			Expr( __function__ ), Arg( inputs, 2 )
		)
	);
	acc
);

filter = Function( {inputs/*list, filter = _*/},
	/* uses a single underscore _ for current item */
	{__i__, __list__, __result__, _filter_, _},
	__result__ = {};
	If( N Items( inputs ) == 1,
		_filter_ = Expr( _ )
	,
		_filter_ = Arg( inputs, 2 )
	);
	__list__ = Arg( inputs, 1 );
	Eval(
		Substitute(
			Expr(
				For( __i__ = N Items( __list__ ), __i__ > 0, __i__--,
					_ = __list__[__i__];
					If( __filter__,
						__result__[N Items(__result__) + 1] = _
					)
				)
			)
		,
			Expr( __filter__ ), Name Expr( _filter_ )
		);
	);
	Reverse( __result__ )
);

filter({{1, 0, 2, Empty(), 3, .}});
lst = repeat( {{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}}, 10000 );
start = HP Time();
a = filter({lst, _[1] < 3} );
end = HP Time();
Write( "\!NMy Method: ", (end - start) / 1000000 );

reduce( {{"a", "b", "c"}, Uppercase( _ || acc ), seed = ""} );

list = Repeat( {Random Integer( 1, 100000000 )}, 50000 );

start = HP Time();
Show( reduce( {list, If( _ > acc, _, acc)} ) );
end = HP Time();
Write( "\!NMy Method: ", (end - start) / 1000000 );
start = HP Time();
Show( Max( list ) );
end = HP Time();
Write( "\!NBuiltin Method: ", (end - start) / 1000000 );

start = HP Time();
Show( reduce( {list, acc + _} ) );
end = HP Time();
Write( "\!NMy Method: ", (end - start) / 1000000 );
start = HP Time();
Show( Sum( list ) );
end = HP Time();
Write( "\!NBuiltin Method: ", (end - start) / 1000000 );


strings = Repeat( {"abcdefg"}, 1e5 ); // about a second
start = Tick Seconds();
substrings1 = map( {strings, Substr( _, 1 + 2, 1 )} );
stop = Tick Seconds();
Write( "\!n applyfunction time: ", stop - start );

start = Tick Seconds();
substrings2 = {};
For( i = N Items( strings ), i > 0, i--,
	Insert Into( substrings2, Substr( strings[i], 3, 1 ), 1 )
);
stop = Tick Seconds();
Write( "\!n JSL loop time: ", stop - start );

If( substrings1 == substrings2,
	Write( "\!nOK" ),
	Write( "\!noops" )
);

When I run this on my (rather slow) laptop, I get the following output:

/*My Method: 0.152577
reduce({list, If(_ > acc, _, acc)}) = 99994198;
My Method: 0.056897
Max(list) = 99994198;
Builtin Method: 0.024798
reduce({list, acc + _}) = 2495626010963;
My Method: 0.04259
Sum(list) = 2495626010963;
Builtin Method: 0.007765
 applyfunction time: 0.166666666656965
 JSL loop time: 0.0500000000029104
OK*/

Note that finding the max number in a bit list (50,000 entries) using the reduce function only took ~2x as long as the builtin Max() call, whereas the sum of that same list was ~6x slower. These are pretty good numbers given the added flexibility of the Map / Reduce / Filter functions.

DonMcCormack · ‎08-06-2021

Thanks @ErraticAttack. I like your approach!

Craige_Hales · ‎08-06-2021

@ErraticAttack - nice!