Challenge 11

DonMcCormack · Nov 1, 2021 04:06 PM

My motivation for this month’s challenge was to have you think beyond the traditional definition of outliers as only being observations falling far from the center of the data (relative to the amount of data variability). In general, we can consider an outlier as a non-representative data point, something that doesn’t belong with the rest of the data. Here lies the true challenge: How do we define representative? The approach I provide below is my take on the problem.

The algorithm below groups observations into like and dissimilar sets based on distances between the points. Unlike the standard agglomerative approach to clustering, start with one group. Define a cut point as the value halfway between observations furthest apart, provided their distance exceeds three times the difference between the 75^th quantile and median of all non-zero distances (the distance spread). If there are no cut points, stop. Otherwise, replace the existing group with the two new groups. Repeat the above procedure with each new group, using the distance spread from the new group or the entire data set, whichever is larger. Continue until no group contains a cut point.

findOutliers = Function({vals},{Default Local},
		q1 = 3; 		//Spread Multiplier
		q2 = 0.75;  //Quantile to define upper bound of spread
		distSpread = 0;
		nMissing   = 0;
		
		tmpTbl = New Table("tmp",New Column("C1",Values(vals)),Private);
		sumTbl = tmpTbl << Summary(Group(:C1),Private);
		vals   = Column(sumTbl,1) << Get Values;
		freqs  = Column(sumTbl,2) << Get Values;
		If(N Row(Loc(vals)) == 1, Return([]));
		
		Close(tmpTbl,No Save);
		If(Is Missing(vals[1]),
			nMissing = freqs[1];
			freqs = freqs[2::N Row(vals)];
			vals  = vals[2::N Row(vals)];
		);
		vals      = Sort Ascending(vals[Loc Nonmissing(vals)]);
		groups    = Eval List({vals});
		cutPoints = [];

		While(N Items(groups) > 0,
			newGroups = {};
			For Each({nextVals},groups,			
				distances  = nextVals[2::N Row(nextVals)] - nextVals[1::(N Row(nextVals)-1)];
				distSpread = Max(distSpread,Median(distances[Loc(distances>0)]) + q1*(Quantile(q2,distances[Loc(distances>0)]) - Median(distances[Loc(distances>0)])));
				maxDist    = Maximum(distances);
				If(maxDist>distSpread,
					cutID = Loc(distances==maxDist)[1];
					cutPoints |/= (nextVals[cutID+1]+nextVals[cutID])/2;
					If(cutID == 1, 
						Insert Into(newGroups,nextVals[2::N Row(nextVals)]),//ELSE
						If(cutID == N Row(distances),
							Insert Into(newGroups,nextVals[1::cutID]),
						//ELSE
							Insert Into(newGroups,nextVals[1::cutID]);
							Insert Into(newGroups,nextVals[(cutID+1)::N Row(nextVals)]);
						)
					)
				);
			);
			groups = newGroups;
		);
		cutPoints = Sort Ascending(cutPoints);
		cutPoints;
	);

To start, a bit of housekeeping is needed to bypass data with a single value. Based on the results from Challenge 9, I’m using Summary to simultaneously find the unique values in the data and to sort it. Distances are found by offsetting the vector by one and taking their difference. The algorithm proceeds by collecting cut points and throwing out groups with no cut points. This continues until no groups are left.

The algorithm is fairly quick, needing about 1.6 seconds to process all data columns in the Probe sample data Data Table on my machine. At the top of the algorithm, I have added two variables to let you change the 3x multiplier and the 75^th quantile. In the script attached below, Challenge 11 Outlier Screening, I have included some additional window dressing to provide a bit of an interface, allowing column selection and generation of a report window containing summary information and histograms with the cut points drawn as reference lines.

I’ll be taking a bit of a hiatus from the JMP Challenge to work on another scripting project. I’ll be putting together written instructions and videos for small projects to help new and intermediate users improve their JSL skills. I’ll be focusing on common operations like putting tables together, accessing data from multiple sources, building custom reports, and so on. I’ll also try to work in some more advanced stuff, like expression handling and pattern matching. If there’s anything you think would be worthwhile, please let me know by posting them here, or sending me a Community message or email (don.mccormack@jmp.com).