remove almost empty categorical columns

FR60

Occasional Contributor

Joined:

Jan 24, 2017

Can someone help me?

From a database, I need to remove categorical columns in which a certain percentage of the values are missing. At the same time, I would like to remove numeric columns in which a certain percentage of the values are stagnant (not necessarily 100%).

 

Thanks.  Felice  

9 REPLIES
txnelson

Super User

Joined:

Jun 22, 2012

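Here is a quick and dirty script that gets rid of character columns with less than a specified percent of data

Names Default To Here( 1 );
dt = Current Data Table();
percent = 90;
// Character columns only, returned as a list of name strings
colList = dt << get column names( character, string );
NumRows = N Rows( dt );
// Loop backwards and look each column up by name, not by its position in the table
For( i = N Items( colList ), i >= 1, i--,
	// Percent of non-missing rows in this column
	If( (NumRows - Col N Missing( Column( dt, colList[i] ) )) / NumRows * 100 < percent,
		dt << delete columns( colList[i] )
	)
);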
Jim
msharp

Super User

Joined:

Jul 28, 2015

Another method, I believe slightly more robust.

 

Names Default To Here( 1 );
dt = Current Data Table();
percent = 90;
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
	col = column(dt, i);
	nmissing = col N missing(col);
	if(nmissing > numRows * percent/100,
		insert into(delcols, i);
	);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));

 

msharp

Super User

Joined:

Jul 28, 2015

I think you are also looking for something like this, which will delete a numeric column if all of its data is a single value (or entirely missing):

Names Default To Here( 1 );
dt = Current Data Table();
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
	col = column(dt, i);
	try(summarize(dt, max = max(col), min = min(col));
		if((max == min)  | (ismissing(max)),
			insertinto(delcols, i);
		)
	);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));
mpl34

Community Trekker

Joined:

Feb 16, 2016

You might be able to use the method you posted before, but with the mode of the column, to handle something less than 100%:

Names Default To Here( 1 );
dt = Current Data Table();
percent = 90;
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
	if(nrows(loc(aslist(column(dt,i)<<getasmatrix),mode(column(dt,i)<<getasmatrix))) > numRows * percent/100,
		insert into(delcols, i);
	);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));  

 

FR60

Occasional Contributor

Joined:

Jan 24, 2017

Hi

I tried your script on the second table below, and the result is shown here.

As you can see, it removed almost all of the desired columns, but not the second one (col2). How can I remove it too?

Thanks  Felice

 

[Attached images: Picture2.png, Picture1.png]

msharp

Super User

Joined:

Jul 28, 2015

Just combine the strategies.  Something like:

Names Default To Here( 1 );
dt = Current Data Table();
percent = 90;
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
	if(nrows(loc(aslist(column(dt,i)<<getasmatrix),mode(column(dt,i)<<getasmatrix))) > numRows * percent/100,
		insert into(delcols, i);
	,
		if(Col N Missing(column(dt,i)) > numRows * percent/100,
			insert into(delcols, i);
		);
	);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));  
FR60

Occasional Contributor

Joined:

Jan 24, 2017

Hi msharp, thanks for your reply. Let me ask another question. Does the percent variable apply to both categorical and numeric columns? If so, can I use two different values for the missing and stagnant cases?

 

Sorry if this is a stupid question, but I'm not an expert in JSL.

 

Felice 

msharp

Super User

Joined:

Jul 28, 2015

Solution
Yeah, it all depends on how you set it up. The first check finds the most common value in each column and determines whether its share of the rows is above or below the threshold percentage. Unfortunately, a missing value isn't a value, so it can't be the most common. That's why the first check doesn't catch mostly missing numeric columns. It does work on the character columns because "" is both missing and an empty string value.

Hope that helps explain things.

If you want more granularity, just run multiple loops with different percentages using different strategies.
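
As a rough illustration of that suggestion, here is a minimal sketch that uses two separate thresholds, one for missing values and one for stagnant values. The names missingPercent and stagnantPercent and their values are hypothetical, chosen only for this example. It folds both checks into one loop for brevity; running two separate loops would work just as well. It has not been tested against the tables in this thread.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();

// Hypothetical thresholds, one per strategy (assumed values, adjust as needed)
missingPercent = 90;   // delete a column when more than this % of its rows are missing
stagnantPercent = 95;  // delete a numeric column when more than this % of its rows share one value

delCols = {};
numRows = N Rows( dt );

For( i = 1, i <= N Cols( dt ), i++,
	col = Column( dt, i );
	If(
		// Strategy 1: mostly missing, checked for every column
		Col N Missing( col ) > numRows * missingPercent / 100,
		Insert Into( delCols, col << Get Name ),
		// Strategy 2: numeric column dominated by its most common value
		(col << Get Data Type) == "Numeric",
		vals = col << Get As Matrix;
		// Count the rows equal to the mode, as a share of all rows (missing rows included)
		If( N Rows( Loc( vals == Mode( vals ) ) ) > numRows * stagnantPercent / 100,
			Insert Into( delCols, col << Get Name )
		)
	)
);

// Delete by name once the scan is finished, so the loop index never shifts
If( N Items( delCols ) > 0, dt << Delete Columns( delCols ) );
```

Collecting column names and deleting after the loop means nothing is removed while the table is still being scanned. Character columns are only tested for missing values here; a stagnant check on them would need its own branch.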
FR60

Occasional Contributor

Joined:

Jan 24, 2017

Perfect thanks.

 

Felice