FR60
Level IV

remove almost empty categorical columns

Can someone help me?

From a database, I need to remove the categorical columns that have more than a certain percentage of missing values. At the same time, I would like to remove the numeric columns in which a certain percentage of the values (not necessarily 100%) are stagnant.

Thanks. Felice


9 REPLIES
txnelson
Super User

Re: remove almost empty categorical columns

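Here is a quick and dirty script that gets rid of character columns with less than a specified percent of data:

Names Default To Here( 1 );
dt = Current Data Table();
percent = 90; // minimum percent of nonmissing rows a column needs in order to be kept
// Names of the character columns only, returned as strings
colList = dt << get column names( character, string );
NumRows = N Rows( dt );
// Loop backward through the list of character columns
For( i = N Items( colList ), i >= 1, i--,
	// Look the column up by name: i indexes colList, not the table's columns
	If( Col Number( Column( dt, colList[i] ) ) / NumRows * 100 < percent,
		dt << delete columns( colList[i] )
	)
);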
Jim
msharp
Super User (Alumni)

Re: remove almost empty categorical columns

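Here is another method that I believe is slightly more robust. It checks every column and deletes those in which more than the given percentage of rows is missing.

Names Default To Here( 1 );
dt = Current Data Table();
percent = 90; // delete a column if more than this percent of its rows is missing
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
	col = column(dt, i);
	nmissing = col N missing(col);
	// Collect the column index instead of deleting inside the loop
	if(nmissing > numRows * percent/100,
		insert into(delcols, i);
	);
);
// Delete all collected columns in one step
if(nitems(delcols) > 0, dt << delete columns(delcols));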

 

msharp
Super User (Alumni)

Re: remove almost empty categorical columns

I think you are also looking for something like this, which deletes a numeric column if all of its data is a single value:

Names Default To Here( 1 );
dt = Current Data Table();
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
	col = column(dt, i);
	// Try() skips any column for which Summarize throws an error
	try(summarize(dt, max = max(col), min = min(col));
		// Flag the column if max equals min (one value only) or max is missing
		if((max == min)  | (ismissing(max)),
			insertinto(delcols, i);
		)
	);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));
mpl34
Level III

Re: remove almost empty categorical columns

You might be able to use the method you posted before, but with the mode of the column, to handle thresholds below 100%:

Names Default To Here( 1 );
dt = Current Data Table();
percent = 90;
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
	if(nrows(loc(aslist(column(dt,i)<<getasmatrix),mode(column(dt,i)<<getasmatrix))) > numRows * percent/100,
		insert into(delcols, i);
	);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));  

 

FR60
Level IV

Re: remove almost empty categorical columns

Hi,

I tried your script on the second table below, and the result is shown here.

As you can see, it removed almost all of the desired columns, but not the second one (col2). How can I remove that one too?

Thanks, Felice

 

[Attached images: Picture2.png, Picture1.png]

msharp
Super User (Alumni)

Re: remove almost empty categorical columns

Just combine the strategies.  Something like:

Names Default To Here( 1 );
dt = Current Data Table();
percent = 90; // one threshold, shared by both checks
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
	// Check 1: the most common value (mode) fills more than percent% of the rows
	if(nrows(loc(aslist(column(dt,i)<<getasmatrix),mode(column(dt,i)<<getasmatrix))) > numRows * percent/100,
		insert into(delcols, i);
	,
		// Check 2: more than percent% of the rows are missing
		if(Col N Missing(column(dt,i)) > numRows * percent/100,
			insert into(delcols, i);
		);
	);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));
FR60
Level IV

Re: remove almost empty categorical columns

Hi msharp, thanks for your reply. Let me ask another question. Does the percent variable apply to both categorical and numeric columns? If so, can I use two different values for the missing and stagnant cases?

Sorry if this is a stupid question, but I'm not an expert in JSL.

 

Felice 

msharp
Super User (Alumni)

Re: remove almost empty categorical columns

Yes, it all depends on how you set it up. The first check finds the most common value in each column and tests whether its share of the rows is above or below the threshold percentage. Unfortunately, a missing value isn't a value, so it can't be the most common one. That's why the first check doesn't work on numeric columns that are mostly missing. It does work on character columns because "" is both missing and an empty string value.

Hope that helps explain things.

If you want more granularity, just run multiple loops with different percentages using different strategies.
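
As a rough illustration of that idea, here is an untested sketch that uses two separate thresholds, one for missing values and one for stagnant values. For brevity it folds both checks into a single loop rather than running separate loops. The names missingPct and stagnantPct and their values are assumptions made up for this example, not part of the scripts above, and the stagnant check is restricted to numeric columns.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();
missingPct = 90;  // assumed: delete if more than 90% of the rows are missing
stagnantPct = 95; // assumed: delete if one value fills more than 95% of the rows
delcols = {};
numRows = N Rows( dt );
For( i = 1, i <= N Cols( dt ), i++,
	col = Column( dt, i );
	// Check 1 (all columns): share of missing values
	If( Col N Missing( col ) > numRows * missingPct / 100,
		Insert Into( delcols, i ),
		// Check 2 (numeric columns only): share of the most common value
		If( (col << Get Data Type) == "Numeric",
			vals = col << Get Values;
			If( N Rows( Loc( vals == Mode( vals ) ) ) > numRows * stagnantPct / 100,
				Insert Into( delcols, i )
			)
		)
	)
);
// Delete all flagged columns in one step
If( N Items( delcols ) > 0, dt << Delete Columns( delcols ) );
```

Setting the two variables to the same value should behave much like the combined script for numeric columns. Try it on a copy of the table first, because the deletion cannot be recovered once the table is saved.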
FR60
Level IV

Re: remove almost empty categorical columns

Perfect thanks.

 

Felice