turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- JMP User Community
- :
- Discussions
- :
- remove almost empty categorical columns

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 16, 2017 10:18 AM
(2765 views)

Some one can help me?

From a database i need to remove categorical columns with a certain percentage of missing character and at the same time I would like to remove the numeric columns with a percentage of stagnant values (not necessary 100%).

Thanks. Felice

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 17, 2017 9:41 AM
(5412 views)

Solution

Yeah it all depends on how you set it up. The first check is finding the most common value in each column and determines if it is above or below the threshold percentage. Unfortunately, a missing value isn't a value, so it can't be the most common. That's why it doesn't work on the numeric columns. It does work on the Character columns b/c "" is both missing and an empty string value.

Hope that helps explain things.

If you want more granularity, just run multiple loops with different percentages using different strategies.

Hope that helps explain things.

If you want more granularity, just run multiple loops with different percentages using different strategies.

9 REPLIES

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 16, 2017 10:34 AM
(2763 views)

Here is a quick and dirty script that gets rid of columns with less than a specified percent of data

```
Names Default To Here( 1 );
dt = Current Data Table();
percent = 90;
colList = dt << get column names( character, string );
NumRows = N Rows( dt );
For( i = N Items( colList ), i >= 1, i--,
If( Col Number( Column( dt, i ) ) / NumRows * 100 < percent,
dt << delete columns( colList[i] )
)
);
```

Jim

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 16, 2017 12:08 PM
(2757 views)

Another method, I believe slightly more robust.

```
Names Default To Here( 1 );
dt = Current Data Table();
percent = 90;
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
col = column(dt, i);
nmissing = col N missing(col);
if(nmissing > numRows * percent/100,
insert into(delcols, i);
);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 16, 2017 12:12 PM
(2755 views)

I think you are also looking for something like this:

Which will delete a numeric column if all data is only one value.

```
Names Default To Here( 1 );
dt = Current Data Table();
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
col = column(dt, i);
try(summarize(dt, max = max(col), min = min(col));
if((max == min) | (ismissing(max)),
insertinto(delcols, i);
)
);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 16, 2017 6:06 PM
(2744 views)

you might be able to use the method you posted before but with the mode of the column for something less than 100%

Names Default To Here( 1 ); dt = Current Data Table(); percent = 90; delcols = {}; numRows = nrows(dt); for(i=1; maxi=NCols(dt), i<=maxi, i++, if(nrows(loc(aslist(column(dt,i)<<getasmatrix),mode(column(dt,i)<<getasmatrix))) > numRows * percent/100, insert into(delcols, i); ); ); if(nitems(delcols) > 0, dt << delete columns(delcols));

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 17, 2017 7:52 AM
(2726 views)

Hi

I tried yor script on the second table below and the result is here.

As you can see it was able to remove almost all desired columns but second one (col2) not. How I can do to remove it too?

Thanks Felice

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 17, 2017 8:20 AM
(2721 views)

Just combine the strategies. Something like:

```
Names Default To Here( 1 );
dt = Current Data Table();
percent = 90;
delcols = {};
numRows = nrows(dt);
for(i=1; maxi=NCols(dt), i<=maxi, i++,
if(nrows(loc(aslist(column(dt,i)<<getasmatrix),mode(column(dt,i)<<getasmatrix))) > numRows * percent/100,
insert into(delcols, i);
,
if(Col N Missing(column(dt,i)) > numRows * percent/100,
insert into(delcols, i);
);
);
);
if(nitems(delcols) > 0, dt << delete columns(delcols));
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 17, 2017 9:02 AM
(2712 views)

Hi msharp thanks for your reply. Let'm ask another question. The percent variable work for both on categorical and numerical columns? If yes can I have two different value for missing and stagnant cases?

Sorry for maybe stupid question but I'm not expert in jsl language.

Felice

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 17, 2017 9:41 AM
(5413 views)

Hope that helps explain things.

If you want more granularity, just run multiple loops with different percentages using different strategies.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Mar 17, 2017 9:53 AM
(2704 views)