May 28, 2014

Data Mining Poll Data Over the Years

The editor of the KD Nuggets newsletter, Gregory Piatetsky-Shapiro, Ph.D., has attracted quite a number of subscribers over the years with a variety of interesting news items, polls and topics. It's hard to believe he is now posting the 12th annual poll on Data Mining / Analytic Tools Used. Despite the many issues with the data from this admittedly unscientific poll, I was curious to bring this small time series data set into JMP and look at what poll participants had volunteered about their use of modeling tools over more than a decade.

Several of my statistician friends are quick to remind us that no meaningful conclusions can be drawn from such polls because they are fraught with bias and data quality issues. Vendor ballot-stuffing is a particular issue in this poll, and it was commented on in the 2001 and 2003 poll results.

First, what qualifies as a data mining tool is certainly a factor: many statisticians who have been doing predictive modeling for years may not consider themselves to be doing data mining. Data mining/analytics is a very broad cross-disciplinary area. [Full disclosure: Since JMP has included some data mining capabilities starting with JMP 6 (we are now at JMP 9 and also have 64-bit JMP Pro), I asked Gregory if JMP could be included in this year’s survey, and he kindly agreed.]

Second, given the number of new product entrants, exits and acquisitions, as well as population/participant changes over time, there are considerable issues in the time dimension of this data. Some of this is evident in the missing data for various vendors. JMP’s Graph Builder shows how the number of votes varies relative to the percent of total votes each year by vendor.

Graph Builder in JMP displays KD Nuggets poll data
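Why raw vote counts and percent of total votes can tell different stories is easy to see with a small calculation. The following is only a minimal Python sketch, not the JMP workflow used for this post; the vendor names and vote counts are made-up placeholders, not actual poll results.

```python
from collections import defaultdict

# Hypothetical records: (year, vendor, votes).
# Placeholder numbers for illustration only, NOT actual poll results.
votes = [
    (2010, "Vendor A", 120),
    (2010, "Vendor B", 80),
    (2011, "Vendor A", 150),
    (2011, "Vendor B", 250),
    (2011, "Vendor C", 100),
]


def percent_of_total_by_year(records):
    """Return {(year, vendor): percent of that year's total votes}."""
    # First pass: total votes cast in each year.
    totals = defaultdict(int)
    for year, _vendor, count in records:
        totals[year] += count

    # Second pass: each vendor's share of its year's total.
    shares = defaultdict(float)
    for year, vendor, count in records:
        shares[(year, vendor)] += 100.0 * count / totals[year]
    return dict(shares)


shares = percent_of_total_by_year(votes)
for (year, vendor), pct in sorted(shares.items()):
    print(year, vendor, f"{pct:.1f}%")
```

In this made-up example, Vendor A gains votes from one year to the next but loses share, because the total number of votes grows faster. That is the kind of difference the two views in the graph are meant to expose.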

I almost certainly made some fat-fingering mistakes in collecting the data from the past poll results. And this is by no means an exhaustive list of issues with the data.

Data issues aside, many still find it interesting to look at what information people volunteer and how that may be changing over time. We are by nature curious. I also combined tools in an attempt to reflect acquisitions at a vendor/tool-provider level while still keeping the detailed votes at the product level (some assumptions had to be made given repackaging/naming, but all appears to be directionally correct).
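The roll-up from product level to vendor level can be sketched as a simple lookup plus a sum. This is a hedged Python illustration rather than the steps actually taken in JMP; the tool names, vendor names, mapping and vote counts are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical product-to-vendor mapping reflecting acquisitions.
# All names and numbers are placeholders, NOT actual poll data.
product_to_vendor = {
    "Tool X": "Vendor A",
    "Tool Y": "Vendor A",  # assumed to have been acquired by Vendor A
    "Tool Z": "Vendor B",
}

# Detailed votes stay at the product level: (year, product, votes).
product_votes = [
    (2010, "Tool X", 40),
    (2010, "Tool Y", 25),
    (2010, "Tool Z", 60),
    (2010, "Tool W", 10),  # not in the mapping; kept under its own name
]


def roll_up_to_vendor(records, mapping):
    """Sum product-level votes into {(year, vendor): votes}."""
    vendor_votes = defaultdict(int)
    for year, product, count in records:
        # Products without a known vendor fall back to their own name.
        vendor = mapping.get(product, product)
        vendor_votes[(year, vendor)] += count
    return dict(vendor_votes)


vendor_totals = roll_up_to_vendor(product_votes, product_to_vendor)
print(vendor_totals)
```

Keeping the mapping separate from the votes means the product-level detail is untouched, and a different assumption about repackaging or naming only requires changing the mapping.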

From the bubble plot here (output as Flash, opens in a new window), you can play the “data movie” to see a changing blend of colors over time since the bubbles are colored by the tool provider/vendor. The open source tools are showing strong growth, which is consistent with what we observe from other sources, most notably a similar poll done by Karl Rexer of Rexer Analytics. Last year’s results of his poll are available. His survey is now in its fifth year, and active survey links and access codes are on Rexer Analytics and on Dean Abbott’s blog.

Back to the KD Nuggets poll results: using “your own code” could mean commercial, open source or a combination thereof (as is the case with JMP and R, with a growing number of examples of JMP Scripting Language and R on the JMP File Exchange). Many commercial software tools are also showing recent growth. This is consistent with the well-observed trend of organizations leveraging analytics more to create more value. By the way, if you want to volunteer your perspectives, here on this blog or in either of these popular data mining/analytics polls, we invite you to participate.