


May 21, 2014

Improve your Numbersense with this new book

As I mentioned in an earlier post, Kaiser Fung has a new book out: Numbersense: How to Use Big Data to Your Advantage. I've read it and enjoyed it. It's helped me become a more critical consumer of analytical information, which is the aim of Kaiser's book.

Here's an excerpt from the book:

In analyzing data, there is no way to avoid having theoretical assumptions. Any analysis is part data, and part theory. Richer data lends support to many more theories, some of which may contradict each other, as we noted before. But richer data does not save bad theory, or rescue bad analysis. The world has never run out of theoreticians; in the era of Big Data, the bar of evidence is reset lower, making it tougher to tell right from wrong.

People in industry who wax on about Big Data take it for granted that more data begets more good. Does one have to follow the other?

When more people are performing more analyses more quickly, there are more theories, more points of view, more complexity, and more confusion. There is less clarity, less consensus, and less confidence....

More data inevitably results in more time spent arguing, validating, reconciling, and replicating. All of these activities create doubt and confusion. There is a real danger that Big Data moves us backward, not forward. It threatens to take science back to the Dark Ages, as bad theories gain ground by gathering bad evidence and drowning out good theories.

Big Data is real, and its impact will be massive. At the very least, we are all consumers of data analyses. We must learn to be smarter consumers. What we need is Numbersense.

Here's your chance to improve your own Numbersense by winning a copy of Kaiser's book. (If you went to Discovery Summit earlier this week, where Kaiser served on our statistics panel discussion, you got a free copy of the book at the conference. Kaiser signed books for attendees.)

The first 25 readers who leave a comment describing a challenge you've faced involving the collection, processing or analysis of increasing amounts of data will qualify to receive a free hardcover copy of Kaiser's new book, Numbersense. Your contribution to this discussion should be between 50 and 75 words long. Be sure to enter your e-mail address when you write your comment so we can contact you if you are a winner. Only one book per commenter. Commenters must reside in the US to be eligible to receive a book. Thanks to Kaiser's publisher, McGraw-Hill, for providing the books.

Update on Oct. 1, 2013: This contest is now closed. All books have been awarded. Thanks for your interest!

Community Member

Chris Bodily wrote:

I often encounter situations where Big Data leads to big trouble. Many times analyses are attempted by collecting the data in a spreadsheet to view it and identify sub-data sets. Other times, big data implies big reports with dozens or even hundreds of plots and tables. Further, individuals spend hours in the process and end up with output that still does not provide the needed insight and guidance. With the proper tools, proper data validation is possible and effective analysis leads to learning. Even visualizing data effectively can yield clues and information prior to any other analyses.

Community Member

Glenn Waddell wrote:

I am a math teacher and AP Stats teacher in a public high school. The amount of data we generate is huge, but no one ever has time or energy to actually go through and see what we can do differently, improve upon, or even what trends or patterns are in the data. I can teach, or I can analyze data, but I can't do both! I am always trying to learn how to do both better.

Community Member

Robert Lochel wrote:

For the last two years, I served as an instructional coach in my school district. With the advent of Keystone Exams here in Pennsylvania, there was a need to collect data on student performance throughout their Algebra 1 course. The state provides an online Classroom Diagnostic Test (CDT), which adapts questions based on student responses. The test provides a score, but little predictive correlation to the actual Keystone Exam. Now that we have our first round of student performance data, I have been working to compare student Keystone scores to student CDT results in order to provide feedback to students taking the CDT and a path for improvement.

Community Member

Wilfredo Salinas wrote:

In my job, we have to analyze sales transactions from several hundred locations. In our analysis we would love to look at every transaction that happens at every minute, but the data is so overwhelming that we have to narrow it down to just a couple of hours of the day. The second challenge is exporting that data from SQL Server to a spreadsheet. Our current limit is only 1 million rows. This forces us to create multiple sets of files by days of the week, time of day and sets of locations. Our only way to get the "big picture" from this set of big data is to create linked references. On multiple occasions these links do not refresh or eventually get dropped. It can be very frustrating.
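The row-limit workaround described above amounts to chunking a query's results into multiple files. A minimal sketch of that idea in Python (the table, query, and the tiny 2-row "limit" are hypothetical stand-ins for a real SQL Server export with a 1-million-row ceiling):

```python
import sqlite3

def export_in_chunks(conn, query, limit):
    """Fetch query results and split them into chunks of at most `limit` rows.

    In a real export, each chunk would be written to its own CSV file instead
    of being returned in memory.
    """
    cur = conn.execute(query)
    chunks, chunk = [], []
    for row in cur:
        chunk.append(row)
        if len(chunk) == limit:
            chunks.append(chunk)
            chunk = []
    if chunk:
        chunks.append(chunk)
    return chunks

# tiny in-memory demo: 5 rows split with a 2-row "spreadsheet limit"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(i, i * 1.5) for i in range(5)])
chunks = export_in_chunks(conn, "SELECT * FROM sales", limit=2)
print(len(chunks))  # 3 chunks: 2 + 2 + 1 rows
```

The same pattern works against any DB-API connection; the fragile part, as the comment notes, is stitching the resulting files back together afterward.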

Community Member

Sastry Pantula wrote:

Big Data are providing BIG (Business, Industry and Government) opportunities, especially during this International Year of Statistics. Unfortunately, some are hiding the data and not making it accessible for others to use, test independently, or train future problem solvers. There is also a need to build the capacity of statisticians around the globe to meet the demand.

Community Member

Kehleboe Gongloe wrote:

In 2008, I was asked to organize and conduct a Labor Force Survey for Liberia. I had no prior experience doing such research, and no prior labor force survey had been conducted in Liberia. Where and how to begin was a tough challenge. Our savior was the International Labor Organization's Department of Statistics. We did collect and analyze the data, but the Government only granted a permit to release the results more than one year later. This challenge got me back in school, and I am now studying statistics. The whole concept of big data is increasingly appealing to me, and I am beginning to lean toward it in a serious way.

Community Member

James Stiebel wrote:

While Big Data provides great opportunities for insight, it brings an even greater number of problems, more complexity and more confusion (especially when you consider interactions). When training new analysts, I avoid Big Data like the plague. Little Data rules!

My experience has found Little Data to be much more manageable, friendly and valuable. With smaller data sets, I find my ability to test variable combinations, simulate multiple data sampling and build multiple predictive models much easier/quicker. Big Data requires much more time to be invested in data preparation and analysis to drive insights. Not sure if Big Data is worth the Big Effort.

Community Member

Keith at EAVI wrote:

Totally agree with the points of view raised here.

One of the big challenges with the increasing use of big data is its democratization. I am already seeing the suggestion that many more people (who aren't necessarily trained in the use of numbers) are going to be able to do analysis because it will be easier for them to get access.

Cue the spike in internal queries about why this number doesn't match that number (when they aren't actually the same figure: different time period, different calculation).

I'm not really looking forward to a range of people drawing completely contrary conclusions from the same set of data and spreading them, without an analytical expert having had first chance to get a story out there.

PS. I'm in Australia so getting a book is probably going to be tough...

Arati Mejdal wrote:

Hi, Keith. Thanks for your comment. But as you can see from the giveaway rules in the blog post, this offer is open only to US residents. We cannot ship a book to Australia. Sorry.

Community Member

Mark Ewing wrote:

I work for a large, international corporation, and often just the data we generate internally from sales, pricing and other systems is huge and stored all over the place. When analyzing these systems, we have to collect all the appropriate data, join it correctly and only then begin to analyze it. Because I'm deploying my solutions to my clients, everything has to be optimized for speed and space, which is never an easy problem to solve.

Community Member

Mark Ewing wrote:

I wasn't sure if the post had to be between 50-75 words or if 50-75 words was the minimum range, so I culled my comment to 75 words to be safe - I'd love a copy of the book!

Arati Mejdal wrote:

Hi, Mark. Thanks for your comment. Comments need to be between 50 and 75 words. Yours qualifies!

Community Member

Scott Drumwright wrote:

I believe that Big Data is an evolutionary step and not the latest new technology fad as some think. You will never choose to analyze less data once you have access to more. It is built within us to want to map and explain the world we live in completely. This is the first time for me that the next new thing in technology comes with a morality warning!

Community Member

Arved Harding wrote:

As we tackle big datasets in an effort to glean useful information, we may find ourselves facing barriers, as people in the manufacturing environment have for many years. Just mining data can be a futile endeavor. Systematic thinking helps: determine the data needed, decide how and where to obtain it, then clean the data of junk. Only then, using the great computing and statistical tools of today, can we start mining.

Community Member

Mike Paulonis wrote:

Log files from computer infrastructure have often been ignored, either because the contents are deemed to be of low value, because the contents are too unstructured, or because there is just too much logged data (which just continues to grow). Big data techniques are very applicable to the challenging problem of log file analysis and are unlocking the value for even unsophisticated users.
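A first pass at the log-analysis problem described above is usually imposing some structure on semi-structured lines, e.g. pulling out the severity level before any heavier analytics. A minimal sketch in Python (the log lines and their layout are hypothetical; real logs vary wildly):

```python
import re
from collections import Counter

# hypothetical log lines: "date time LEVEL message"
LINES = [
    "2013-09-18 10:01:02 ERROR disk quota exceeded",
    "2013-09-18 10:01:05 INFO request served",
    "2013-09-18 10:02:11 ERROR disk quota exceeded",
    "2013-09-18 10:03:40 WARN slow response",
]

LEVEL = re.compile(r"^\S+ \S+ (\w+) (.*)$")  # timestamp, level, free text

def tally_levels(lines):
    """Count log levels: a first structuring pass over semi-structured text."""
    counts = Counter()
    for line in lines:
        m = LEVEL.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

counts = tally_levels(LINES)
print(counts["ERROR"])  # 2
```

At scale the same map-and-count shape is what log-analysis platforms distribute across machines; the hard part is that every subsystem's message format needs its own pattern.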

Community Member

Antonio from home wrote:

Dealing with big data as individuals is very difficult. In some cases web tools can help. On one occasion I had to study information from a web source generating data non-stop over time. Writing a Google script and setting up a time-driven trigger allowed me to collect and store partial statistics in a Google spreadsheet every few minutes. This greatly simplified the final analysis.
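The commenter's approach was Google Apps Script; the same reduce-as-you-go idea can be sketched in Python, where each trigger firing fetches the latest values and appends only summary statistics rather than raw data (the `fetch` function and the batches below are hypothetical stand-ins for the web source):

```python
import statistics
from collections import deque

def poll_and_summarize(fetch, log):
    """One trigger firing: fetch the latest values, append summary stats.

    In Apps Script this body would run inside a time-driven trigger and
    append a row to a Google Sheet instead of `log`.
    """
    values = fetch()
    log.append({
        "n": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    })

# simulate three trigger firings against a fake source
log = deque()
batches = iter([[1.0, 2.0, 3.0], [4.0, 4.0], [10.0]])
for _ in range(3):
    poll_and_summarize(lambda: next(batches), log)
print([row["mean"] for row in log])  # [2.0, 4.0, 10.0]
```

Storing a few-minute summary instead of every raw observation is what keeps the final spreadsheet small enough to analyze by hand.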

Community Member

Anna in Indiana wrote:

My current and previous job both revolved around massive amounts of electronic medical record data. There are many challenges when working with this data - if you're trying to show 10 years' worth of data on a map, which zip code should you use (and what if it looks like an area has 300% of the population given on the US census!)? If a patient reports a different race at each visit, how do you determine which one to use? How do we aggregate data to a level that is meaningful and yet protects privacy?

Community Member

Chris wrote:

With Big Data comes Big Responsibility. While we like to think that on the analysis side we control the flow of ideas, as more ideas get out there, they get a life of their own. I have challenges in my company making sure that the case I'm making is read the same way by end users who may never talk to me, just see some slides or charts out of context of the larger discussion.

Community Member

Anand Chandarana wrote:

In the human capital analytics space, my biggest challenge is program owners and other stakeholders who want the data to make something out of nothing. "Garbage in, garbage out" as they say. If you want to tie one set of data (e.g., engagement survey data) to another (e.g., performance ratings), there needs to be an actual link between the two and a steady flow of data. Performance rating or engagement survey data that is collected on an annual or even bi-annual basis (in the latter case) are not very helpful. Engagement survey data is much more valuable when it's collected via pulse surveys on a quarterly/ongoing basis. Performance ratings are often so undifferentiated and inaccurate that they rarely correlate to other human capital metrics.

Community Member

Shen Ting Ang wrote:

I work with speech data and there are still questions that have yet to be fully resolved. Various methods of feature extraction have been proposed and used, but each of these has its own set of useful and not so useful properties. Depending on the task and methods used, it is usually a case of trial and error to determine the best combination, and often, there is no universally best solution.

Community Member

A.P. Williams wrote:

Our greatest challenge is typically avoiding the injection of our biases into our interpretations. We analyze sales and marketing data to determine the impact of our activities on outcomes, but rarely do we have a true control group. This leads to a lot of challenges in both analyzing the data and socializing our findings due to the often inconclusive nature. It is difficult to slay the "dragons" (preconceived notions of our stakeholders).

Community Member

Maegen in Boston wrote:

At my company, we analyze large amounts of sensor-based data collected in corporate environments. The in situ nature of the data means we must be very careful about finding patterns in noise, and must constantly be on the lookout for odd sensor behavior. Further, the nature of the data makes it difficult to apply typical pattern recognition algorithms out of the box - we must thoroughly test our set of assumptions. As software and hardware improve, we are faced with the challenge of ensuring our analyses are reproducible and (when appropriate) comparable across projects.

Community Member

Dale wrote:

The biggest data challenge I've ever personally faced involved modeling and decision-making in the electricity markets. The largest US market, PJM, has some 10,000 unique nodes, each of which issues data at hourly (or greater) frequency. Add to that temperature, fuel quantities and pricing, and generator availability, and it adds up to a significant challenge: separating a useful signal for commercial decisions from all the noise.

Community Member

Dave wrote:

I work at a retail power company, and daily 15 minute interval usage data, weather data, call center transcription data, web log data, plus the increasing need to analyze this data quickly in a distributed environment, are causing us to check out new solutions. The interval data alone is about 2GB per day. Some regressions now take weeks on a multicore unix server, which is unacceptable.
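When a regression takes weeks because the design matrix no longer fits comfortably in memory, one standard alternative is a single streaming pass that accumulates sufficient statistics, so model fitting never needs all the interval data at once. A minimal sketch for the simple linear case, on synthetic data (not the commenter's setup):

```python
def fit_streaming(pairs):
    """Simple linear regression from one pass of (x, y) pairs.

    Only five running sums are kept, so the data can be streamed from disk
    or a database cursor without ever loading it all into memory.
    """
    n = sx = sy = sxx = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# synthetic check: the line y = 2x + 1 is recovered from a generator
slope, intercept = fit_streaming((x, 2.0 * x + 1.0) for x in range(100))
print(round(slope, 6), round(intercept, 6))  # 2.0 1.0
```

The same trick generalizes to multiple regression by accumulating X'X and X'y, which is also what makes the computation easy to distribute across the kind of environment the comment mentions.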

Community Member

Jan Kölling wrote:

I recently started my PhD studies in bioinformatics, which involve the analysis of large amounts of biological data from multiple sources. I try to develop tools that help biologists answer their questions more efficiently and explore the data sets beyond their initial purpose. A big part of my work involves communicating the benefits of better analysis and visualization, and raising awareness of the pitfalls that might invalidate research results.

Community Member

Glen wrote:

Some of my research involves distributed computing for big data. I define big data as data too large to load onto a single computer. The most common tool for analyzing big data is Hadoop. However, Hadoop was designed by computer scientists and does not work well for iterative statistical problems, data visualization, or Bayesian methods. It is an exciting time for statisticians to contribute to this area.

Community Member

Murray Meehan wrote:

I work for Microsoft's sports entertainment division, in an analysis role focused on cloud-based web services and client apps which rely on them. I generate gigabytes of test data a week about web service functionality and performance, almost all of which is ignored in favor of very simple metrics which are easier to communicate between teams. When I do more complicated analysis, I struggle to present it in a way which my coworkers can understand. This is a common problem in my area.

Community Member

Chris S wrote:

This article is especially apropos. I just came from a meeting discussing the double-edged sword that the democratization of data provides. Giving more people access to more data can be a very powerful thing that all organizations want to leverage, but exactly as articulated, the danger is this leading to confusion and churning. The question, as always, is how to maximize the benefits while minimizing the costs?

Community Member

Dan D wrote:

As someone who works for the government, I find that collecting data is rarely the issue; the ability to use it is always tough. How much detail can you go into without affecting privacy, and what information is useful to people? As a data scientist, I always feel the opinions of those who don't know anything about the data have guided our analysis and decisions in the right direction.

Community Member

brett wrote:

One problem I have run into with big data is dirty data. Companies wish to utilize the information contained in CRM and ERP databases and believe that a little bit of programming will produce useful models. Even with heavy data cleaning, the remaining data is often so poor that it no longer supports any useful inference about the underlying population.