In the recent Statistically Speaking episode on demystifying machine learning and artificial intelligence, the opening plenary by renowned author, statistician and educator David Hand set the stage for a very interesting discussion with our expert panelists: Cameron Willden, statistician at W.L. Gore; Sam Gardner, founder of Wildstats Consulting; and Jason Wiggins, senior systems engineer at JMP.
In the previous post, our panelists answered questions on expertise and training that we received during the livestream event. In this post, they share their thoughts on questions about data and modeling. The first question is aptly about dark data, which is the name of David Hand’s latest book, Dark Data: Why What You Don’t Know Matters (highly recommended).
The opposite of dark data is bright data. Isn’t that a rarity?
Sam: Often in the early stages of an organization’s data science journey, they discover that they have lots of dark data issues. The primary problem to be solved is how to get the right data. Organizations that really want to solve problems – and solve them well – will have approaches, strategies and systems that let them generate and collect “bright” and good data.
Can JMP "fit" only the missing data, e.g., with support vector machines or GenReg? To somehow use the structure of the missing data with the existing data to fit it – kind of like an inverse fitting? This would be different from "informative missing."
Jason: I would check into multivariate SVD imputation and automated data imputation in the Explore Missing Values utility and see if these will do what you are hoping for.
Sam: JMP and JMP Pro can do multiple imputation that is based on the correlation structure of the data. See the online help for Impute Missing Data and Explore Missing Values.
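For readers who want to see the general idea behind correlation-based imputation in code, here is a minimal sketch using scikit-learn's IterativeImputer rather than JMP's own utilities; the column names and values are made up purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with a few missing cells (values are hypothetical)
df = pd.DataFrame({
    "temp":     [70.1, 71.3, np.nan, 69.8, 72.0],
    "pressure": [1.01, 1.04, 1.02, np.nan, 1.05],
    "yield_":   [92.5, 93.1, 92.8, 92.2, np.nan],
})

# Each column with missing values is modeled from the other columns,
# so the imputed values respect the correlation structure of the data.
imputer = IterativeImputer(max_iter=10, random_state=1)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```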
What kind of machine learning model can we use to tackle multilabel data that is not mutually exclusive?
Cameron: If I understand you correctly, you can accomplish that simply by having multiple responses. For example, a model might process an image to classify someone as old or young, and separately as male or female.
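As a rough illustration of the multiple-responses idea Cameron describes (outside of JMP), the sketch below fits one classifier per label with scikit-learn's MultiOutputClassifier; the features and labels here are entirely hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # hypothetical features
Y = np.column_stack([                # two labels that are not mutually exclusive
    (X[:, 0] > 0).astype(int),       # e.g., "old vs. young"
    (X[:, 1] > 0).astype(int),       # e.g., "male vs. female"
])

# One underlying classifier is fit per response column.
model = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(X, Y)
print(model.predict(X[:3]))          # each row gets a prediction for both labels
```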
What is the best way to approach predictive modeling for a process output (or outputs) when there is high variability in the process data?
Cameron: The best way to overcome high variability is with a large volume of data; for example, the standard error of a mean has the square root of the sample size in its denominator, so quadrupling the sample size cuts the standard error in half.
Jason: Identify sources of process variation; invest in measuring and collecting data from them continuously. How many critical process inputs are measured in your process? Are there enough to adequately describe variation in the outputs? I have participated in and led projects aimed at understanding problems like this. They can be tricky, but they are necessary. Modeling is not magic. You need good information to be successful. I would invest here first before diving into esoteric models.
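To put a number on Cameron's point about the standard error of the mean, here is a small simulation sketch (the process standard deviation of 5 is hypothetical): each time the sample size is quadrupled, the standard error is halved.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 5.0                      # hypothetical process standard deviation

for n in (10, 40, 160):
    # Standard error of the mean: sigma / sqrt(n)
    theoretical_se = sigma / np.sqrt(n)
    # Check by simulation: spread of many sample means of size n
    means = rng.normal(loc=100, scale=sigma, size=(5000, n)).mean(axis=1)
    print(f"n={n:4d}  theoretical SE={theoretical_se:.2f}  simulated SE={means.std(ddof=1):.2f}")
```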
When you want to train a new model on a small data set in order to get the rewards sooner (saving money), what are some good, practical tips for doing this – even if it means using a less-than-perfect model that’s “good enough”?
Cameron: In the deep learning realm, transfer learning offers a way to do a lot more with less. Designed experiments are another way to get more information about relationships per data point because you plan the points in such a way as to minimize the amount of redundant information. That is, you generally have very little correlation between factors, and thus variance inflation factors equal to or very close to 1.
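As a quick check of Cameron's point about designed experiments, the sketch below computes variance inflation factors for a two-level, three-factor full factorial; because the factor columns are orthogonal, every VIF comes out to exactly 1.

```python
import numpy as np
from itertools import product

# A 2^3 full factorial design in coded units (-1/+1): the factor columns are orthogonal.
design = np.array(list(product([-1, 1], repeat=3)), dtype=float)

# For standardized predictors, the VIFs are the diagonal of the inverse correlation matrix.
corr = np.corrcoef(design, rowvar=False)
vifs = np.diag(np.linalg.inv(corr))
print(vifs)   # [1. 1. 1.] -- no redundant information between factors
```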
Jason: Strategies like leave-one-out or k-fold cross validation may be useful when training a model on smaller data sets to avoid overfitting. I have had success with penalized regression models on small data sets using both cross-validation strategies. When you build a model, how often are you checking in on actual versus predicted? If your predictions are suitable for making business decisions, the model may be “good enough.”
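Here is a minimal sketch of the small-data strategy Jason describes: a penalized (lasso) regression with its penalty chosen by leave-one-out cross validation, using scikit-learn on made-up data rather than JMP.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(7)
X = rng.normal(size=(25, 8))                    # small, hypothetical data set
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=25)

# Penalized (lasso) regression with the penalty chosen by leave-one-out cross
# validation; swap in cv=5 for k-fold if leave-one-out is too slow.
model = LassoCV(cv=LeaveOneOut()).fit(X, y)
print("chosen penalty:", model.alpha_)
print("coefficients:  ", model.coef_.round(2))  # check actual vs. predicted before trusting it
```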
In your experience, what percentage of the time do you need to apply a continual/iterative learning model versus an offline/one-off model to solve problems? What are the pros and cons of each?
Cameron: That is going to vary widely by the area of application. If I am modeling demand for a product based on leading indicators, I should absolutely expect for those relationships to evolve over time. Having my productionized model evolve and adapt over time would be highly advantageous. If I am modeling machine outputs based on a bunch of physical properties, then the relationships will probably remain stable over time, and I don’t need to worry as much about online learning. The exception might be if I want the parameter estimates of my model to become more precise over time as more data comes in.
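The contrast Cameron draws can be sketched in code: an online model is updated incrementally as new batches arrive, whereas an offline model is fit once and only refit when it drifts. The example below uses scikit-learn's SGDRegressor with partial_fit on hypothetical weekly data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(3)
model = SGDRegressor(random_state=0)

# An "online" model updated batch by batch as new data arrives; by contrast, an
# offline model would be fit once on a fixed data set and redeployed only when it drifts.
for week in range(10):
    X_new = rng.normal(size=(50, 4))             # hypothetical weekly data
    y_new = X_new @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=50)
    model.partial_fit(X_new, y_new)              # coefficients are refined each week

print(model.coef_.round(2))
```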
When you are not there to investigate/watch the process, you may miss the cause of the issue; also, COVID complicates this. As researchers/problem solvers/data scientists, what do we do when we cannot be "there" to collect the data?
Cameron: Good question, and the only thing I can really think to say is to figure out how to involve the people who collected the data in the analysis of it. Even if they have no analytics background, they can probably understand a graphical analysis well enough to participate in a discussion and contribute insights from their first-hand experience to the interpretation.
Jason: This is a tough one! Are there virtual tools you can use to see the process? Who is running the process? Can they be trained to make observations about the process that can explain issues that come up in the data? Can you do virtual process walks?
An abundance of data may lead to a false sense of security that the data is unbiased. What are the mitigation steps when using an abundance of biased data while new counterfactual data is arriving?
Sam: Test your models. Always look to see if the model you developed is still predicting well. Make sure that the “support” of the model (the range of the predictor variables used to build it) is well understood, and put in checks to see whether new situations are asking the model to predict outside that “support” range. When you have collected enough new data, the models can be retrained; compare the old model with the new one to see if they are making similar or different predictions.
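A minimal sketch of the “support” check Sam recommends, assuming the predictors live in a pandas DataFrame; the column names and values are hypothetical.

```python
import pandas as pd

def outside_support(new_data: pd.DataFrame, training_data: pd.DataFrame) -> pd.DataFrame:
    """Flag rows of new_data where any predictor falls outside the range
    seen in the training data (the 'support' of the model)."""
    lo, hi = training_data.min(), training_data.max()
    flags = (new_data < lo) | (new_data > hi)
    return new_data[flags.any(axis=1)]

# Hypothetical usage: columns are the model's predictor variables.
train = pd.DataFrame({"temp": [60, 70, 80], "pressure": [1.0, 1.1, 1.2]})
new = pd.DataFrame({"temp": [75, 95], "pressure": [1.05, 1.15]})
print(outside_support(new, train))   # any rows printed here warrant a review before trusting predictions
```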
What type of computer/machine is needed for working on AI/ML models? Can it be done on a laptop with ample RAM?
Cameron: It really depends on what you’re doing. Not all ML is done with huge data sets and models that contain millions of parameters. You probably have enough processing power on your smart phone to do a lot of AI/ML. Once you start working with data sets that are even in the tens of gigabytes in size, then you might start to run into the limitations of what you can do on a PC, even if you are processing the data in batches. If you are doing deep learning with images or video, then parallel and GPU computing become very important; at that point, you are probably using shared computing resources within your organization or using cloud-based services. Google Colaboratory is a free and easy way to access GPU computing for deep learning. You can pay for access to higher tiers of GPU computing power on that service as well.
Jason: What type of ML models do you want to use? How much data do you have? Many of the modeling exercises I have done have been on a laptop workstation: quad-core processor, 32 GB RAM, etc. I worked on a defect detection project where we used convolutional neural networks to identify manufacturing defects from in-process camera feeds. We built high-end custom gaming PCs to tackle this problem and were looking into a server deployment before I left the project. The biggest difference between my laptop workstation and the gaming machine is the graphics card. If you plan to do image analysis and/or GPU computing, a laptop graphics card may not be enough. If speed and memory are an issue, you may want to use a server or investigate distributed computing.
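If you are unsure whether a machine's GPU is visible to your deep learning framework, a quick check like the one below (sketched here with PyTorch, one framework among several) can tell you before you commit to training on it.

```python
import torch

# Report whether a CUDA-capable GPU is visible and fall back to CPU if not.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Training device:", device)
if device.type == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))
    print("Memory (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1))
```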
What is the tool/technique (cleaning, combining, modeling or otherwise) that has helped you most to develop in your career?
Cameron: Curiosity, willingness to try hard things, dogged determination to figure out ways to overcome obstacles, and learning by doing have been far more important than any specific tools and techniques. You pick up what you need along the way.
Jason: Before JMP had the Recode utility, one tool that helped me was leveraging arrays and matrices for data cleaning. It is amazing what a little matrix math, matrix manipulation and/or logical indexing in a script can do to find and fix problems in data. Regular expressions are another tool that has made a big difference in my career with respect to cleaning/munging text data. Now that we have Recode, it is my favorite data cleaning tool! It supports regular expressions for text, and it can easily find problems in numeric data. Since you can save the results as a formula, Recode automatically does most of the scripting I used to do. The Explore Outliers and Explore Patterns utilities also do a lot of what I used to handle with scripts for data cleaning. All are big time savers! Above all, I would say using data visualization through all stages of the analytical workflow to unlock the story behind the data has had the biggest impact on my career.
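Recode is the point-and-click way to do this in JMP; for readers curious what regular-expression cleanup looks like in code, here is a small Python sketch on made-up lot labels.

```python
import re

# Hypothetical messy labels that should all collapse to one clean form.
raw = ["Lot# 0123 ", "lot 0123", "LOT-0123", "Lot  0123 (rerun)"]

def clean_lot(value: str) -> str:
    match = re.search(r"(\d+)", value)   # pull out the numeric lot ID
    return f"LOT-{match.group(1)}" if match else value.strip().upper()

print([clean_lot(v) for v in raw])       # ['LOT-0123', 'LOT-0123', 'LOT-0123', 'LOT-0123']
```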
Sam: Honestly, making good graphs of data is probably the most effective thing I can do. If you put it in a picture, it will tell the story of the data and the problem. The second most important tool is writing: communicating the results of an analysis, summarizing what was learned, and making recommendations for further action. If you don’t write it down, people won’t remember or know what happened.
How can you convince chemical or pharmaceutical industries that base their processes (operation and control) on mechanistic models to jump into black-box alternatives? Are hybrid approaches the first step?
Cameron: Are the mechanistic models holding them back in some way? For me, a mechanistic model is always preferable when it is well within reach. Just like you don’t need a physics education to understand what will happen when you throw a ball, a computer can learn how something works by seeing lots of examples. However, if you know all the physics calculations and parameter values, you need little, if any, data to make accurate predictions. The better you break down a situation to first principles, the less data you need, and the less you have to worry about extrapolation as well. Data-driven models (I’ll use that rather than “black box”) are incredibly useful for working around knowledge gaps or approximating across complexities. If getting a 100% answer would take months, but you can use a data-driven model to get a 95% answer right now, what would you choose? What is the opportunity cost with the first option (think what you could do with all that extra time)?
We had questions about what headset Sam used (a Plantronics 655 DSP) and how to get his book. The e-book can be ordered directly from SAS Press and hard copies are available on Amazon. The second edition is in progress and should be available in 2021!
You can watch the on-demand version of this episode of Statistically Speaking and read an excerpt from the science chapter of Dark Data. Being able to categorize and better deal with all the dark data out there will make your models much more useful. We thank our viewers for asking such good questions, and we thank Cameron, Sam and Jason for sharing more of their expertise!