Are big data and machine learning methods enough? Part 1

Dark data, expertise, modeling: we covered a range of topics on the Oct. 1 Statistically Speaking. David Hand gave a brilliant plenary talk and set the stage for a great panel discussion by cautioning us to remember that thinking is required and to be aware of all the dark data out there, the data that we don’t see but that we need to take into account. Dark Data: Why What You Don’t Know Matters is his latest book (see a blog post about it; if you haven’t read it, you can get a sample excerpt).

The panelists included Cameron Willden, statistician at W.L. Gore, who supports engineers and scientists across many different product lines; Sam Gardner, founder of Wildstats Consulting, with more than 30 years of experience doing statistical problem solving for government and industry; and JMP’s Jason Wiggins, a 20-year US Synthetic veteran with expertise in process optimization, measurement systems analysis and predictive modeling/data mining. 

We ran out of time before we could answer all the questions from the livestream audience, but our panelists have kindly agreed to provide answers to many of them, further sharing the wisdom from their collective experiences. The questions are grouped by topic — there were so many, we are doing two posts. In this first part, we will focus on all the questions about expertise and training. Part Two will cover questions about modeling and data. 

Data science seems to be permitting a whole lot of people who are not analytically trained to get analytical jobs. What are your thoughts on this?

Cameron: If people are doing data science but all they know is, “You put the data in this end, and get the answer out this end,” the people who make decisions based on that work are in for a world of hurt. I am not afraid of democratizing all sorts of analytics, but people still need to have a good foundational understanding of statistics and the limitations of every tool in their toolbox. More precisely, they should be able to reliably recognize when they are about to run afoul of those limitations in the real world. Getting to that point doesn’t require a master’s degree in statistics or machine learning (ML), but getting there through some combination of formal training and informal apprenticeship seems realistic to me.

Sam: On the whole, I find this to be a good thing. Innovators (like JMP!) have created really wonderful software tools that have enabled a “democratization” of analytics, where the primary skills required are less about programming and mathematics, and more about problem solving, teamwork, business or organizational acumen and communication. There are more problems to solve than there are people to solve them currently, and what organizations need are people who have a high “Analytics Quotient” (AQ). High AQ employees can come from a variety of backgrounds and have a varying set of skills, but they still all do need to understand the fundamental ideas in data science.  

I think there is a notion that data scientists don't need to know statistics. I feel it is a dangerous trend. What is your opinion?  

Sam: Although, as I replied above, having people with high AQ and access to good software tools can enable a business to do data science across the business, you still do need a cadre of people with deep knowledge in the mathematical sciences. The reason for this is that they are the core team that will develop better internal tools and processes for data science. As the business grows its AQ, the easy problems will get solved and the remaining problems often require more technical expertise to solve. Plus, statisticians typically have a keen sense for how to solve those difficult problems. So, yes, it’s good for your organization to have access to a statistician, and don’t just assume you can grow your organization’s AQ without the help and services of a statistician.

Jason: I have observed that people who are good analysts have a compulsion for understanding what they are doing, which means learning the statistics. It is apparent in a presentation or technical report when someone has a deep understanding of what they are communicating. Do they know enough to communicate findings and methods simply and answer the hard questions that follow? I don’t think every data scientist should have the knowledge of a mathematical statistician, but they should have a pretty good grounding in theory and application of the tools they use.  I also think there is a distinction between data scientists that are analysts and those that are data architects. In some companies I interact with, these are different job functions with different leadership, but they are both referred to as data scientists. The data architects may not need as much understanding of the statistics as the analysts, but they should have enough to guide their work. Everyone can benefit from a better knowledge of statistics! Why just require it of the data scientists and statisticians? 

I need some help in the ML and AI fields. Do I need to hire someone with business domain expertise, or can I simply provide that to a talented data scientist with the ML and AI expertise? I've got some basic stats and data science skills.

Cameron: Data scientists are often defined as people who have strong statistics/analytics, programming and domain expertise, but fortunately, that last one is typically acquired on the job and to the level needed to be effective. As an industrial statistician at Gore starting in the Fabrics division, I had to learn an awful lot about textiles, lamination and web handling; but I was able to learn enough after just a few weeks to start asking questions the engineers hadn’t thought to ask about their process. When I partnered with them, they were able to fill in a lot of the knowledge gaps I had so that we could collectively create an effective strategy to solve their problem. I went through the same experience in our Medical Products division, HR, and now more recently, Supply Chain. You might find a candidate who did an undergrad in your domain but completed a graduate degree in statistics/ML/data science. That seems to me a rare exception and you might wait an eternity to find that candidate.

Jason: I rarely had the luxury of hiring someone with both domain and data science or even basic statistical problem-solving expertise. I looked for people who are capable learners and have demonstrated success in learning new concepts. If a new hire has the aptitude, you can fill in gaps in expertise with training. I believe that having a solid onboarding process with training is fundamental to a new hire’s success. I think most companies have the knowledge to train domain expertise but may still need to design the curriculum and structure to deliver it. If your company does not have the level of skill needed to teach ML and statistical problem-solving skills to someone with domain expertise, find good consultants to partner with until the overall skill level reaches what is needed to teach internally. Every company can benefit from having a deep pool of expertise in the statistical problem-solving realm.  

There is an ongoing debate over whether it’s better to have "good" data scientists who have domain knowledge but might have a "bias" toward certain data/result interpretations, or "pure" data scientists who collaborate with domain experts instead. Thoughts?

Cameron: Domain expertise would always outweigh concerns over potential for bias in my opinion. Domain knowledge helps you anticipate problems, understand nuances that have a big impact, and interpret the results correctly. However, I think it makes more sense to hire based on analytics and coding skills over domain knowledge since that can usually be acquired on the job, leaning on others in their team for that expertise in the interim.  

Jason: I would like to address the second part of the question first. If you pair a data scientist with domain experts to solve tough business problems, won’t the data scientist develop domain expertise? I agree it is good to work collaboratively/cross-functionally to bridge expertise gaps. We need to rely on our colleagues who know things we don’t. I also think it is good to have fresh perspectives. At the same time, how do you know which questions to ask without domain knowledge? That must come from somewhere. Why not develop domain expertise in everyone, including the data scientist? All scientists should be aware of cognitive bias and be on the lookout for it personally and organizationally. A good way to deal with this is to be disciplined at stating analysis goals, hypotheses, assumptions and potential gaps before embarking on experimentation or analysis. If you don’t, it is too easy for our minds to distort the original thinking or intent to match what is presently visible. Distortions like this can totally mask the Eureka! moments that we all crave. 

Sam: It usually takes a team to solve a problem, and everyone on the team plays a role. The “pure” data scientist can focus on more technical details but needs to be able to communicate with and have the trust of the rest of the team. That person also needs to have good business acumen, to be able to see the big picture and understand the politics of the situation and organization. A good data scientist can also learn the business domain knowledge so that they are a more effective member of the team.   

You all confirmed the optimal setting for data scientist/machine learning engineer as being a team environment. When looking for a job in this area, what are the red flags in job descriptions?  

Cameron: A major red flag is when the job description describes a data science “unicorn” and the listed responsibilities include things a data engineer, app engineer, etc., would typically do. Look for phrases like “cross-functional” and “team,” or descriptions of the types of people you’d be working with, as an indication that the organization has a mature data science function, or has at least been thoughtful about how they are going to grow that capability.

Even when people talk AI, are they still using statistical software such as JMP to analyze the data and create models?  

Cameron: AI doesn’t always require a custom solution, and it’s certainly not always deep neural networks!  There are so many ML methods for supervised and unsupervised learning that can be done really well with point-and-click software like JMP.  

Sam: Great data scientists can be empowered to do great work if they have the tools that allow them to do analytics. Yes, many data scientists will have JMP in their analytical toolbox! The modeling capabilities of JMP are diverse and comprehensive, and while you may not be able to do everything you need to do in JMP, it can still do a lot for you. Combined with its data visualization and exploration capabilities, it really can enhance the speed and quality of the AI work that is done.

Can you recommend an online course for an introduction to machine learning?  

Cameron: I’m a huge fan of Andrew Ng’s Machine Learning course on Coursera as an intro. It covers neural networks, but also regression, logistic regression, support vector machines, etc. If you want to learn more about advanced neural networks, then do the Deep Learning Specialization through deeplearning.ai, also on Coursera. Andrew Ng’s Machine Learning course uses an open-source MATLAB clone called Octave, which you’ll probably never use again. You’ll use Python via Jupyter Notebooks for all the deeplearning.ai courses, which is far more useful, and you’ll also be introduced to the most popular ML frameworks in Python such as Keras and TensorFlow.

Jason: I have taken several Coursera modules. I think the Data Science Specialization is a pretty good place to start. Towards Data Science is a great resource, too.

 

We appreciate the many questions our viewers submitted during the livestream of Statistically Speaking, and we appreciate Cameron, Sam, and Jason for taking the time to share their valuable perspectives on expertise and training.

We have one more training suggestion: Statistical Thinking for Industrial Problem Solving is available from JMP as well as from Coursera. In the next post, our panelists will answer questions about data and modeling. 

If you didn't get a chance to see the panel live, you can watch the recording of Are Big Data and Machine Learning Methods Enough? at our website.

Last Modified: Nov 10, 2020 3:29 PM