
Milking the Data at Dairygold: From Bar Charts to Neural Networks (2021-EU-30MP-735)

Level: Intermediate


Kieran O'Mahony, Data Science & Analytics Manager, Dairygold


In the modern world, the use of data analytics is typically more associated with industries such as finance, health care, pharmaceuticals, medical devices, semiconductors, e-commerce and the like. Rarely are the words agriculture and analytics found in the same sentence. This session will give an insight into how one such business, steeped in tradition and history, has embraced and encouraged the use of data analytics to ensure that it can no longer be considered a business deemed to be “data rich, but information poor.” This session will show multiple examples of how the use of JMP has enabled more data-driven decision making across Dairygold.



Auto-generated transcript...




Kieran O'Mahony Okay, so thank you very much, my name is Kieran O'Mahony and I'm the data science and analytics manager for Dairygold in Ireland.
  And the title of my presentation is Milking the Data at Dairygold from Bar Charts to Neural Networks.
  So I think it's fair to say that the use of data analytics is typically more associated with industries such as finance and healthcare, pharmaceuticals, semiconductors and the like, and it's probably rare that the words agriculture and analytics are found in the same sentence.
  So this session is intended to give an insight into how a very traditional business has both embraced and encouraged the use of analytics across the business.
  So this session is probably going to be different than the other sessions that you might see at Discovery, in the sense that what I'm trying to do here is less focus on one project, but rather show multiple examples of how
  Dairygold have used JMP to enable more data-driven decisions across the business, and not just in manufacturing, which I think is important.
  So, before I start I suppose it's important to give a bit of a background on the business, an overview and what it is that we do.
  So Dairygold is Ireland's leading farmer owned cooperative with state of the art cheese and nutrition powder manufacturing facilities located in the heart of the Golden Valleys.
  Okay, so just to explain that, as a cooperative we basically...basically means that we are owned by the milk suppliers, so by the farmers.
  So we have 2,800 milk suppliers that supply us milk on a daily basis, and every single one of them is a shareholder, so the business model is a bit unusual in the sense that the raw material suppliers are also the owners of the business.
  So the turnover last year was 1.2 billion euros and we have about 1,200 employees and we process about 1.4 billion litres of milk a year, so it's quite a lot of milk coming at us on a daily basis. In fact,
  at peak, because we have a seasonal business, at peak production we're talking about over 6 million litres of milk coming at the sites on a daily basis.
  We also have retail stores, where you can buy anything from a hammer to household electricals and farm equipment and everything in between. And we also have the agri side of the business as well, where we produce animal feed in our mills.
  So Dairygold has been around for a long time, and it started off back in 1908 as Ballyclough and then became Mitchelstown in 1919 and then Dairygold in 1990.
  And we've probably come...well, it's fair to say we've come a very long way from milk being delivered by horse and carriage all those years ago.
  It is a very traditional organization with a lot of history, but it's a very progressive organization as well.
  And when I say we've come a long way, you know, we have a very large supply chain with a fleet of tankers now, and each tanker,
  you can see an example here, each tanker hauls about 26,000 litres of milk, and we collect milk daily from our 2,800 suppliers. And it's a very complex supply chain in the sense that
  not only do we collect milk from our suppliers, but we also collect whey from our cheese processing plants, if we're not automatically directing it
  through a pipe to a different site. So we collect it in tankers also and we transport it to other sites.
  And we do the same for skim milk concentrate and demineralized whey concentrate that we sell to other customers, and then we also have the powder part and cheese part of the business where we're distributing also.
  So we also have some of the largest dryers in Europe; you can see here one in particular that's spanning five or six stories.
  So, in the last five or six years, Dairygold invested over 400 million across the plants, to ensure that we
  can meet the needs of future volume expansion in the milk pool. In terms of products we make, we make
  powders and cheese, for the most part, from whole milk powders to skim milk powders and protein powders, and lots of different types of cheese, specialty cheese and cheddar.
  And we supply multiple sectors, from confectionery and pizza to savory to yogurt and processed foods, and even infant milk formula as well, so baby food.
  And if you've eaten at all today, there's a very good chance you've also consumed some skim milk powder or whole milk powder, because it's just so prevalent, these days, in so many food ingredients.
  So that's the background on the business. In terms of data, I think it's fair to say that there's a lot of businesses out there that are deemed to be very data rich but information poor.
  And it's also fair to say that the more data you have, sometimes what can go with that is a lot more confusion as well.
  And sometimes I think it's fair to say that even when you have all of the information, and you have this population, and you have, you know, an idea of what exactly
  you need to extract in order to get useful information for the business, sometimes businesses don't exactly know what to do with that data.
  And from Dairygold's perspective, what we want to do is try and move more away from the right side of the brain, where it's focused on using, I suppose, assumptions and gut feeling and guesswork and intuition,
  and moving more towards the left side of the brain, where it's more focused on using logic and math and reason and science, but this is what we want to be using more in the business and using more data driven decision making across the business.
  I was asked once in advance of a presentation, to give a summary on Dairygold's data strategy and to do so on a single page, and I guess at the time, I...
  it's when I realized that we probably didn't have a solid data strategy at the time.
  But what I'm trying to do here now is just show in a single pictorial what our actual data strategy is and it's real simple.
  So we want to take data that exists in the business and use analytics to turn it into useful information.
  And from that information we get answers to the key questions that we had, and sometimes not only do we get answers to the questions that we had, but sometimes
  we actually end up with more questions. But these are really important and really powerful, because these are questions that we didn't know to ask from the outset.
  So sometimes we need to go back and look at our data again and help us get answers to these questions.
  But ultimately when we get to the point where we have answers to all of our questions, we're in a much better position and we have a lot more insight on our processes.
  And with that insight, we are then enabled to make some timely data-driven decisions and send us off in the right direction.
  And if we understand our process well enough, we have enough insight, then we should be able to use that data to be able to
  understand what potentially could happen in the future, with a high degree of accuracy. So good insight should enable us to have good foresight as well, where we can predict what's going to happen.
  So, as I said at the beginning, the purpose of this presentation is to give a number of examples of the different ways in which data has been used across Dairygold.
  We are, I would say, you know, a long way down the path, from an insight perspective, in terms of making decisions based on our data, and we're starting into the whole area of machine learning and AI at this point.
  So I'm a big believer that simple can be very powerful as well.
  And that's why this presentation is called from bar charts to neural networks.
  And I think it's really important to highlight that you can extract a lot of value from very simple analytics. So the very first example I want to share here is just bar charts.
  But sometimes I think people can go into presentations and have bar charts spanning across multiple pages, and with JMP, it's so easy just to put all the information into a single
  view. So in this case, we're looking at nitric acid in a cleaning process, and up here on top, we can see each individual data point for each of the stages of a nitric acid cleaning process.
  And you can see the level of nitric acid that was used across each of the steps. In the next step, we're looking at the number of uses, so the number of times that step was cycled.
  And here we're looking at the mean use of nitric acid for each of the steps. So as an example, we can see that,
  you know, we have a higher average here of 24 liters per week being used, but the frequency of occurrence of that step was only four times,
  and it was clearly biased by this outlier here, so this is obviously skewed. And probably more important for us, because what we're trying to do here is understand
  and reduce the amount of nitric acid that's being used on a daily, weekly, monthly basis, is the sum of the nitric acid used across the process for each of the process steps.
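The pitfall here (a per-step mean dominated by a single outlier, while the sum is what actually drives total usage) can be sketched with made-up numbers. These figures are illustrative stand-ins, not the plant's actual data:

```python
# Hypothetical nitric acid usage (litres) for one cleaning step:
# four cycles, one of them an outlier.
step_usage = [5, 6, 4, 81]

cycles = len(step_usage)
mean_per_cycle = sum(step_usage) / cycles  # pulled up by the single outlier
total_used = sum(step_usage)               # the figure that matters for reduction

print(cycles, mean_per_cycle, total_used)
```

A step with a high mean but only a handful of cycles can matter far less for total consumption than a modest step that runs hundreds of times, which is why the sum view was the one that mattered here.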
  And what was important...what was important for us to highlight here, I guess, and again it's just simple bar charts, but it gave us huge direction
  in the business and sent us off...sorry, avoided sending us off on a tangent, because when we looked at all this data, we can see that
  three different areas jumped out. We had a final rinse, a pre-rinse and another final rinse step as well. And the data was showing
  that acid was being used during these rinsing steps, which is not possible, because the rinsing steps just involve water.
  So a little bit of further digging showed that incorrect data tags had initially been set up by a vendor on our SCADA system
  for our monitored flow rates. So where we thought we were actually monitoring nitric acid usage, we actually weren't. We were being incorrectly informed. So
  we could really have gone off and dug deep and tried to understand our process and what was driving it, but very simple bar charts sent us off in the proper direction, and we were able to accurately understand what was really happening in our business.
  And, if I'm honest, this is probably one of my favorite uses of bar charts in the business, because it's so simple.
  But it shows four different dimensions of downtime all on one page, really simple and very user friendly and intuitive to use and read, I think, as well.
  So what we have here is the amount of downtime being monitored in a cheese plant per month. So you're looking at minutes downtime on the Y axis here
  for each of six different areas inside the cheese plant, for each of six months, April, May, June, July, and so on until September, in time order, I guess, really,
  for each of shifts A, B, C and D. Now, you might be forgiven for thinking that the data has been made up because the numbers of minutes are so low, but it's really important from a
  cheese manufacturing perspective that the plant doesn't stop. So these are real numbers; I mean, we're monitoring downtime in minutes per month.
  If the process stops, then the pH of the product changes and that can adversely influence
  the final sale price and quality of the product as well, so it's important that we don't stop.
  And so, this is the level that we're at from a maintenance perspective. We're down to minutes, and clearly what we can see here is that there are not too many issues with BetaVac, but for the blockformers, we're seeing issues with shifts C and D,
  with more issues here for C on Blockmaster. In this area of conveyors, again C is jumping out, and again for gusseters, C and D.
  And maybe more issues with C and D here. So what we were able to extract from this analysis was that shifts C and D were struggling a little bit from a maintenance perspective and they needed a little bit more support.
  We particularly like the dynamic linking
  element of JMP as well for data exploration. And, in this case, I should point out, we have a lot of labs within the business, so we have microbiological, chemistry, ??? labs,
  and we test milk and cheese and water and in process and finished product and we test a huge number of elements of the business.
  And this interactive linking function which was, as you can see here was made into an interactive HTML,
  was used so that different people across the business could work to build a repository and then be able to view it and interact with it.
  So what we're trying to do here is create a repository whereby we have problems that exist within the business, so in this case I'm highlighting block...sorry I'm showing blocked tubes and evaporator CIP and power cut.
  And when you click on power cut, you can see the shaded areas of the green bars get highlighted. So in this case, you see a power cut, one or more happened in August.
  It affected these processing stages here and the micro organism, in particular, that was seen most in those processes was strep.
  So the value here is that when we keep adding issues to this repository from a factory perspective, we'll be able to then
  look into the future to see the risks to the business when something happens. So we know the next time we have a power cut that we need to maybe increase the amount of testing in this stage or process, because there's a higher risk of strep occurring.
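The idea of the repository, looking up past associations when an issue recurs, can be sketched as a simple mapping. The issue names and the strep association come from the talk; the stage names here are hypothetical placeholders, not Dairygold's actual data:

```python
# Minimal sketch of the issue repository: map each recorded issue to the
# processing stages and organisms it has been associated with in the past.
# Stage names below are illustrative placeholders.
repository = {
    "power cut":     {"stages": ["evaporation"], "organisms": ["strep"]},
    "blocked tubes": {"stages": ["separation"],  "organisms": []},
}

def risk_profile(issue):
    """Look up which stages may need extra testing when this issue recurs."""
    entry = repository.get(issue, {"stages": [], "organisms": []})
    return entry["stages"], entry["organisms"]

stages, organisms = risk_profile("power cut")
```

Each new incident added to the table makes the next lookup more informative, which is the value the interactive HTML version delivers across the business.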
  Sometimes you get to the point where you have so many bars that you need to add in, you know, seasonality and smoother
  lines as well to add more visibility. This is a challenge we have in the business, which I'll touch on later: the seasonality element of the business.
  And we are very much weather dependent, because the cows eat the grass, and the grass is then obviously used, ultimately, to make the milk, and because, in Ireland, the weather is so unreliable.
  It can be raining one day and snowing the next and sunshine the next day. So from a seasonality perspective, each of the bar colors here is used to represent a different year, so 2015, 16, 17, 18.
  And it's just the weekends as well, I should add, and it's looking at overall weekend orders into the mill for grain.
  You can see that years 2016 and 17 are somewhat similar in that they have this M shape, whereas 2018 here is very, very different.
  So as an example, here's where we were all throwing snowballs at each other, and six weeks later there was a heatwave, with a hosepipe ban.
  So very, very difficult to use seasonality data that's effectively weather dependent within the business, but that is one of the challenges that we try and overcome.
  So, from a trend...keeping with the trend perspective, we use JMP as well for non-project-driven exploration. So here's an example of our 2,800 milk suppliers' average monthly
  protein and fat results coming into the business over a two-year period. So every tick mark here represents a month, in this case September 2018, October 2018, and so on.
  So it shows the range of variation that exists, both within and between our suppliers. And what's useful is that we can zone in, and we can say, in particular for this guy, as an example, why is it that this guy is so high?
  And he's high consistently for both fat and protein and is there something that we can learn from this supplier that we can maybe bring to other suppliers?
  Or, conversely, are there other suppliers down at the lower end
  that could do with some support and some help from our milk advisors, because they're consistently unable to get the protein and fat levels in their milk
  high? These are important, because these are the parameters that the farmers get paid on, the levels of protein and fat.
  So in this case, it turned out that this guy in particular was one of the very few milk suppliers who chooses to milk only once a day.
  So it means that his fat and protein levels are actually higher than everybody else's, but he has less volume to be giving back to Dairygold, and that's his choice.
  Moving on now to parallel plots.
  And these are my current favorite because I think they just show so much information on the one page, at the same time, for so many different parameters.
  This is a problem that's actually occurring right now, so this is literally hot off the press and the problem hasn't actually even been resolved yet.
  So what we're looking at here's relating to intermittent powder blockages on fluid bed dryers and cyclones,
  which basically means that powder is blocking up the dryers and the cyclones. And we call these chokes, OK, so the equipment is choked and we have to stop the equipment and clean it out.
  So what we're actually looking at here is
  30 different parameters, all displayed on the X axis here.
  And each of the lines going from left to right represents one minute in time for the settings of each of these 30 parameters. The red indicates the
  minutes in the one hour before a choke occurring, so these are the level settings, if you like, for the 60
  minutes before any choke occurs, and this red group here represents nine different choke occurrences.
  Okay, so that's 9 times 60, or 540, strands up here, so this is effectively the fingerprint of what occurs just before a choke happens.
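The assembly of that fingerprint, 60 one-minute rows for each of the 9 choke events, can be sketched as a windowing step. The readings and event times below are synthetic stand-ins for the plant data:

```python
# Sketch: collect the 60 one-minute parameter rows preceding each choke
# event, giving the 9 x 60 = 540 red lines on the parallel plot.
# Synthetic minute-by-minute readings (one dict per minute).
readings = [{"minute": m, "diff_pressure": 100.0 - (m % 7)} for m in range(5000)]
choke_minutes = [400, 700, 1100, 1800, 2300, 2900, 3400, 4100, 4700]  # 9 events

def pre_event_window(event_minute, width=60):
    """Rows for the `width` minutes immediately before an event."""
    return readings[event_minute - width:event_minute]

pre_choke_rows = [row for c in choke_minutes for row in pre_event_window(c)]
print(len(pre_choke_rows))
```

The same windowing applied to event-free stretches produces the "no choke" comparison group that gets ghosted behind the choke fingerprint.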
  And what we're trying to do is compare that to typical normal operating production scenarios, from a machine parameter perspective.
  And it's really good, because you can see how they compare and even better when you can overlay them on top of each other as well, and I have...the no choke is ghosted in the background. So now what's interesting is that we can actually see
  how the choke scenario differs from the no-choke and understand what's happening. So as an example, a question that's being explored at the moment is,
  you know, we seem to have lower levels of whatever this is...differential pressure.
  And for the most part, when we get a choke, is this a potential contributing factor that we should be looking out for? Is this something that's causing it? Because typical normal operating
  parameters for differential pressure, when things are good, are high here as well. So again, just one example.
  Using GPS data to visualize patterns and trends in supplier location, so what we're looking at here is protein levels in December and across the entire region of Munster, where all our suppliers are. And each dot here represents a supplier location within the business.
  And the size of the dot represents the volume of milk that supplier provides Dairygold with, so, you know, the bigger the dot, the bigger the farm, really.
  And there's a diverging scale here, where we expect that the protein levels should be at 3.5, so we're less interested in that and that's shaded in gray.
  But anything greater than 3.5 is in green, so they're the guys that are doing really well, and in red are the suppliers that are maybe,
  excuse me, that are maybe struggling a little bit to get their protein levels higher. So this is where we can use our own data to maybe reach out and see if there are suppliers who might need some support from our milk advisors, and we can help them as well. So very useful.
  Sticking with the geographical element, we've been able to put some age-old myths to bed as well. So these are the
  discussions that might take place at the water cooler, whereby somebody would say this thing, NPN, is definitely geographically affected.
  NPN is a difficult enough thing to explain, so I'm going to use the analogy here of a pint, a pint of beer if you like. It's probably more of a glass, I think, really.
  And so, as everybody knows, milk has protein, fat, lactose, minerals, vitamins. Now, these are good things.
  But there's lots of other stuff in it as well, and if we're just looking at protein as a whole, you can consider the whole pint to be the protein that we get in the milk from the farmers.
  But that protein can be broken up into the true protein, which is basically the liquid part of the drink here.
  And it's this bit that we can actually use in our manufacturing process. The head of the pint, the frothy bit here, is the other stuff, so the urea and the uric acid, and so on.
  While we pay our suppliers for all of the protein, unfortunately we can't use all of it in our process, so we can only use this piece here, the real bit, the true protein.
  So depending on the process, it can be the case that if we have higher or lower levels of NPN, non protein nitrogen, at any given time of the year,
  then it might be best to direct that product to a particular site
  now over another site, so we can maybe be strategic. And the question has always been, you know, I bet you that at a particular time of the year, geographical location will impact non-protein nitrogen levels across
  the geographical locations. And what you're seeing here is data that came from 284,000 data points, so 284,000 measures of NPN,
  non protein nitrogen, across an entire year, across six different regions in Munster in southern Ireland.
  And the lines here represent the means and the difference in means that exists from month to month, but also the means that differ from a color perspective from region to region.
  The box plots here represent the amount of variation that exists within and between, again, each of the regions, so clearly, you can see that the averages are following
  along nicely with each other for each region and also any time that the amount of variation reduces,
  let's say in July here, it appears to reduce across all regions, except for maybe a slight anomaly here for Tipperary, in this case.
  But for the most part, when the variation increases in a region as well, it increases across all regions.
  So we no longer need to be talking about the potential of this, that and the other. We're able to prove and disprove theories very quickly with our data as well.
  Every five years we send out a census to our milk suppliers, and it helps us to understand future volume intentions and challenges that suppliers might have on their farm.
  And it gives an indication of sentiment from a business direction and from an investment perspective as well, from our members.
  So recently we sent out over 100 questions to each of our 2,800 suppliers and, in this case, the question we asked was, what are the growth limiting factors on your farm?
  And people were given nine options, so in this case, factors one to nine. They were asked to rank those factors, as presented to them, in order of most to least significant, so from one to nine. So as you can imagine, you get back 2,800 results,
  one from each supplier, and each result has a score of one to nine, and some of them didn't always score all nine; they may have only done their top three, others maybe their top five or top seven or whatever.
  So the question is, how can you best represent all of that data, those 25,000 data points, in one clear picture that represents the entire farming community reporting into Dairygold, if you like?
  And the answer here is simply a mosaic plot. You can see here, hopefully very clearly, that 48.8% of respondents, of our suppliers, have ranked Factor 4 as being the most significant
  limiting factor to expansion on their farms. That's followed then by Factor 5, where 19% voted it as the number one most significant and 23% voted it
  as number two. You can see again the factors that are most important to farmers as we go from left to right.
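The tallying behind a plot like this, computing what share of respondents ranked each factor first, even when some rankings are partial, can be sketched as follows. The ballots below are hypothetical; the real survey had 2,800 responses over nine factors:

```python
from collections import Counter

# Hypothetical ballots: each is one supplier's (possibly partial) ranking,
# expressed as {factor: rank}. Not the actual census data.
ballots = [
    {"Factor 4": 1, "Factor 5": 2, "Factor 1": 3},
    {"Factor 5": 1, "Factor 4": 2},
    {"Factor 4": 1},  # this supplier only ranked their top choice
]

# Count how often each factor was ranked number one, then convert to a
# percentage of all respondents.
first_choices = Counter(
    factor for b in ballots for factor, rank in b.items() if rank == 1
)
pct_ranked_first = {f: 100 * n / len(ballots) for f, n in first_choices.items()}
```

Dividing by the number of respondents rather than the number of rankings is what keeps partial ballots from inflating any one factor's share.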
  So color is really important here, and I'm following through with that theme here with stratified heat maps, again for a very different part of the business:
  retail orders. So in this case we're looking at the number of orders placed, by time of the day, at weekends only,
  for feed, into our stores. So this is for Friday, Saturday and Sunday, for each of four different years, 2015 to 2018,
  and for the time of the day that the orders are sent in. Now, orders can be sent in by farmers on apps, or
  they can ring in the order. But you can see here clearly on Fridays, you know, when the first
  milking of the cows occurs, some farmers are going out and, right at five in the morning, saying, okay, I need more feed, I'm going to place an order here automatically in the app.
  And again, it might happen again, maybe at 12 o'clock or so and it depends on the the milking time of the herd, and so on. But the value here is that we can
  understand what are the busiest times of the weekend for receiving orders and when orders need to be processed, and it can enable us to understand
  how best to have effective staffing levels across the weekend. So there's no point in having the same number of people in on a Sunday as there is on the Friday. Maybe more
  time should be spent focused...and more headcount should be focused on getting orders on a Friday, and maybe early on Saturday morning as well. So it made for much more effective staffing levels at our stores.
  Again, keeping with the theme of color...and, as you'll notice, no fancy analytics yet, we'll get there shortly,
  but keeping with the theme of color, what you're looking at here is the output where we've graphically represented a large bed of cheese.
  So when I say large, it's probably 25 meters in length and probably two meters wide. So all of the curd has been drained of the whey at this point, and then this bed of cheese is squashed and it's sliced into 70 strips.
  And then the strips are cut into five blocks and then what we've done is we've measured the height of the blocks on the left, in the middle, and on the right hand side.
  So what we're trying to do is get a gauge on the height of the cheese bed for this big huge slab of cheese.
  And once we were able to do that, we could get a really good picture of what's actually happening, using color as a guide as well to the height of our cheese bed.
  So, clearly the red means higher, green means a little bit lower, and we're measuring here in millimeters, so it's 120 millimeters to 155 millimeters.
  Clearly, you can see here that there's a bit of a slope when you go from the upper left-hand
  side, where the heights are higher, down towards the lower left side, where the bed is actually falling off.
  So this is really important to us because, from a variability perspective, what we're trying to understand is, why is there a difference in the blocks...
  in the weight of blocks going to our process? This is pretty much pointing to one of the reasons. Because if the heights of the blocks are going to be different,
  then logic would dictate that the weights of the blocks are also going to be different, if the cheese has been compressed with the same force.
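A back-of-envelope check makes the point concrete. Under the assumption that the blocks share the same footprint, density and compression, weight scales linearly with bed height, so the 120 mm to 155 mm height range quoted above bounds the potential block-to-block weight spread:

```python
# Height range of the cheese bed, in millimetres (from the talk).
low_mm, high_mm = 120, 155

# If weight scales linearly with height (same footprint, density and
# compression assumed), the relative height spread is also the maximum
# relative weight spread between blocks.
weight_spread_pct = 100 * (high_mm - low_mm) / low_mm
print(f"up to ~{weight_spread_pct:.0f}% block-to-block weight spread")
```

Roughly a 29% spread, which is why levelling the bed matters so much for block weight consistency.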
  So, again, it's easier for people to understand that in pictures, particularly if they're not very versed in the analytics side of the business.
  Going more into the analytics, I suppose, look, we've used statistical process control a lot across the business, to,
  you know, look for opportunities, as well as verify improvements. Here we're looking at carbon dioxide usage. We use carbon dioxide on site in our products, particularly in our powders,
  when we're bagging off powders, to extend the shelf life of products, which is a standard practice in industry.
  So we expel the air and we add in carbon dioxide when we're filling the bags, and then we seal the bags. But you can see here we had a lot of variation, particularly in 2017.
  And we were able to get that process under control, and we were able to ensure that by 2018 life was very, very different and
  where our mean usage of carbon dioxide has reduced significantly. We still have some challenges, and maybe even a pattern emerging here as well that warrants investigation, but
  there are knock-on benefits to this kind of thing as well, because as well as getting our process under control, and being able to statistically demonstrate same, we were also able to create some
  studies to effectively assess the rate of absorption of the carbon dioxide into our powders as well.
  And this is really important, because if we can understand how our product is absorbing the carbon dioxide, then it leads to other advantages. One being,
  when the fat absorbs the carbon dioxide in a bag, it creates a vacuum inside the sealed bag, so when it creates a vacuum,
  the bag starts to seal around the powder, and it makes the bag of powder much more rigid.
  So now, from a supply chain perspective, it reduces the risks and makes life easier when we stack these bags on pallets for transport and shipping, because
  now it's like we're stacking blocks of concrete, as opposed to trying to stack pillows, which are soft and fluffy and flexible. So it's a win-win all around.
  Now, I'm definitely not a coder, but we have definitely reached out and got a lot of help from the Irish users group
  and the JMP Community... And actually, you know what, I'm going to point out Troy Hickey at Intel, in particular, who has been very helpful to us from the Irish users group.
  With the JMP Community, it's brilliant when you reach out and ask a question in an area where maybe you're looking for some clarity,
  or maybe an area you don't know anything about, and getting expert responses back from people that you've never heard of is excellent as well.
  So in this case here, we were able to create some code using some JSL, and we were able to create this control chart
  format, if you like, which is in tabbed format. So in this case, for skim milk powder, whole milk powder, demineralized whey powder and casein, for measuring protein
  tests in our labs. We have about 30 different tests at the moment that we're monitoring, and we're monitoring the difference that exists between standards and wet chemistry
  on a daily basis, and also the difference that exists between manual methods and near infrared methods, so the automated methods.
  And we want to ensure, as an example, let's say if we're looking at, you know, standards on wet chemistry, we want to aim for a target of zero. We don't want there to be any difference between the results we test in a lab and what the
  standard says it should be. So we want a target of zero, and this shows us that, in this case, for
  protein on demineralized whey powder, we're out slightly; we're at .04. And we can keep track of the profile of our process and we can get
  the process to flag to us that a process shift has occurred: at least nine data points in a row above the mean, in this case.
  And we know, then, to stop the process and not continue anymore, because we need to investigate and put something in place to rectify the problem, the process shift in this case.
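The run rule just described, flagging when nine consecutive points sit on one side of the target, is a standard control chart rule and can be sketched in a few lines. The daily difference values below are hypothetical:

```python
# Sketch of the nine-in-a-row run rule: return the index at which a run
# of `run_length` points above `mean` completes, or None if no run occurs.
def run_above_mean(points, mean, run_length=9):
    run = 0
    for i, p in enumerate(points):
        run = run + 1 if p > mean else 0  # reset whenever a point drops back
        if run >= run_length:
            return i
    return None

# Hypothetical daily lab-vs-standard differences around a target of zero:
# nine straight positive points signal a process shift.
diffs = [0.01, -0.02] + [0.04] * 9
print(run_above_mean(diffs, mean=0.0))
```

JMP's control chart platforms apply this as one of the standard runs tests; the sketch just shows the logic that triggers the flag.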
  So this has been hugely positive for us, and we're at a stage where we have this set up now and it's running every single day. And despite there being,
  you know, hundreds and hundreds, thousands even at this point, of data points, the script is set up so that it only shows the last 42 data points, which is the last week of data at any one time, even though the control limits are set up
  based on previous data in the process. So really good reaching out to the JMP Community here. In terms of
  the truck arrival times into the business, so in this case we have 54,000 tankers arriving into four different sites, you know, from midnight to midnight.
  And it's really good to give us a sense of the different distributions that exist in delivery times into the sites. And this is important for us, because we want to have a tanker coming to the site
  and offload the milk and leave the site as quickly as possible. We don't want to have the case where we have
  a lot of bunching at a site; you might have a scenario here whereby you have trucks lined up behind other trucks. We want it to be a nice flow all the way through the whole process.
  And here, you can see as an example, eight o'clock in the morning till nine o'clock,
  that's where, you know, people are probably gone for breakfast. Our traffic is heavier and there's not as much delivery into the sites, and same here, traffic is heavier in the evening
  and trucks are held up or the drivers are gone for dinner and so on. So we're able to dig down a little bit deeper into that and say okay, for each of the different sites, so two sites, in this case Clonmel Road and Castlefarm,
  and these sites are really close to each other, for the roads arriving north, south, east and west bound,
  what does it look like, and are we getting the clear distribution across the data that we want, or
  are there some hauliers that aren't delivering across the 24 hours from the east, as an example? And we can understand what's going on, and we can take that information as well,
  and say, okay, from a pump offload rate perspective, because in this case we have two sites with 11 pumps that are offloading
  over half a billion litres a year between them.
  And we want to try and understand which pumps are more or less efficient than others. And, clearly, we can see here that there are four pumps that have a higher offload rate,
  so they're able to pump faster and more milk in the same amount of time than maybe some of the other pumps. And you can see here that
  on Site 2, Pump 3 is the pump that is most efficient, whereas overall Pump 4 and Pump 3 are the least efficient pumps. So if we were going to have a maintenance overhaul, then we would know that these are the pumps that need to be overhauled first.
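The pump comparison boils down to a litres-per-minute ranking. Here's a toy Python sketch of the idea; the records and figures are invented for illustration (the real analysis was done on the offload data in JMP):

```python
# Hypothetical offload records (site, pump, litres, minutes) - invented
# numbers, not Dairygold data. Rank pumps by litres per minute to decide
# which pump to overhaul first.
records = [
    ("Site 1", "Pump 1", 28000, 40),
    ("Site 1", "Pump 3", 27000, 52),
    ("Site 1", "Pump 4", 26000, 55),
    ("Site 2", "Pump 3", 30000, 35),
]

rates = {(site, pump): litres / minutes
         for site, pump, litres, minutes in records}
fastest = max(rates, key=rates.get)   # highest litres/minute
slowest = min(rates, key=rates.get)   # first candidate for overhaul
print(fastest, slowest)
```

The same ranking over a year of real offload logs is what identifies the pumps worth maintaining first.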
  We use JMP to assess measurement systems for both excellence and opportunities, and we do this, you know,
  so many times, I can't even keep track anymore. In this case, looking at orthophosphates, and you know, it's a standard Gauge R&R, and the results are really good. There are
  19 distinct categories; 19 is excellent, and the Gauge R&R is 0.5%, where less than 10% is ideal, and it just means that, you know, we have really good
  measurement systems in place that are consistently reliable. Regardless of which part you hand to any analyst, they're going to come back and give you the same result.
  So the person performing the test isn't influencing the test results, which is what you always want to see.
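For anyone unfamiliar with those two headline Gauge R&R numbers, here's how they fall out of the variance components. The variances below are assumed purely for illustration, not the orthophosphate study's actual values; JMP estimates them from the crossed Gauge R&R model:

```python
import math

# Illustrative (assumed) variance components from a crossed Gauge R&R study.
var_repeatability = 0.0004    # equipment variation
var_reproducibility = 0.0001  # appraiser variation
var_part = 0.25               # part-to-part variation

var_grr = var_repeatability + var_reproducibility
var_total = var_grr + var_part

# % Gauge R&R: measurement variation as a share of total (under 10% ideal).
pct_grr = 100 * math.sqrt(var_grr / var_total)
# Number of distinct categories: 1.41 * sqrt(part / GRR), truncated.
ndc = int(math.sqrt(2 * var_part / var_grr))
print(round(pct_grr, 2), ndc)
```

With these assumed components the measurement system would pass easily, mirroring the kind of result described in the talk.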
  Sometimes we can have subjective measurement systems and we need to do attribute agreement analysis as well.
  So in this case we're looking at, you know, visual assessments of packaged products, and you can be looking for something like roundness or damaged edges or the embossing
  legibility and so on. So after improvements were made, you can see here that we have 90 to 95% agreement within appraisers.
  Overall, you can see the level of agreement is 82% in terms of the number of parts matched and agreed
  within and between each operator, but probably more important is the effectiveness, which is the alignment to standard.
  And we can see here afterwards that agreement with the standard is now 96%, so it's really good. And again, you know, we can ensure that our products are reliably graded even when the assessments are subjective.
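The effectiveness figure is simply agreement with the known standard. A tiny Python illustration with made-up grades (the real study used many more parts and appraisers):

```python
# Toy attribute agreement "effectiveness": the share of one appraiser's
# calls that match the known standard for each part. Grades are invented.
standard  = ["pass", "pass", "fail", "pass", "fail"]
appraiser = ["pass", "pass", "fail", "pass", "pass"]

matches = sum(a == s for a, s in zip(appraiser, standard))
effectiveness = 100 * matches / len(standard)
print(effectiveness)  # 80.0
```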
  We can obviously design experiments as well to quantify the effect of various factors. So in this case, we're looking at WPNI, which is whey protein nitrogen index, so this is a measure of
  protein denaturation in our product, so it's an indication of the level of heat treatment that the product has gotten, and that can be,
  excuse me, that can be really important, because the level of heat treatment can affect the application
  from a customer perspective. So if, let's say, the wrong level of heat treatment was applied, it may mean that for a customer that might be wanting to use our product to make into, maybe, yogurt,
  then the product wouldn't be suitable, because it might leave maybe a grainy mouthfeel or something, so that's not suitable for a product like yogurt.
  So we were having issues with inconsistent whey protein nitrogen index results, and there was some ambiguity noticed in our procedure,
  in that it was open to interpretation. So it might just say, you know, shake this vial,
  and somebody might say, yeah, I shake it for five seconds, somebody else might say, I just inverted it 10 times, and someone else would say, yeah, I shake it vigorously for 30 seconds.
  So everybody, with the best of intentions was following the procedure, but if the procedure's ambiguous, then it's open to interpretation and that can lead to inconsistency as well.
  So what we're trying to do is design trials to understand the factors that can influence variation in our process.
  So here we have, you know, a sodium chloride solution, so just a salt water solution, made up with a stirring bar and without a stirring bar.
  The duration on a hotplate is 45 or 60 minutes, and then, using that solution in this little test plan here, the water bath temperature is 37 and 40 degrees, using one of the four solutions up here, color coded.
  And then we have water bath shaking intervals and the number of drops of, I think it was hydrochloric acid, being used as well.
  And again, we're able to use the interactive explorer to see, when we have low WPNI results, what is going on with the process. It's a bit
  more difficult to understand what's going on here, so obviously we need to go a little bit further and start building some models to understand what's going on.
  The models are quite good here in terms of strength; R squared is just about 68%.
  And we can see straight off that the biggest main effect here is water bath temperature, followed then by the stirring bar and duration-on-hotplate interaction effect.
  And we can see as well which effect is pulling
  the process in which direction. And we have the valuable prediction profiler here on the bottom as well, so we can explore what's going on.
  What I like as well is that you can use the output from some of these experiments from an education perspective.
  So I spoke earlier about, you know, stirring bar and duration on hotplate being an important factor, and this is an interaction effect, so this is when the effect of one factor is dependent on the level setting of another.
  But you can, you know, try and explain the interaction effect to somebody on a page; we have crossed lines, and crossed lines mean there's a
  strong interaction, and if the lines aren't parallel but look like they're going to cross, it's a small interaction. Or you can actually take the data and plot it,
  and in this case, a lot of people haven't actually seen what an interaction with two factors looks like, and this is what an interaction looks like, so it can be used
  as an educational tool as well. So, in our case here, we have the WPNI result on our y axis,
  duration on hotplate at 45 minutes and 60 minutes. So if we stick at 45 minutes, where we don't have a stirring bar, and we stay at 45 minutes and switch to a stirring bar, we can see that we jump from this level to this level.
  Whereas when we go with 60 minutes, our WPNI level is here with no stirring bar, and when we jump to a stirring bar, we can see it jumping down. So
  the jump isn't as much here as it is here, but right away, this is showing that the interaction effect between stirring bar and duration on hotplate is quite big.
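Numerically, a two-factor interaction is just the difference between those two "jumps". A sketch with invented WPNI means (the real means came from the designed experiment in JMP):

```python
# A 2x2 interaction, sketched numerically with assumed cell means:
# the effect of adding a stirring bar differs by hotplate duration,
# which is exactly what crossing profile lines show.
wpni = {
    (45, "no bar"): 4.0, (45, "bar"): 6.0,   # jump of +2.0 at 45 min
    (60, "no bar"): 5.5, (60, "bar"): 4.5,   # jump of -1.0 at 60 min
}
jump_45 = wpni[(45, "bar")] - wpni[(45, "no bar")]
jump_60 = wpni[(60, "bar")] - wpni[(60, "no bar")]
interaction = (jump_45 - jump_60) / 2   # half the difference of simple effects
print(jump_45, jump_60, interaction)
```

If the two jumps were equal, the interaction term would be zero and the profile lines would be parallel.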
  And that's what this is coming through as, so again, a lot of process knowledge and understanding. We can use some classification tools as well, so in this case decision trees, to try and understand.
  So this is a query from R&D in relation to cheese browning, and it is, as it sounds, related to the browning of cheese
  in an oven when heat is applied, so a bit like on pizzas. We don't want the cheese to burn too soon, so if it's cheddar cheese on a pizza, you don't want it to burn too soon.
  You want it to have some bit of life and the recipe to be cooked. So here's an example of a decision tree, where we are looking at
  three different levels, if you like, light, medium and dark browning, represented in the tree by the colors green,
  amber and red, respectively. And we can see that pH at running is the most important factor here, in particular, if pH at running is greater than 6.48,
  then we can see that 99.3% of our process is going to have, you know, light browning, which is what we want. Green is good, red is bad. Similarly, we know that we're at risk if our pH is less than 6.48
  and our sample, so the cheese in our, literally, 20 kg block, is taken from a corner; then we know that we have a 31% chance of getting
  dark browning, so it's not going to be good. And from this perspective, we can see the size of the
  counts, the number of observations, behind that as well, in the terminal leaf nodes on the tree. So you can get
  a good sense of what's going on in your process, in terms of the important factors that best separate, in this case, the browning classification.
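The first split of such a tree can be mimicked in a few lines: partition on pH-at-running at 6.48 and look at the browning mix in each branch. The batches below are invented, and the real tree was grown in JMP:

```python
# Toy version of the tree's first split: (pH at running, browning grade).
# Data invented to echo the talk's numbers, not real cheese batches.
batches = [
    (6.55, "light"), (6.60, "light"), (6.52, "light"), (6.50, "light"),
    (6.40, "dark"),  (6.45, "light"), (6.42, "medium"), (6.38, "dark"),
]

high = [grade for ph, grade in batches if ph > 6.48]   # right branch
low  = [grade for ph, grade in batches if ph <= 6.48]  # left branch

light_rate_high = 100 * high.count("light") / len(high)
print(len(high), len(low), light_rate_high)
```

A real tree-growing algorithm picks the split that best separates the classes (by Gini impurity or similar), then repeats inside each branch.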
  And I say covert bootstrap forest here, because, you know, I think this is an underutilized tool within JMP, and if you understand what a decision tree does and how it works, you'll understand it's quite a powerful tool.
  And if I'm honest, I'm not sure if the algorithm that's used for the decision trees is the Iterative Dichotomiser or
  C4.5 or C5.0 or whatever it is, but what I do know is that predictor screening utilizes the decision tree algorithm in the background and uses a bootstrap forest, which is
  sampling with replacement as well, to create a bunch of these trees, stick them all together, and then give you the average results, if you like, of the trees. So in this case, pointing out that, and this is from the
  parallel plot example earlier for the 30 different machine parameters; in this case, you notice there's actually 31,
  because when I went back to the team and I said there's something different about this nozzle air pressure, it seems to be an important factor and the predictor screening is telling me that,
  A hand went up at the back of the room and they said actually,
  of all the parameters, that's probably not one that should have been in there, because it's not really a machine parameter; we just stuck it in ourselves to see if anything would jump out.
  And sure enough, this is the one that jumped off the page, so it had to be taken out and assessed separately.
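The "sampling with replacement" step behind a bootstrap forest can be illustrated in plain Python. This is a toy sketch of the bootstrap idea only, not JMP's bootstrap forest implementation (which also fits a tree to each sample and averages variable importance):

```python
import random

# Bootstrap sampling: each tree in a bootstrap forest is grown on a sample
# drawn WITH replacement from the training rows, so rows repeat and some
# rows (the "out-of-bag" rows) are left out of any given sample.
random.seed(1)
rows = list(range(10))   # indices of ten hypothetical training rows
n_trees = 3
samples = [[random.choice(rows) for _ in rows] for _ in range(n_trees)]

for s in samples:
    out_of_bag = sorted(set(rows) - set(s))   # rows this tree never saw
    print(sorted(s), "out-of-bag:", out_of_bag)
```

On average roughly 37% of rows are out-of-bag for each tree, which is what gives forest methods a built-in validation set.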
  And I'm nearly there as well, in terms of the presentation. So, in terms of
  using models to predict future performance, this is where we're moving from insights to foresight.
  And here's where we created a model that's fairly strong at 76% model effectiveness, based on 1,438 observations.
  The question was can we build a tool to help us predict both the number of orders that come into the business on any day of the week, any week of the year and
  also predict the number of orders that we would get...sorry, also predict the order quantities that we will get on that same day?
  So we were able to take all of the historical data and use the prediction profiler to select which year, from a seasonality perspective, is most comparable to what you're trying to predict.
  And then go to the applicable month, whatever date you want, and the day of the week and then determine whether it's a bank holiday or not.
  And you can see here, I have two different settings set up here. One is for when we have a bank holiday, one when we don't, and you can see that there's a significant
  interaction effect here between bank holiday and day of the week as well. So we can go back to the business and say, on
  whatever day of the week, say the 26th of October 2019, which is a Thursday and not a bank holiday weekend, you'd expect 120 orders to come in, and it's going to equate to just under 1,100 metric tons of product coming in as well.
  So we've actually run this over two weekends, a weekend where we had a bank holiday and one where we didn't, and it actually was 85% accurate, which is really good. So even though the model was 76%, it actually came out quite good in reality.
  Then from insight to foresight, where we're again trying to account for seasonal variability,
  we have different years here, represented by different colors, so 2015, 16, 17, 18 and so on. And we're looking at some of the feed quantity orders coming into
  the business. Each dot represents a weekly order quantity, and what I've done here is I've overlaid the actual predicted model, so the value that was predicted for
  each of those weeks. And you can see how the model is in terms of accuracy, so actual versus predicted.
  And then being able to use, and if I'm honest I can't recall whether it was seasonal exponential smoothing or seasonal ARIMA,
  a seasonal autoregressive integrated moving average, to determine what the next six months are going to be like.
  And again, hugely valuable to our business, where seasonality is a challenge, and we can use these tools with a high degree of confidence to understand what level of ordering we can expect into the business going forward.
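As a much simpler stand-in for those seasonal models, a seasonal-naive baseline shows the basic idea of carrying seasonality forward: forecast each future week with the value from the same week one season earlier. This is entirely a toy example, far simpler than seasonal exponential smoothing or seasonal ARIMA:

```python
# Seasonal-naive forecast: repeat the most recent full season.
def seasonal_naive(history, season_length, horizon):
    """Forecast `horizon` steps ahead by cycling the last season."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]

# Three fake "years" of four "weeks" of order quantities (invented data).
weekly_orders = [100, 120, 150, 130] * 3
print(seasonal_naive(weekly_orders, season_length=4, horizon=6))
```

Baselines like this are useful as a floor: a seasonal ARIMA or exponential smoothing model should comfortably beat them before it's trusted for ordering decisions.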
  And lastly, and I suppose it's a stretch for us, because we haven't gone into this zone before, but neural networks. So we're using neural networks, in this case, as a trial to understand if we can grade out sediment in our process. So we have ADPI results, so this is
  a sediment pad, and we're looking at the amount of possibly scorched particles, maybe from a drying process or so on, and where are we? Is it a level A, B, C or D?
  Here the sieve result represents the number of particles left behind in a sieve, and a tumbler result T1B1 represents top and bottom.
  T1 represents the number of particles that might be floating on top of the water, and B1 might represent the number of particles at the bottom, so the sediment that sank to the bottom.
  And again, using a neural network here with K-fold cross validation, so I think K was set at 10. We have nine
  activation functions here as well, I think: linear, Gaussian, and, I think, hyperbolic tangent as well.
  And we can use each of these
  input nodes as well to feed into our activation functions, and then we have our output layer here as well, which gives us the final result. So we're able to use,
  in this case, 6,245 results to
  train our model, and then, once we validated it using the cross validation as well, we can create our confusion matrix here, which is quite good as well. I mean, we don't have as many values in the A2
  down to B grades as we do for A1, but it is kind of showing us that, you know, if we do have an A2 grade,
  we predicted that 35 of them would be a true A2 and only two would be A1. So there can be some error, and the error can come from how the model was,
  you know, initially created by graders who were manually trying to assess the difference that exists in each of these, if and when
  some ambiguity comes up in the results, and then trying to make a final call. So I suppose the question here is, can we get a machine to predict what that final call will be with a good degree of accuracy?
  And again, we would expect that the probability of getting a grade assigned as A1 up here would be quite high if the tumbler result was T1B1, hardly any floaters or sinkers,
  and if the sieve results were quite low as well, in terms of the amount of particles, so we would expect to have our results up here for A1.
  The fact we're seeing A1 here could indicate that the model training wasn't quite as effective as it could have been, and there might have been some errors in the model.
  And again, you can see quite a complex model was created here
  for this, but, you know, it's a bit of a black box, as I think we all know, from a neural network perspective. But again, a huge amount of process knowledge and proof of concept for us, I suppose, really, as we start edging towards the whole machine learning space and AI.
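The confusion matrix and its accuracy can be reproduced in a few lines. The grades below are invented, not the 6,245 real results; the shape of the calculation is the same:

```python
from collections import Counter

# Toy sediment-grading confusion matrix: count (actual, predicted) pairs
# and compute overall accuracy. Labels are invented for illustration.
actual    = ["A1", "A1", "A1", "A2", "A2", "B", "A1", "A2"]
predicted = ["A1", "A1", "A2", "A2", "A2", "B", "A1", "A1"]

confusion = Counter(zip(actual, predicted))
correct = sum(n for (a, p), n in confusion.items() if a == p)
accuracy = correct / len(actual)
print(confusion[("A2", "A2")], round(accuracy, 3))
```

Reading down a row of the real matrix is what tells you, for example, how many true A2 samples the network mislabels as A1.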
  So I know I fired a lot at people, and it was fairly rushed, and I could probably speak for hours on the whole thing, and apologies if it did feel rushed, but what I would say, just from a concluding thoughts perspective,
  is that I think it's been demonstrated that, no matter what industry you're in, even if it's the agricultural industry, which is very far removed from the typical data analytics environment, you can use
  data analytics to add significant value to the business, and you can reach out and help almost every part of the business.
  Also, I would say don't underestimate the value of simple graphical analysis. It was only halfway through this presentation that we started to introduce some of the more powerful analysis that is embedded within JMP, and
  I think it's great to know how to use all the fancy tools, and there's a time and place for them, but, you know, simple can be really powerful too, and don't underestimate the value of it.
  Other than that, I would say thank you all for listening. I really appreciate it and if you have any questions or comments, they're more than welcome now. Thank you.

Great presentation Kieran. And you got through so much in the hour - impressive! 

Really fantastic overview. I would be interested in a follow-up presentation with a perspective on the journey Dairygold took to embrace this approach and rely on analytics tools. This presentation made it seem as if it is currently central to how the business operates which is great.


The presenter also made it seem like all of the data the organization generates is easily accessible for analysis which is very impressive on its own and could likely be a whole presentation to cover that topic alone. Well done!




Hi Kieran,


Is there a journal to go with this presentation? There are a few graphs I really like and am interested in re-creating - the journal would surely help in understanding the mechanics of creating these charts - thanks

Great presentation providing an overview of how various data analytics can provide value in a business. Nicely explained use cases and why this was of value to Dairygold! 


Great presentation Kieran. Good to see the variety of use cases across the business.