LEGO: Predicting New Sets That Have a Better Return on Investment Than Gold
LEGO has had a long history of manufacturing toys, particularly in the development of their branded interlocking plastic bricks. Over the years, this history has created generations of block-building enthusiasts who grew up with LEGO and continue to collect new sets into adulthood. It's no wonder that LEGO has become a semi-mainstream investment, as a study published in Science Direct in 2022 reported these sets have a better rate of return than even gold. Is it possible to predict which sets are worth purchasing over others?
In our study, we calculated return on investment (ROI) for each set based on the original USD MSRP of the set vs the current price it fetches. We then compared the LEGO set ROI against ROI for gold calculated from the average yearly opening/closing stock price for gold since 2014. The model then determined that a LEGO set should be purchased if the set's ROI was greater than gold's.
Of the five models tested, XGBoost was the winner: within the top 30% of LEGO sets ranked by the model, it performed 1.8 times better than a random selection.
Our research proved that many LEGO sets have a higher ROI than gold; our model can even predict which sets will have higher ROI. Collectors can use our model to determine if a purchase will be a worthwhile investment, and even novices can feel confident that they are making a valuable purchase.
This research used JMP Pro 17.
My name is Peter Vo.
I'm Keven Thunderberg.
Today, we're here to talk about the return on investment for LEGO versus gold. After this introduction, we'll be going over the data understanding, data preparation, the results, and then the conclusion. What's the first thing that comes to mind when you're looking at a LEGO set? For me, I can't help but just marvel at the cool factor of each set. But the second thing that I look at is the price tag.
A lot of new sets nowadays are getting up to $100 or even several hundred dollars. Back in 2022, Science Direct reported that some LEGO sets were worth even more than the price of gold. That begged the question, what considerations go into the retail price and market price of a LEGO set? Some of it has to do with maybe the generational presence of LEGO, the brand tie-ins, piece count, et cetera. But you're not guaranteed to make a profit just by purchasing an expensive set and sitting on it for a few years. How do we make that decision of whether or not to buy a LEGO set? We predict when we should buy a LEGO set. That's what we're trying to answer today.
Specifically, we want to determine, one, what variables, if any, contribute to the MSRP price for a LEGO set. Two, we want to create a model that calculates return on investment for a LEGO set. Three, we want to compare the return on investment for a LEGO set versus the return on investment for gold to determine if a LEGO set is worth investing in. I'll throw it over to Keven now to talk about our data understanding.
Thank you, Peter. The first step in a machine learning project is to look at what data we're even going to work with. It's very important to understand what our data even is, so we know how we can manipulate it if needed, and just to know what we are going to be modeling in particular. What we decided to do was look at LEGO sets from the past 10 years in particular. Our data set is a lot larger than that dating back to the 1970s. But as Peter will say in the next section, it wasn't the best back there.
To filter it down to newer sets over the past 10 years, we wanted to focus on the new sets that come out that will help our prediction there. Most sets have an MSRP or a suggested price of under $50, and we'll show that in figure 1 in just a little bit. Other variables were included as well. There's the number of pieces in the set that can range from just a few pieces for something that is just a minifig to thousands and thousands of pieces for some of the giant sets out there. We believe that is a big predictor of MSRP.
There's also the number of minifigures in the set. Of course, the minifigures are the little LEGO guys, little cute yellow guys. Some sets can have zero, some sets can have tens of those. Our data came from brickset.com, so the number of people who own the set and then the rating of the set, the star rating from 0-5, are from there. The number of instructions of the set and availability of the set: those were maybe a little bit less important, but we wanted to throw them in as well.
The availability in particular is whether it is from LEGO directly or if it's available at retail, everywhere, or if it's available just at LEGOLAND. There were a few like that. Finally, the theme group of the set was the most interesting one to think about, we thought. This is whether it's a licensed set. Harry Potter and Star Wars would both be licensed sets, versus LEGO DUPLO, the really big ones, those would be under the preschool theme group. We wanted to include that because we thought that might actually be a big predictor. Maybe some licensed sets are worth more than non-licensed sets. We don't know.
Going into what we did with the data next, we'll go back to Peter.
Thank you, Keven. We started out with nearly 15,000 rows of data, which dated back to all the sets that initially released back in the '70s. Not perfect, of course. Back then, they didn't have as stringent of bookkeeping as they do today, which resulted in us having to disqualify several rows from our analysis. Instead of doing that, we decided to focus on just the rows that listed MSRP and current market price.
We used median by MSRP to impute any missing data points like the mini-figs and the number of pieces, et cetera. Then we removed around 20 outliers based on their extremely high price points. There was at least one that was worth a few thousand dollars, and that reduced our positive skew in our distribution.
At that point, we had gone from 15,000 rows to 3,000. But appending our data set with Brickset, we were able to pump it back up to 8,000. After that, we appended another data set which had the yearly price of gold listed. It listed the opening and closing prices of gold by day, so we took the daily average and then extrapolated the yearly average price from that. After that, we calculated two new variables: ROI for gold and ROI for LEGO sets, which gave us two values that we could compare with our third column called buy. Our buy column would tell us to buy the LEGO set if its predicted ROI would be greater than that of gold.
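The ROI and buy calculation described here can be sketched in a few lines of Python. This is only an illustration, not the JMP workflow used in the study: it assumes ROI is the simple gain divided by the purchase price, that gold's ROI is measured from the set's release year to the current year, and that the daily gold price is the midpoint of open and close. The function names and all numbers are hypothetical.

```python
from statistics import mean


def roi(purchase_price, current_price):
    """Simple return on investment: gain divided by the purchase price."""
    return (current_price - purchase_price) / purchase_price


def yearly_gold_average(daily_rows):
    """Average the open/close midpoint per day, then average the days per year.

    daily_rows: iterable of (year, open_price, close_price) tuples.
    """
    by_year = {}
    for year, open_price, close_price in daily_rows:
        by_year.setdefault(year, []).append((open_price + close_price) / 2)
    return {year: mean(values) for year, values in by_year.items()}


def buy_flag(msrp, market_price, release_year, gold_by_year, current_year):
    """True when the set's ROI beats gold's ROI over the same holding period."""
    lego_roi = roi(msrp, market_price)
    gold_roi = roi(gold_by_year[release_year], gold_by_year[current_year])
    return lego_roi > gold_roi


# Made-up numbers for illustration only (not taken from the study's data).
gold_rows = [
    (2014, 1200.0, 1300.0),
    (2014, 1250.0, 1250.0),
    (2024, 2000.0, 2000.0),
]
gold = yearly_gold_average(gold_rows)
print(buy_flag(msrp=100.0, market_price=180.0, release_year=2014,
               gold_by_year=gold, current_year=2024))
```

With these made-up inputs the set returns 80% against 60% for gold, so the flag is true; a set that only reached $150 would return 50% and be flagged false.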
Now I'll throw it over to Keven to talk to us about the results.
All right. Thanks again, Peter. We ran a whole lot of analytical models within JMP. Rather than bore you with the details of all of that, we'll just go to the best model straightaway, which ended up being XGBoost, which is a gradient boosted decision tree model. We use that via an add-in that is in our references.
Before we go into that in particular, though, I wanted to talk about that figure 1 I mentioned before. This is MSRP. I'm zooming into it here. You can see that everything is skewed towards very low prices. We don't have very many examples of high-priced LEGO sets. Everything in that first column is between $0 and $50.
In our results, here is our confusion matrix. If you're unfamiliar with machine learning, the important parts to look at are the true negatives, the true positives here, as well as we listed our accuracy, and our F1 score. For this model in particular, we did want to look at and focus on the F1 score. F1 score is the harmonic mean of precision and recall. It basically is a really good metric on showing the true positives in a little bit of a different light.
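As a small illustration of the harmonic mean mentioned here, the sketch below computes precision, recall, and F1 from confusion matrix counts. The function name and the counts are made up for the example and are not the values from the poster.

```python
def f1_from_confusion(tp, fp, fn):
    """Return (precision, recall, F1) from confusion matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    F1 is the harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = f1_from_confusion(tp=60, fp=20, fn=40)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because true negatives do not appear in the formula, F1 focuses on how well the "yes" class is found, which is why it suits a problem where the positive predictions are what matter.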
We want to predict the yeses. We want to predict if it's going to be worth more than gold or if it's going to have a higher ROI than gold. With that being our metric, we went into other charts here. We went into our lift chart here. You can see that there is about this top 30%, that's what that 0.3 is. It's 1.8 times better than random chance at choosing the yes category. That's essentially 1.8 times better than just flipping a coin having 50-50.
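To make the lift figure concrete, here is a minimal sketch of how lift at a given portion of the data can be computed: rank the sets by predicted probability, take the top fraction, and divide the share of true "yes" cases there by the share in the whole data set. The function name, scores, and labels are hypothetical. Note that this baseline is the overall "yes" rate, which matches a 50-50 coin flip only when the two classes are balanced.

```python
def lift_at(scores, labels, fraction):
    """Lift of the top `fraction` of rows ranked by score.

    scores: predicted probabilities of the positive class.
    labels: 1 for an actual positive ("buy"), 0 otherwise.
    """
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Made-up scores and outcomes for ten sets, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
print(lift_at(scores, labels, 0.3))
```

In this toy data half of all sets are positives, while all three of the top-ranked sets are, so the top 30% is twice as rich in positives as a random pick.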
The ROC chart here—or receiver operating characteristic curve, as it's called—shows the sensitivity versus the specificity here. You can see we can set an area to be about that same 30%, and it's covering about 80% of true positive cases. That's going to be what we decided our metric was for deciding if it was a good choice or not.
But let's test it out. Here's a new set that I really enjoy. It's the Barad-dûr set from Lord of the Rings. It was just released in June this year. Here are all the data that we put into the model. We grabbed our model, put it into the formula depot here, and ran that data through it. It came up with these two numbers here.
In other words, the model is 54% confident that this set will have a higher ROI than gold. That means it is not going to be within that top 30% that I mentioned. It's not in this area here. It's going to be more, a little bit further out. With that being the case, if I was investing in a set, it would probably not be this one. I would want to run other new sets through the model and see if they come up with something a little bit higher.
To conclude, not only have we shown that many of the LEGO brick sets have a higher ROI than gold, but we created that XGBoost model to predict whether a given set will or not. Using our model, an investor can not only have really good guidance on what sets to invest in in the future, but if it's within that top 30%, they can say, "Okay, this is a pretty solid choice for a good investment." We'll go back to Peter for our references.
We wanted to thank our mentor, Kedar Adavadkar, for giving us guidance and keeping our project in scope, as a number of us were reaching too high with the number of features that we wanted to add. We also wanted to give thanks to Dr. Magan and Dr. C for their advice as well.
I want to give a special thanks to JMP for accepting our poster. We are really looking forward to going up to Cary in October and meeting all of you. Thanks again.
Thank you.