Exploring text to see what might happen in the stock market
When it comes to investing, my favorite strategy is the Boglehead approach developed by Jack Bogle, the founder and former CEO of Vanguard. He recommends investing in low-cost index funds that attempt to match the market's returns.
For this blog post, I decided to see if I could come up with a group of stocks to outperform the Dow Jones index.
The goal is to pick a few of the components of the Dow Jones Industrial Average that would outperform the entire index over a fiscal quarter. To make my selection of high-performing stocks, I used earnings call transcripts from the past three years. The transcripts were coupled with historical stock prices to make a statistical model and project the stock price for the end of the next quarter based on what was said in the earnings call.
Dealing with Unstructured Data
Earnings call data comes as unstructured text, which presents some unique challenges when you try to garner information from it. These challenges include the usability, relevance, and quality of the data. The Text Explorer platform in JMP 13 offers nice tools to deal with them, and I will show off some of those tools in this blog entry.
The earnings call transcripts for each of the 30 companies that make up the Dow Jones Industrial Average can be found online. The most recent earnings call for each company occurred sometime during the fall, ranging from 9/27/2016 for Nike to 11/17/2016 for Walmart. For the historical stock price information, I used the Yahoo finance fetcher JMP add-in. For the current stock price, I used the opening price for the day of the earnings call.
Exploring the Text
Once the data was imported into JMP, I used Text Explorer to parse the data, add stop words, and find common phrases. For the data parsing, I used the default settings in Text Explorer, which uses regular expressions. I also turned on the stemming for combining option, which combines terms that share the same root word. You can see an example of stemming in the screenshot below (from Walt Disney’s transcripts). The first term in the term list is franchis·, which contains all of the words that have the root franchis. For this example, those words are franchise, franchised, and franchises. Anytime you see the · symbol, it is an indication that stemming has been applied. The stop words that I added to this analysis are words that don't offer explanatory power. A few that were used in this case were webcast, quarter, and good morning. Common phrases in the documents appear on the phrase list shown below. I added these common phrases to the analysis. Certain phrases, such as vice president (highlighted in red font), are added to the analysis by default; these can be removed by adding them as stop words.
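For readers who want to try the same idea outside JMP, here is a minimal Python sketch of stop-word removal and stemming. It is not how Text Explorer works internally. The crude_stem function is a toy suffix stripper written for this illustration, and the stop-word list and sample sentence are assumptions apart from webcast, quarter, and good morning, which come from the post.

```python
import re
from collections import Counter

# Stop words and phrases. "webcast", "quarter", and "good morning" come from
# the post; the remaining entries are assumptions added for this example.
STOP_PHRASES = ("good morning",)
STOP_WORDS = {"webcast", "quarter", "our", "and", "this"}

def crude_stem(word):
    """Toy stemmer: strip one common suffix, keeping at least three letters."""
    for suffix in ("es", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(text):
    """Lowercase, drop stop phrases and stop words, stem, and count terms."""
    text = text.lower()
    for phrase in STOP_PHRASES:
        text = text.replace(phrase, " ")
    tokens = re.findall(r"[a-z]+", text)
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

# Hypothetical sentence, not taken from any real transcript.
sample = ("Good morning. Our franchise grew; franchised stores "
          "and franchises expanded this quarter.")
counts = term_counts(sample)
print(counts.most_common(3))
```

In this sketch franchise, franchised, and franchises all collapse into the single term franchis, which mirrors the franchis· entry described above. A real project would use an established stemmer rather than this suffix list.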
After this process was complete, I saved the document term matrix (DTM), which converts the unstructured text data into a usable numeric structure. I also performed topic analysis in the Text Explorer platform (available in JMP Pro). Topic analysis uses a rotated singular value decomposition (SVD) of the document term matrix to group words and phrases into topics. I then saved the topic vectors to the data table as well. Below is an example of the topic analysis output performed on the Apple transcripts. The topic words are the words most representative of each topic, and they give me a good idea of the theme of each topic. These topics can help add explanatory power to the models.
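To make the idea of a document term matrix concrete, here is a small Python sketch that builds one from already-cleaned term lists. The three documents are hypothetical stand-ins for transcripts, and the function name is my own. The rotated SVD step is not shown; in practice that would be run on this matrix with a linear algebra library.

```python
def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    vocab = sorted({term for doc in docs for term in doc})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term in doc:
            matrix[i][column[term]] += 1
    return vocab, matrix

# Hypothetical term lists, one per document (not real transcript content).
docs = [
    ["iphone", "revenue", "growth", "iphone"],
    ["services", "revenue", "growth"],
    ["iphone", "services", "china"],
]
vocab, dtm = document_term_matrix(docs)

print(vocab)
for row in dtm:
    print(row)
```

Each row is one document and each column is one term, so the text becomes a numeric table that a modeling platform can use as predictor columns.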
Building Statistical Models
The next step is to make a statistical model for each component of the Dow Jones index. Before I built the statistical models, I set up a validation column to hold back a portion of the data. In this case, I used the last year of data as the validation set. This technique helps guard against overfitting the models. One thing to keep in mind is that some of the validation data could have leaked into the training set through the use of Text Explorer, because the text was processed before the data was split. Also, with this small amount of data, you might consider using a k-fold cross validation method in place of the validation column.
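The holdback described above amounts to a split by date. Here is a minimal Python sketch, assuming twelve quarterly rows and a cutoff at the start of the final year; the dates, the call_date field name, and the cutoff are made up for illustration and do not reflect the actual call dates.

```python
from datetime import date

def split_by_date(rows, cutoff):
    """Put rows dated before the cutoff in training and the rest in validation."""
    train = [r for r in rows if r["call_date"] < cutoff]
    validation = [r for r in rows if r["call_date"] >= cutoff]
    return train, validation

# Twelve hypothetical quarterly earnings calls over three years.
rows = [
    {"call_date": date(year, month, 25)}
    for year in (2014, 2015, 2016)
    for month in (1, 4, 7, 10)
]

# Hold back the last year of calls as the validation set.
train, validation = split_by_date(rows, cutoff=date(2016, 1, 1))
print(len(train), "training rows,", len(validation), "validation rows")
```

Splitting by time rather than at random keeps the validation rows strictly later than the training rows, which is closer to how the model would be used for forecasting.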
The data I am using for this exploration has lots of columns and not a lot of rows, also called wide and narrow data. JMP and JMP Pro offer a wide range of modeling techniques that can handle this type of data. I tried out a number of different models, including decision trees, bootstrap forest, boosted trees, neural networks, and Generalized Regression. Based on the model comparison and the specific data set that I am looking at here, I found that the boosted neural networks had the best blend of performance and parsimony. Below is an example of one model showing the actual by predicted plots.
Actual by predicted plots for training and validation data set
Selecting Stocks for My Index
As I was making these models in JMP, I ran into a couple of models that would not converge: Nike and Apple. Looking at the data, I realized that both of them had a very large change in stock price from one quarter to the next. For Apple, this happened in the middle of 2014. For Nike, it was at the end of 2015. As some of you might know, the timing coincides with stock splits for each of those companies. For modeling purposes, I normalized prices to current levels.
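Normalizing for a split can be sketched in a few lines of Python: every price dated before the split is divided by the split ratio so the whole series is on post-split terms. The prices, dates, and 2-for-1 ratio below are hypothetical, and the function name is my own.

```python
from datetime import date

def adjust_for_split(prices, split_date, ratio):
    """Divide each price dated before split_date by ratio; leave later prices alone."""
    return [
        (day, price / ratio if day < split_date else price)
        for day, price in prices
    ]

# Hypothetical quarterly prices around a hypothetical 2-for-1 split.
raw = [
    (date(2015, 9, 24), 120.0),
    (date(2015, 12, 22), 130.0),
    (date(2016, 3, 22), 62.0),
]
adjusted = adjust_for_split(raw, split_date=date(2015, 12, 24), ratio=2)
for day, price in adjusted:
    print(day, price)
```

Without the adjustment, the drop from 130.0 to 62.0 looks like a collapse in price; after it, the series moves from 65.0 to 62.0, which is the kind of change a model can reasonably fit.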
After making all the models, I saved the prediction formula to forecast where the stock price will be at the next earnings call. While the dates of the earnings calls vary, most of the companies have their next earnings call at the end of January 2017. Below is a table of the results of all the predicted stock price changes. These predictions all come from the boosted neural network models made from the earnings call transcripts. Looking at the results below, I selected UnitedHealth Group, Caterpillar, Merck, Goldman Sachs, and McDonald’s. The goal is to see if these picks can outperform the Dow Jones Index from the previous earnings call until the next earnings call around the end of January.
The Dow Jones index is calculated using a price-weighted average. That means that the stocks with higher prices have a bigger weight or influence on the index. I used this same method to create my index and "purchased" one share of each of the five components, with the more expensive components making up a larger portion of the index. To get one share of each component on the day of the last earnings call, I would spend $570.61 (see the table below).
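Here is a short Python sketch of what price weighting means when you hold one share of each component. The tickers and prices are placeholders, not the actual picks or their prices.

```python
def price_weights(prices):
    """Weight of each holding when one share of every component is owned."""
    total = sum(prices.values())
    return {ticker: price / total for ticker, price in prices.items()}

# Hypothetical tickers and share prices.
prices = {"AAA": 200.0, "BBB": 100.0, "CCC": 100.0}

cost_of_one_share_each = sum(prices.values())
weights = price_weights(prices)

print("Cost of one share of each:", cost_of_one_share_each)
for ticker, weight in weights.items():
    print(ticker, f"{weight:.0%}")
```

The most expensive stock gets the largest weight regardless of the size of the company, which is the defining feature of a price-weighted index.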
For comparison, I will "invest" the same amount of money in the Dow Jones Index at the end of October, around the time of these calls. The Dow Jones Industrial Average opened at 18176.60 on Oct. 31; that will be the original time of investment. Assuming I could buy shares of the Dow Jones Industrial Average, I would get 0.03139 shares of the average. The models predict that the $570.61 will increase to $629.89 if invested in the five picks I made using JMP, which is an increase of 10.4% in one quarter. To get an estimate of where the Dow Jones Industrial Average will be at the end of January 2017, I took the predicted stock price for each component and summed them. Then I divided that sum by the Dow Divisor, which is 0.14602128057775. The result for that projection is shown below. With an estimated close of 18696.73, the $570.61 investment would increase to $586.93 or 2.86% in one quarter.
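The arithmetic in this comparison can be reproduced with a few lines of Python. All the figures come from the paragraph above; the variable and function names are my own, and results may differ from the quoted figures by a cent because of rounding.

```python
# Figures quoted in the post.
investment = 570.61            # cost of one share of each of the five picks
dow_open = 18176.60            # Dow Jones Industrial Average open on Oct. 31
predicted_picks_value = 629.89 # model-predicted value of the five picks
predicted_dow_close = 18696.73 # projected Dow close at the end of January 2017
DOW_DIVISOR = 0.14602128057775

def price_weighted_index(component_prices, divisor):
    """Index level for a price-weighted average: sum of prices over the divisor."""
    return sum(component_prices) / divisor

# Fractional "shares" of the index bought with the same amount of money.
dow_shares = investment / dow_open

# Predicted one-quarter returns for the picks and for the index.
picks_return = predicted_picks_value / investment - 1
dow_return = predicted_dow_close / dow_open - 1
dow_value = dow_shares * predicted_dow_close

print(f"Dow shares: {dow_shares:.5f}")
print(f"Picks return: {picks_return:.1%}")
print(f"Dow value: {dow_value:.2f}, return: {dow_return:.2%}")
```

Passing the 30 predicted component prices to price_weighted_index with DOW_DIVISOR gives the projected index level in the way the paragraph describes.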
The next step is to wait and see how my predictions actually do when compared to the Dow Jones index and revisit this with a blog entry in early February.
There are a few things I would change if I were doing this for actual investing as opposed to a fun project I devised over Thanksgiving week:
I would definitely go back further in time to get more earnings call transcripts. Twelve is a good start, but more data would improve these models.
Since this blog post focuses on showing how Text Explorer in JMP can be used, I used only the unstructured data from the earnings calls. To improve these models, I would also use information like revenue and earnings per share.
Last but not least, I would not pick an election cycle to attempt to predict stock prices in the future!