I have been playing with the database of tweets sent out by Donald Trump (http://www.trumptwitterarchive.com/archive). Looking at the number of tweets per day from the 1st of January, it turns out that they follow a Gamma distribution (who knew).
The green line on the PDF and CDF plots above is the fitted gamma distribution.
Looking at the data, it turns out that it is not as unusual as I thought for Donald to have fewer than 5 tweets per day. Other than last Friday and Saturday, there have been 6 days since January with 5 or fewer tweets:
The median number of tweets per day is 30.
After fitting the gamma distribution, we can derive the likelihood of any particular number of tweets. The likelihoods of the number of tweets on Friday, Saturday, and Sunday are:
On Monday, he had 31 tweets. So far today (Tuesday 10/6), he has had 45 tweets. Clearly he is feeling better.
Combining the 3 likelihoods above, it is 0.0004% likely (4 parts per million) or less with 97.5% confidence that the low number of tweets on Friday through Sunday would occur naturally.
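For anyone who wants to reproduce this kind of calculation, here is a minimal Python sketch of the steps: fit a gamma distribution, compute the lower-tail probability of each low day, and multiply. It rests on several assumptions that are not from the original analysis: the `history` counts are invented placeholders (not the archive data), the fit uses the method of moments rather than whatever method produced the plots above, and multiplying the tail probabilities treats the days as independent, which daily tweet counts may well not be. The helper names `gamma_cdf` and `fit_gamma_moments` are hypothetical.

```python
import math
from statistics import mean, pvariance


def gamma_cdf(x, shape, scale):
    """P(X <= x) for a gamma distribution, using the power series for the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    # All terms are positive and eventually shrink, so this loop terminates.
    while term > 1e-16 * total:
        term *= z / (shape + n)
        total += term
        n += 1
    return total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))


def fit_gamma_moments(counts):
    """Method-of-moments fit: match the sample mean and variance."""
    m = mean(counts)
    v = pvariance(counts)
    return m * m / v, v / m  # (shape, scale)


# Placeholder daily counts, invented for illustration (NOT the archive data).
history = [12, 25, 30, 41, 18, 33, 55, 27, 30, 62,
           22, 36, 48, 15, 29, 38, 80, 31, 26, 44]

shape, scale = fit_gamma_moments(history)

# Lower-tail probability of seeing at most 2 and at most 3 tweets in a day.
p_2 = gamma_cdf(2, shape, scale)
p_3 = gamma_cdf(3, shape, scale)

# Joint probability ONLY under the assumption that the days are independent.
joint = p_2 * p_3

print(f"shape={shape:.3f} scale={scale:.3f}")
print(f"P(X<=2)={p_2:.3g}  P(X<=3)={p_3:.3g}  product={joint:.3g}")
```

The product is only as trustworthy as the independence assumption. If a low day makes the next day more likely to be low as well, the true joint probability is larger than the product suggests.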
I can not think of a metric better suited to measuring the wellbeing of the president than the number of times he tweets per day. He was very seriously ill. His tweets do not lie.
I am not sure of the rules for combining data points with a particular confidence. It seems that combining 3 probabilities with 97.5% confidence would result in a probability with much higher than 97.5% confidence. Can anyone shed any light on this? Also, if anyone can see errors in my analysis (there are probably many), feel free to point them out. I have attached the raw data so you can play with it yourselves.
Hi P, I have not tried that. I thought about it, but do not know much about control charts. The raw data is attached to the original post if you want to try it out and post what you find.
Good call @P_Bartell. Here is a pic of the I, MR chart. The # of tweets is inconsistent. There are some unusual points in the data. How reliable is the data? How sure are we the data is from the day it is counted in? But certainly the # of tweets was not unusual (and therefore not assignable per Shewhart) when the events recently occurred (the ones you are trying to assign a cause to "feeling better or not").
Thanks @statman. Thanks for putting together the charts. Yes, the points are reliable. They came from http://www.trumptwitterarchive.com/archive as noted in my original posting. I agree that the number of tweets is inconsistent (I would not expect the daily number of tweets from anyone to be consistent), but the numbers are accurate.
I have been thinking about this, and I do not think that an I, MR chart is the right tool for this analysis. Control charts work on parameters that follow a normal distribution. As noted in my posting, the number of tweets does not follow a normal distribution, but a gamma distribution, which is heavily skewed to the right. As the anomaly that I am looking into occurs way over on the left of the distribution, it is not surprising that the I, MR chart does not pick it up.
I think that my original analysis is more appropriate than an I, MR chart, but I am open to being wrong about this, and very open to hearing others' feedback and suggestions.
Control charts do not assume (or require) a normal distribution to be effective. See Wheeler and Chambers "Understanding Statistical Process Control" or Shewhart "Economic Control of Quality of Manufactured Product" or Deming "Out of the Crisis", p. 334-335. For a quick read, Wheeler and Neave's paper "Shewhart's Charts and the Probability Approach", 1996, clearly makes this point.
Regarding your statement "(I would not expect the daily number of tweets from anyone to be consistent)", There is a big difference between inconsistent and consistent with large variation. Why would you not expect random variation in anyone's tweets?
Not to argue, but how do you KNOW the numbers are accurate or reliable? Does that number include tweets that were deleted or censored? Does a message that takes 5 tweets (due to the limitation of length of a tweet) count as 5 or 1? Who is responsible for maintaining the referenced website? Are they unaffiliated with any political organization?
I have done some reading up on control charts, and now understand that the distribution does not have to be normal, but some adjustment is needed if it is non-normal. There is clearly something wrong when a control limit (-22.6 tweets per day for the lower control limit in statman's chart) is not physically possible.
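To see where a negative lower limit can come from, here is a minimal Python sketch of the conventional individuals (I) and moving range (MR) chart arithmetic. It assumes the chart above was built the usual way, with limits at the mean plus or minus 2.66 times the average moving range and an MR upper limit of 3.267 times the average moving range; I do not know the exact settings that were used. The `daily` counts are made up for illustration and the function name `imr_limits` is hypothetical.

```python
from statistics import mean


def imr_limits(values):
    """Conventional I-MR chart limits from a series of individual values."""
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    mr_bar = mean(moving_ranges)
    return {
        "center": center,
        "lcl": center - 2.66 * mr_bar,  # can fall below zero for count data
        "ucl": center + 2.66 * mr_bar,
        "mr_bar": mr_bar,
        "mr_ucl": 3.267 * mr_bar,
    }


# Made-up, highly variable daily counts (NOT the archive data).
daily = [5, 60, 12, 85, 20, 110, 8, 45, 30, 95]

limits = imr_limits(daily)
for name, value in limits.items():
    print(f"{name}: {value:.1f}")
```

The limits are symmetric about the mean by construction, so whenever 2.66 times the average moving range exceeds the mean, the computed lower limit is negative even though the counts themselves never can be.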
Continuing with the assumption that the daily number of tweets follows a gamma distribution, I selected zone limits based on the same probabilities as those in a normally distributed control chart (15.9% and 84.1% for Zone A, 2.28% and 97.72% for zone B, and 0.135% and 99.865%). The two points at the beginning of October are highlighted as out of control because they violate Rule 3 (2 out of 3 points in Zone A).
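The percentile matching described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the calculation behind the chart: the gamma parameters `shape` and `scale` are hypothetical values rather than the fitted ones, and the quantile function is a simple bisection on a hand-written CDF (`gamma_cdf` and `gamma_ppf` are names introduced here). The tail probabilities come from the standard normal at 1, 2 and 3 standard deviations, which reproduces the 15.9%/84.1%, 2.28%/97.72% and 0.135%/99.865% pairs.

```python
import math
from statistics import NormalDist


def gamma_cdf(x, shape, scale):
    """P(X <= x) for a gamma distribution (series for the regularized
    lower incomplete gamma function)."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (shape + n)
        total += term
        n += 1
    return total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))


def gamma_ppf(p, shape, scale):
    """Quantile of the gamma distribution, found by bisection on the CDF."""
    lo, hi = 0.0, shape * scale
    while gamma_cdf(hi, shape, scale) < p:  # expand until p is bracketed
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical gamma parameters, chosen only for illustration.
shape, scale = 4.0, 8.0

norm = NormalDist()
zone_limits = {}
for k in (1, 2, 3):
    p_low, p_high = norm.cdf(-k), norm.cdf(k)
    zone_limits[k] = (gamma_ppf(p_low, shape, scale),
                      gamma_ppf(p_high, shape, scale))
    print(f"{k} sigma equivalent: p=({p_low:.5f}, {p_high:.5f}) "
          f"limits=({zone_limits[k][0]:.2f}, {zone_limits[k][1]:.2f})")
```

Because every limit is a quantile of a distribution that lives on positive values, none of the lower limits can be negative, and the limits are no longer symmetric about the center line.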
Sorry for the confusion about consistent. I am not quite sure what you mean by "The # of tweets is inconsistent". If it is that the points are not distributed normally, I agree, they appear to be distributed with a gamma distribution.
Thanks for requesting that I double check the data source. According to the site itself, it includes all deleted tweets from 1/27/17 onwards. Prior to this, they checked Donald's twitter account hourly, so could miss a tweet that was posted and then quickly deleted. Since 1/27/17 they check the account in real time, so capture tweets before they can be deleted. I do not see it as relevant whether a single message is split among multiple tweets, as this is a part of the natural variability in the data. The data set certainly matches the most recent tweets that I see on twitter. If you can see some bias that I have missed, please share.
To summarize, I think that a control chart is a good method of highlighting areas of the data that should be looked into. I do not think that it is a good tool for generating the specific likelihood of a section of the data. The analysis that I performed seems more appropriate for this, notwithstanding the open question about how to produce a single confidence level from combined data points. If anyone has insight into how to generate such a confidence, or can see flaws in the analysis or better ways to perform it, please share.
Roly, Always happy to help folks "broaden their horizons"!
Perhaps a bit more broadening...Your statement "now understand that the distribution does not have to be normal, but some adjustment is needed if it is non-normal." is incorrect and highlights a major misconception of the control chart methodology (if you disagree, perhaps you can cite the section of the literature that supports your argument). If the problem were enumerative, I would be interested in the distribution (and the probabilities associated with it), but since the problem is analytical, the distribution is of little consequence. I suggest you read Deming "On Probability As a Basis For Action", The American Statistician, November, 1975, Vol. 29, No. 4. Admittedly, it would appear more rational if the lower control limit were truncated to 0, but most if not all users recognize this in their assessment of the charts. The software is simply doing the math without regard to interpretation. I would say most if not all statistical outputs need to be interpreted IN CONTEXT!
Regarding your comment:
"Sorry for the confusion about consistent. I am not quite sure what you mean by "The # of tweets is inconsistent". If it is that the points are not distributed normally, I agree, they appear to be distributed with a gamma distribution." I understand this concept is difficult to grasp. Perhaps this is the reason so many do not know how to use control charts appropriately. I'll say this again: inconsistency has nothing to do with a distribution (p. 359, Deming, "Out of the Crisis"). The beauty of the Shewhart control charts is they assume no distribution. They provide a graphical look at the data in a time series (this is lost in any distributional analysis!). "A distribution only presents accumulated history of performance of a process, nothing about its capability...The capability of a process can be achieved and confirmed by use of a control chart, not by a distribution", Deming, "Out of the Crisis", p. 314. The range chart answers the question: Is the variation within subgroup (due to the variables changing within subgroup) consistent, stable and therefore predictable? Consistent in this reference means within an expected, predictable amount of variation. For the MR chart, the question being answered is similar: Is the variation in consecutive data points (due to the variables changing between consecutive data points) consistent, stable and therefore predictable? So when I say the number of tweets is inconsistent, this means the process of posting tweets is not predictable, not random and unstable. It would be difficult to summarize, with any confidence, this process with any enumerative statistics or make any probability statements, as the statistics/probabilities would depend on when you looked at the data.
Regarding the data source...You can't possibly answer who sent the tweets, only that they came for some account registered to "Donald". I do not have direct knowledge of the actual number of tweets, nor who sent them. Is there bias in the measurement process? Are the numbers accurate? I can't answer that. But I could suggest a sampling plan to provide insight.
The distributional analysis with associated probabilities is inappropriate for a process that is inconsistent, non-random and unpredictable.
I leave you with this quote, again from Deming (paper referenced above):
“Analysis of variance, t-test, confidence intervals, and other statistical techniques taught in the books, however interesting, are inappropriate because they provide no basis for prediction and because they bury the information contained in the order of production. Most if not all computer packages for analysis of data, as they are called, provide flagrant examples of inefficiency.”
Hi Bill. I am finding your postings quite confusing, and took a few days to ruminate on them before responding.
The portion of your most recent posting that I find most perplexing is "You can't possibly answer who sent the tweets, only that they came for some account registered to "Donald"". http://www.trumptwitterarchive.com/archive purports to be an archive of the posts from @realDonaldTrump. When I review the most recent posts listed to those on my twitter feed they match. I have not gone through the 54K postings listed and checked that they all match, but all those that I have checked match. I agree that it is key to check the veracity of data sources. However, I prefer to work with data than with what ifs. If you have data that the archive is not an accurate record of the tweets by @realDonaldTrump, or that the tweets sent by @realDonaldTrump do not come from our President, please share.
You are clearly much more learned than I on control charts, but I am still struggling with your statement that "since the problem is analytical, the distribution is of little consequence". I very much agree that statistical outputs need to be interpreted in context, and one of the contexts that needs to be considered is the distribution that the original data follows. I do not have access to the book or papers that you reference. If you could share links to them, that would be helpful. The reference that I used for how to adjust control charts for non-normal data is https://www.spcforexcel.com/knowledge/variable-control-charts/control-charts-and-non-normal-data. Of the 4 methods that they outline, the 4th (adjusting the zone limits to match their percentiles if the distribution were normal) seemed to me the most appropriate for this situation, and completely resolves my original concern of the lower control limit being physically unachievable.
I did manage to find a copy of Wheeler and Neave's paper "Shewhart's Charts and the Probability Approach", 1996 (http://www.spcpress.com/pdf/DJW088.pdf). A very interesting read. Thank you for sharing. From the paper I see that the zone adjustments that I made mean that the chart is no longer a real control chart as Shewhart intended. However, I cannot overcome the issue of a control chart producing a control limit that it is not physically possible to achieve (negative tweets per day). This removes the ability for the control chart to distinguish between controlled and uncontrolled variation (its purpose) for data points below the mean. Perhaps we can both agree that a control chart is not a suitable tool to investigate the likelihood of @realDonaldTrump sending only 2 tweets on 10/2 and only 3 tweets on 10/3 (sorry @P_Bartell).
Unless there is more that you would like to discuss I will stop posting to this thread, and set up a new one asking my original question on how to combine the confidence limits for sequential observations.
Thanks for the time you have spent responding to my posts. I really appreciate your feedback.
I'll make this my last response as well. All I can say is I tried. I'm sorry I am having trouble communicating my points. Here are links to reports that tweets from @realDonaldTrump may not be from just the president.
Google it. If you can't trust the measurement system I'm not sure what analysis you have confidence in?
I already gave you multiple sources on the correct use and interpretation of the control chart method. And yes, there are many, many examples of the inappropriate use of control charts. This quote is from your reference: "Have you heard that data must be normally distributed before you can plot the data using a control chart? Quite often you hear this when talking about an individuals control chart. This is a myth. Data do not have to be normally distributed before a control chart can be used – including the individuals control chart."
Deming's "Out of the Crisis" (https://mitpress.mit.edu/books/out-crisis)
Deming's paper: https://www.jstor.org/stable/i326388
By the way, it looks like you only used data from 2020. Here is a graph of tweets since 2009. It also has multiple categories for Original, re-tweet, Replies, Links/images, Tweets.