I have produced a file of some 60,000 tweets mentioning both "Hawaii" and "Covid."
If I use the Row Selection command to identify duplicates by Tweet.id, a presumably unique number, I get some 23,000 putative duplicates. This is a snippet:
What is clear is that the records grouped together under the same tweets.id are not the same records, judging by the author id, and most importantly, by the text. I stored all of the ID variables as text upon reading in with the jstor application.
Is it conceivable that the id numbers have been truncated? Tweet ids presumably are built in part from a timestamp, so that they are not likely to be consecutive numbers.
maybe this: https://developer.twitter.com/en/docs/twitter-ids
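If you keep the ID in a numeric variable, you will lose some of the 64-bit integer data, because a double has only 53 bits of precision in its significand. Integers above 2^53 (roughly 9.0e15) cannot all be represented exactly, so distinct IDs can collapse to the same stored value and show up as duplicates.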
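Here is a rough sketch of that effect in Python, in case it helps to see it. The two IDs are made-up 19-digit values, not real tweet IDs, and `collapses_as_double` is just a helper name I picked for illustration.

```python
def collapses_as_double(id_a: int, id_b: int) -> bool:
    """Return True if two distinct integer IDs become equal once stored as doubles."""
    return id_a != id_b and float(id_a) == float(id_b)


# Hypothetical 19-digit IDs that differ only in the last digits.
id_a = 1234567890123456789
id_b = 1234567890123456790

# As doubles, both values round to the same nearest representable number.
print(collapses_as_double(id_a, id_b))

# Round-tripping through a double does not give the original ID back.
print(int(float(id_a)) == id_a)

# Kept as text, the two IDs stay distinct.
print(str(id_a) == str(id_b))
```

If the IDs were numeric at any point before being converted to text, the damage would already be done, so it may be worth checking that they are read in as character data from the start.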