I have produced a file of some 60,000 tweets mentioning both "Hawaii" and "Covid."
If I use the Row Selection command to identify duplicates by Tweet.id, a presumably unique number, I get some 23,000 putative duplicates. This is a snippet:
It is clear that the records grouped together under the same tweets.id are not the same records, judging by the author id and, most importantly, by the text. I stored all of the ID variables as text when reading the data in with the jstor application.
Is it conceivable that the id numbers have been truncated? Tweet ids presumably are built in part from a timestamp, so that they are not likely to be consecutive numbers.
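To make the truncation hypothesis concrete, here is a minimal Python sketch of one way it could happen. It assumes the IDs passed through a 64-bit floating-point representation at some step before being stored as text; I have not confirmed that this is what occurred. The two IDs are made up for illustration and are not from my file. The timestamp helper assumes the commonly documented Snowflake layout, in which the bits above the lowest 22 hold milliseconds since 1288834974657.

```python
# Hypothetical tweet IDs, invented for illustration (not from the real data).
id_a = 1364000000000000001
id_b = 1364000000000000002

# A 64-bit float has a 53-bit significand, so integers above 2**53
# cannot all be represented exactly.
MAX_EXACT_INT = 2**53  # 9007199254740992


def through_double(n: int) -> int:
    """Simulate an integer ID passing through a 64-bit float column."""
    return int(float(n))


def snowflake_timestamp_ms(n: int) -> int:
    """Creation time in ms since the Unix epoch, assuming the Snowflake layout."""
    twitter_epoch_ms = 1288834974657  # assumed Snowflake epoch offset
    return (n >> 22) + twitter_epoch_ms


if __name__ == "__main__":
    print(id_a > MAX_EXACT_INT)                          # True
    print(id_a == id_b)                                  # False
    print(through_double(id_a) == through_double(id_b))  # True
    print(through_double(id_a))                          # 1364000000000000000
```

If something like this happened, many of the stored IDs would end in a run of zeros or be multiples of a power of two, which can be checked directly in the ID column. Re-reading the source with the ID field taken as a string from the start (many Twitter API payloads also carry a string version of the ID, such as id_str) would avoid the problem.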