LNitz
Level III

Tweets in JMP: Duplicate tweet.id numbers that are not duplicate tweets

I have produced a file of some 60,000 tweets mentioning both "Hawaii" and "Covid."

If I use the Row Selection command to identify duplicates by Tweet.id, a presumably unique number, I get some 23,000 putative duplicates. This is a snippet:

[Image: LNitz_0-1638066135279.png]

What is clear is that the records grouped together under the same Tweet.id are not the same records, judging by the author id and, most importantly, by the text. I stored all of the ID variables as text upon reading in with the jstor application.

Is it conceivable that the id numbers have been truncated?  Tweet ids presumably are built in part from a timestamp, so that they are not likely to be consecutive numbers.

 

Accepted Solutions
Craige_Hales
Super User

Re: Tweets in JMP: Duplicate tweet.id numbers that are not duplicate tweets

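Maybe this: https://developer.twitter.com/en/docs/twitter-ids

If you keep the ID in a numeric variable, you will lose some of the 64-bit integer data, because a double has only 53 bits of significand precision. Integers above 2^53 (about 9.0 × 10^15) are not all exactly representable, so nearby IDs can round to the same value.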
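
A minimal sketch of the effect, in Python rather than JSL, assuming the IDs pass through a double-precision numeric type at some point. The three IDs below are hypothetical values built around 2^60 for illustration, not real tweet IDs.

```python
# Hypothetical 64-bit IDs that differ only in their lowest bits.
# 2**60 is chosen because it is exactly representable as a double,
# which makes the rounding easy to reason about.
BASE = 2**60
ids = [BASE + 1, BASE + 2, BASE + 3]

# Simulate storing the IDs in a numeric (double-precision) column.
as_double = [float(i) for i in ids]

# Simulate storing the IDs in a text column.
as_text = [str(i) for i in ids]

distinct_exact = len(set(ids))         # distinct IDs as exact integers
distinct_double = len(set(as_double))  # distinct IDs after conversion to double
distinct_text = len(set(as_text))      # distinct IDs when kept as text

# 2**53 + 1 is the first positive integer a double cannot represent exactly.
collides_at_2_53 = float(2**53) == float(2**53 + 1)

print(distinct_exact, distinct_double, distinct_text, collides_at_2_53)
# prints: 3 1 3 True
```

Near 2^60 adjacent doubles are 256 apart, so every ID in such a window collapses to one value. If the duplicates look like this, the IDs were probably converted to numeric somewhere before they were stored as text.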

Craige
