Streaming Twitter data into a JMP data table
Scroll to the end if you just want the video.
I've done some minimal testing of this JSL on Mac and Win, in JMP 11 and 12. The video takes advantage of features in a future version of JMP, but don't let that stop you. Add a comment here on your experience. The JSL is in the attached ZIP file.
This post touches, lightly, a bunch of topics.
- Twitter account
- streaming data
- bit manipulation
- JSON
- OAuth authorization, signing
- cURL
- map coordinates
Overview: unzip the JSL into a directory. Create a Twitter app and copy its keys into a JSL file. (On Windows) get a copy of cURL and fill in its path in MAIN.jsl. Run the MAIN.jsl file.
Begin
You’ll need a Twitter account, and you need to read and understand the Twitter Terms of Use before you start. (The data you collect in a stream does not belong to you and comes with restrictions on how you can use it. I won’t give you advice on what’s OK or not.)
Using your Twitter account, use https://apps.twitter.com/ to create a new application. This is NOT an app like a cell phone has. It is a set of keys that lets a program access the Twitter stream. That program might be an app on a cell phone, or it might be JSL, or a Python program like http://www.tweepy.org/ provides. (I started with tweepy and then went down the JSL path. Tweepy, or another Python-based solution, might be better than this JSL solution.)
Click New to make a new application, make up a name no one else has used, add a description, fill in your web site name, and leave the callback URL blank. You have to supply a web site name, but, as the form suggests, you can use a placeholder. Read the agreement, then click Create Twitter Application.
You landed on a screen with tabs; the third tab, Keys and Access Tokens, is the one you want.
At the top you’ll find the first part of what you need: the Consumer Key and Consumer Secret. Go ahead and start your JMP project; there will be several files, so make a directory, unzip the JSL files, and edit EncryptedKeys.JSL, filling it in:
// use apps.twitter.com to create a new application. Copy the keys here.
consumerKey = "Pnb…LAZ";
consumerSecret = "AF4C…………………………qNvXp";
Click the button Create my access token at the bottom to get the last two values.
Copy them in; the Access Token is all on one line, with a hyphen in the middle:
accessToken = "3…………………………-………………………………………………am";
accessTokenSecret = "rc………………………HJ";
EncryptedKeys.JSL has five lines, including the comment. It will be included by another file, so consider the security of the computer you store EncryptedKeys.JSL on. Two of the keys are considered public, but two are secrets that only you should know. Since no one else needs the public keys, all four can stay in one place.
Visit https://apps.twitter.com/ and make sure the app exists; then you can log out of Twitter.
Twitter Streaming
Twitter offers other ways to use its service (https://dev.twitter.com/) that allow retrieving specific tweets and creating new tweets, but streaming is about collecting real-time tweets through a filter. The filter Twitter applies is designed to be efficient for their computers and may not do exactly what you want (remember how much you are paying them). The most interesting queries are for keywords and geographic areas; the result is the union (not the intersection) of all the search terms. The example video used “breakfast”, “lunch”, and “dinner” without a geographic bounding box. World-wide, that retrieved about 3 tweets per second. Because the terms are ORed, it isn’t possible to narrow a keyword search to London by adding a bounding box around London; that query returns all of the tweets from London, plus breakfast, lunch, and dinner tweets from around the world. I’m still unclear on how much data Twitter will stream; at one point I thought Twitter throttled back, and at another point I thought Twitter stopped the stream when I couldn’t process the data fast enough. Maybe both are true. In a previous experiment, I selected for “Raleigh” with a bounding box around Raleigh, NC, and got tweets about stolen Raleigh bicycles from around the world.
Caution
You’ll need to do your own post-capture filtering, and you may want to build your own list of words you don’t like. If you present a list of uncensored tweets in public, you’ll be embarrassed, or worse. Also remember Twitter has rules for how you can present the data.
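As a sketch of what that post-capture filtering might look like (in Python rather than JSL, and with a made-up function name and placeholder word list; you would maintain your own list):

```python
# Hypothetical post-capture filter: drop tweets whose text contains
# any word from a blocklist. The entries below are placeholders.
BLOCKLIST = {"badword1", "badword2"}

def is_clean(text):
    """Return True if no blocklisted word appears in the tweet text."""
    words = {w.strip(".,!?#@").lower() for w in text.split()}
    return BLOCKLIST.isdisjoint(words)
```

A real version would also need to handle the many creative spellings people use, which is why building the list is an ongoing chore.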
Geographic Selection
This snippet of JSL picks query terms and geographic locations. Because the code was getting complicated, I started using namespaces:
include("PICK.jsl");
PICK:trackParm = "lunch,dinner,breakfast"; // default query
PICK:pick(); // GUI to adjust query
The Add Rectangle button added two rectangles. The values in the PICK namespace will be used later to build the Twitter query. There is a small amount of “and” capability: "eat lunch,eat dinner,eat breakfast" would select tweets containing both “eat” and a meal word. But mostly everything is ORed (union).
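For the curious, the comma-as-OR, space-as-AND semantics of the track parameter can be mimicked in a few lines. This hypothetical Python function is only an illustration of the matching rule, not part of the attached scripts:

```python
def matches_track(text, track):
    """Mimic Twitter's track filter: comma-separated phrases are ORed,
    and the space-separated words inside one phrase are ANDed."""
    words = text.lower().split()
    for phrase in track.split(","):
        if all(w in words for w in phrase.lower().split()):
            return True
    return False
```

So "eat lunch,eat dinner" matches a tweet containing both "eat" and "lunch", but not one containing "lunch" alone.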
More includes
include("JSON.jsl"); // subset of Xan's JSON library to parse a tweet
include("BITS.jsl"); // code for bit manipulation
include("AUTH.jsl"); // code for oauth
include("SIGN.jsl"); // code for signing
include("PARM.jsl"); // code for building the http parm for curl
These includes load a bunch of JSL and run some small sanity checks and self-tests. BITS takes a minute to initialize data structures that make later code faster; be patient while the +++ prints to the log window. JSON is taken from work Xan did earlier; the Twitter stream arrives in JSON format. AUTH and SIGN contain JSL that uses the EncryptedKeys to authenticate (“logon”) to Twitter. It’s complicated, and it uses an external program already installed on your computer to do the SHA1 hash. When I wrote the bit functions, I thought I would have to implement the SHA1 hash myself, but that is complicated too, and I’m glad to use one that doesn’t have to be debugged. The last include, PARM, builds a string for cURL.
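If you're curious what the signing step amounts to, here is a minimal Python sketch of the OAuth 1.0a HMAC-SHA1 signature. This is just the hash-and-encode core; a real request also carries a nonce, a timestamp, and the other oauth_* parameters in the signed parameter set, which is where the complexity (and the debugging pain) lives:

```python
import base64, hashlib, hmac, urllib.parse

def oauth_signature(method, url, params, consumer_secret, token_secret):
    """HMAC-SHA1 signature as OAuth 1.0a defines it:
    percent-encode and sort the parameters, build the base string,
    then sign it with the two secrets joined by '&'."""
    enc = lambda s: urllib.parse.quote(s, safe="")
    param_str = "&".join(f"{enc(k)}={enc(v)}" for k, v in sorted(params.items()))
    base = "&".join([method.upper(), enc(url), enc(param_str)])
    key = f"{enc(consumer_secret)}&{enc(token_secret)}".encode()
    digest = hmac.new(key, base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()
```

The output is a deterministic 28-character base64 string; change any parameter or either secret and the signature changes completely, which is the whole point.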
cURL
You’ll need a copy of cURL. It already exists on the Mac. Windows folks can get it from http://curl.haxx.se/ after checking with your systems people first; https://en.wikipedia.org/wiki/CURL is a starting point for your cURL questions. I downloaded the Win64 x86_64 CAB (your computer is almost certainly x86_64, not IA64). Drag and drop curl.exe onto the desktop and make sure JMP knows where to find it (RunProgram won’t expand $DESKTOP, so fill in the complete path!):
cURLlocation="C:\Users\User\Desktop\curl.exe";
cURL is a command line utility JMP can access through the RunProgram function. The cURL program will make the connection to Twitter and feed the stream of tweets back to JMP.
RunProgram(
	executable( cURLlocation ),
	options( cURLparms ),
	readFunction(
		Function( {this},
			buftxt ||= this << read( blob ); // append the newest bytes from cURL
			process(); // move complete tweets from buftxt into the data table
			If( this << isReadEOF, // stream ended: show the last 100 characters
				Write( Right( Char( buftxt ), 100 ) )
			);
		)
	)
);
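(If you're more at home in Python, the equivalent read-as-it-arrives pattern uses subprocess. This sketch is only an illustration of the idea, not part of the attached scripts:)

```python
import subprocess

def stream_lines(cmd):
    """Run a command and yield its stdout line by line as it arrives,
    the way the JSL readFunction consumes cURL's output."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()
```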
The cURLparms and process() function (used above) are all that’s left to explain.
cURLparms comes from PARM:curlParm( PICK:trackParm, PICK:locationParm ), which uses PICK (way above) to figure out what terms and locations to track. The curlParm function in the PARM namespace uses the consumerKey...accessTokenSecret variables (way above) to do the signing/authorization for those specific terms and locations.
The process function takes data from buftxt and puts it in a data table.
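The idea inside process() — parse one JSON tweet and keep a few fields for the table — looks like this in Python. The field names are real Twitter v1.1 tweet fields; the helper itself is hypothetical:

```python
import json

def tweet_to_row(line):
    """Parse one streamed JSON tweet and pull out the columns the
    data table needs; return None for keep-alive blank lines."""
    line = line.strip()
    if not line:
        return None
    t = json.loads(line)
    lon = lat = None
    if t.get("coordinates"):  # exact point, present in only a few percent of tweets
        lon, lat = t["coordinates"]["coordinates"]  # GeoJSON order: lon, lat
    return {"created_at": t.get("created_at"),
            "text": t.get("text"),
            "lon": lon, "lat": lat}
```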
Run
Done! Put the attached files in the same directory with EncryptedKeys.JSL, open MAIN.jsl, and run it. The dialog will open; just press OK the first time (to use the default search that makes the map points show up), then try a few search terms, or a small location, or maybe both.
At the end of the JSL you’ll see a loop that restarts the RunProgram when it stops unexpectedly. It delays for 60 seconds to help prevent abusing the Twitter site if something goes wrong in the script. Internet connections might stay up for days or might fail several times an hour. It’s also possible for Twitter to halt the connection for various reasons. The for-loop is also limited to only restart a small number of times.
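That restart loop amounts to this pattern, sketched here in Python. The 60-second delay matches the script; the cap of 5 restarts is a placeholder for the "small number" the JSL uses:

```python
import time

def run_with_restarts(connect, max_restarts=5, delay=60):
    """Call connect() until it exits cleanly or the restart budget
    runs out; sleep between attempts to avoid hammering the server,
    like the 60-second Wait in the JSL loop."""
    for attempt in range(max_restarts):
        try:
            connect()
            return True        # clean exit: stop restarting
        except OSError:
            time.sleep(delay)  # connection dropped: pause, then retry
    return False               # budget exhausted
```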
The scripts are barely tested; ask if it doesn’t work, thanks!
The second data table and the map depend on the default breakfast/lunch/dinner values in the opening dialog. There is some hard-coded JSL that is looking for the breakfast/lunch/dinner tweets. You could re-work it or cut it out completely.
The STOP button at the bottom of the map will (probably) bring the script to a graceful stop. I fixed a bug for a future release that could prevent JMP on Mac from stopping gracefully; in JMP 11 or 12 you might need to force-quit.
Presentation (making a video from the data table)
After capturing the breakfast/lunch/dinner tweets for a while, there’s enough data to make a video, using ideas from several recent posts. For want of a better color-coding standard, use green for starting the day, blue for lunch time, and red for ending the day. Subset the rows to keep just the ones with location data. I used a formula to create xLon and yLat columns from either coordinates or the center of placeBounding_boxCoordinates:
The formula sets the value to missing if the coordinates are not available; only about 5% of the tweets have coordinates.
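In Python terms, the column formula amounts to something like this sketch, where bbox stands for the four-corner polygon Twitter sends as the place bounding box:

```python
def tweet_xy(coordinates, bbox):
    """xLon/yLat for one tweet: prefer the exact point, else the
    center of the place bounding box, else missing (None, None)."""
    if coordinates:
        return coordinates[0], coordinates[1]   # GeoJSON order: lon, lat
    if bbox:
        lons = [p[0] for p in bbox]
        lats = [p[1] for p in bbox]
        return sum(lons) / len(lons), sum(lats) / len(lats)
    return None, None
```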
This column formula made a column to categorize the tweets, but still different for each language:
Followed by this column formula to combine categories:
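Taken together, the two formulas amount to something like this Python sketch. The word lists here are examples only; the real column formulas cover more languages:

```python
# Hypothetical stand-in for the two column formulas: map a tweet's
# text to one combined meal code (1=breakfast, 2=lunch, 3=dinner).
MEAL_WORDS = {
    1: {"breakfast", "desayuno"},   # example words only
    2: {"lunch", "almuerzo"},
    3: {"dinner", "cena"},
}

def meal_code(text):
    words = set(text.lower().split())
    for code, vocab in MEAL_WORDS.items():
        if words & vocab:
            return code
    return None   # matched the stream but not a known meal word
```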
To speed up the generation of the video frames, load the subsetted data into matrices:
dt = open("$desktop/sha1/TwitterFeed3.jmp");
xLonVec=dt:xLon<<getAsMatrix;
yLatVec=dt:yLat<<getAsMatrix;
kindVec = dt:kindCode<<getAsMatrix;
timeVec = dt:dateUTC<<getAsMatrix;
close(dt,nosave);
Make a GraphBox to produce the frames:
New Window( "Meal Tweets", // the graph's marker size is used for the red/green/blue tweet locations
g = Graph Box( Frame Size( 1920, 900 ), <<Background Map( Images( "Detailed Earth", Transparency( 1 ) ), Boundaries( "World" ) ) , <<markersize(6))
);
There is more JSL to set the axes, add the map. Most of the work is done in this graphic script that processes the matrices loaded from the data table:
g[framebox( 1 )] << addGraphicsScript(
	transparency( .8 );
	For( idx = lowIndex, idx <= highIndex, idx++,
		Marker( Color State( Match( kindVec[idx], 1, "green", 2, "blue", 3, "red" ) ), {xLonVec[idx], yLatVec[idx]} )
	)
);
And the driver code that causes the map to redraw with different start and end times and captures a picture:
nframes = seconds * 30; // seconds at 30fps
start = Min( timeVec );
end = Max( timeVec );
duration = end - start;
delta = duration / nframes;
For( iframe = 0, iframe < nframes, iframe++,
	mindate = start + (iframe - 0) * delta - 2 * 3600; // back 2 hours
	maxdate = start + (iframe + 1) * delta + 3600; // forward 1 hour
	lowIndex = Floor( Interpolate( mindate, timeVec, 1 :: N Rows( timeVec ) ) );
	highIndex = Floor( Interpolate( maxdate, timeVec, 1 :: N Rows( timeVec ) ) );
	g << reshow;
	Wait( 1 );
	g[framebox( 1 )] << savepicture( "$temp/tweet" || Right( Char( iframe, 6 ), 6, "0" ) || ".png", "png" );
);
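The Interpolate trick — turning a time window into a row-index range against the sorted time vector — is really a binary search; in Python the same idea is a couple of bisect calls:

```python
import bisect

def window_indices(times, mindate, maxdate):
    """Given tweet timestamps sorted ascending, return the [lo, hi)
    index range of tweets inside the frame's time window, the way
    the JSL turns mindate/maxdate into lowIndex/highIndex."""
    lo = bisect.bisect_left(times, mindate)
    hi = bisect.bisect_right(times, maxdate)
    return lo, hi
```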
Eventually you’ll have the frames for producing a video. I used Blender (long learning curve, but powerful) to make this five-minute video, which has 377,765 tweets over 16 days that include the US Thanksgiving holiday. There were 8,187,114 tweets in all, but most did not include latitude and longitude data. If I do this again, I'll subset the data as it is collected; the 2.5GB full table is big.
https://www.youtube.com/watch?v=N20KsrmdgOU Maximize!