Week 28 code
Getting Started
Most of this document is based on code written by Michael Kearney demonstrating how to use rtweet.
access token/secret method
Replace the app name (New2021proj) with whatever you called your app when you created it for your API keys.
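The access token/secret method comes down to passing four credentials to create_token(). The following is a minimal sketch, assuming rtweet 0.7.x; the environment-variable names are hypothetical and are used here only so that the keys stay out of the script.

```r
# Hypothetical environment-variable names; set them in ~/.Renviron beforehand.
cred_names <- c("TWITTER_API_KEY", "TWITTER_API_SECRET",
                "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")
creds <- Sys.getenv(cred_names)  # named character vector, "" where unset

if (all(nzchar(creds))) {
  # create_token() builds the token and stores it for later rtweet calls
  token <- rtweet::create_token(
    app             = "New2021proj",  # replace with your own app name
    consumer_key    = creds[["TWITTER_API_KEY"]],
    consumer_secret = creds[["TWITTER_API_SECRET"]],
    access_token    = creds[["TWITTER_ACCESS_TOKEN"]],
    access_secret   = creds[["TWITTER_ACCESS_SECRET"]]
  )
} else {
  message("Missing credentials: ",
          paste(cred_names[!nzchar(creds)], collapse = ", "))
}
```

If any of the four values is missing, the sketch prints which ones and does not contact Twitter.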
Installing/loading packages & auth
If you haven’t already installed the rtweet package, install {rtweet} from CRAN now.
Then load the package:
library(rtweet)
## load any other packages you may need
library(dplyr)
library(maps)
library(ggplot2)
Make sure that your API keys are loaded. If you completed the authentication step above, run get_token(); it prints the stored token, which confirms that your keys are available.
<Token>
<oauth_endpoint>
request: https://api.twitter.com/oauth/request_token
authorize: https://api.twitter.com/oauth/authenticate
access: https://api.twitter.com/oauth/access_token
<oauth_app> New2021proj
key: rsSoV8bRT29xvOJR2k95gJ50t
secret: <hidden>
<credentials> oauth_token, oauth_token_secret
---
Searching for tweets with search_tweets
search_tweets()
Search for one or more keyword(s)
# A tibble: 100 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 319413… 13550189… 2021-01-29 05:04:24 eldanirive… "Ya … Twitt…
2 437533… 13550189… 2021-01-29 05:04:18 youlovebri… "jus… Twitt…
3 132092… 13550188… 2021-01-29 05:04:08 nuggetdepi… "qui… Twitt…
4 121522… 13550188… 2021-01-29 05:04:03 Usura_Tacos "中国、… Twitt…
5 164182… 13550187… 2021-01-29 05:03:50 Myrieth "Ya … Twitt…
6 265200… 13550187… 2021-01-29 05:03:50 kitsuruo "qui… Twitt…
7 193587… 13550187… 2021-01-29 05:03:43 AKs_tacos "この暴… Twitt…
8 118105… 13550187… 2021-01-29 05:03:42 MLAISSAG "De … Twitt…
9 217690… 13550187… 2021-01-29 05:03:41 carolvanes… "i w… Twitt…
10 309598… 13550187… 2021-01-29 05:03:34 carlosmayo… "Los… Twitt…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
If you search for multiple words, there is an implicit AND between them.
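The difference between a single keyword, an implicit AND, and an explicit OR is easiest to see in the query strings themselves. This is a sketch only: the search terms are assumptions, and run_search() is a hypothetical wrapper around search_tweets() that needs a valid token.

```r
# Query strings only; the search terms are illustrative assumptions.
q_single <- "tacos"                                         # one keyword
q_and    <- paste("data", "science")                        # space = implicit AND
q_or     <- paste(c("data", "science"), collapse = " OR ")  # explicit OR

# Hypothetical wrapper so every query is run the same way (needs a valid token)
run_search <- function(q, n = 100) rtweet::search_tweets(q, n = n)

# rt <- run_search(q_and)   # tweets containing both "data" and "science"
```

Because a space already means AND, OR has to be written out explicitly between the alternatives.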
# A tibble: 100 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 650513… 13550177… 2021-01-29 04:59:35 JohnMLatim… "@UR… Twitt…
2 870504… 13550175… 2021-01-29 04:58:41 AllisonFar… "You… Twitt…
3 127716… 13550135… 2021-01-29 04:42:50 hekticamer… "Onc… Twitt…
4 153680… 13550125… 2021-01-29 04:38:57 DennisMuir… "@Na… Twitt…
5 969688… 13550108… 2021-01-29 04:32:13 thwacked "@Co… Twitt…
6 476502… 13550099… 2021-01-29 04:28:47 Ryanlamber… "@Sa… Twitt…
7 112678… 13550094… 2021-01-29 04:26:50 trixasis2 "You… Twitt…
8 121632… 13550091… 2021-01-29 04:25:32 HornChick75 "@mi… Twitt…
9 200172… 13550039… 2021-01-29 04:04:40 TheGreatDa… "@Bu… Twitt…
10 987353… 13549967… 2021-01-29 03:36:24 QuoteTomCr… "@ta… Cheap…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
search for exact phrase
## single quotes around doubles
ds <- search_tweets('"data science"')
## or escape the quotes
ds <- search_tweets("\"data science\"")
ds
# A tibble: 100 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 106198… 13550189… 2021-01-29 05:04:21 VeritasArd… "Dem… Twitt…
2 353265… 13550189… 2021-01-29 05:04:19 bexxmodd "Put… BexxP…
3 108248… 13550189… 2021-01-29 05:04:15 epuujee "Put… Puuje…
4 108248… 13550132… 2021-01-29 04:41:43 epuujee "Eng… Puuje…
5 108248… 13550122… 2021-01-29 04:37:56 epuujee "[10… Puuje…
6 108248… 13550150… 2021-01-29 04:48:59 epuujee "Dem… Puuje…
7 108248… 13550106… 2021-01-29 04:31:19 epuujee "Hap… Puuje…
8 108248… 13550187… 2021-01-29 05:03:36 epuujee "Her… Puuje…
9 108248… 13550137… 2021-01-29 04:43:37 epuujee "AIC… Puuje…
10 108248… 13550089… 2021-01-29 04:24:42 epuujee "Can… Puuje…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
keywords and phrases
Search for keyword(s) and phrases
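A keyword and an exact phrase can be combined in one query by quoting only the phrase. The sketch below builds such a string; as_phrase() is a hypothetical helper and the terms are assumptions.

```r
# Wrap a phrase in double quotes so Twitter matches it exactly
as_phrase <- function(x) paste0('"', x, '"')

# keyword AND exact phrase (terms are assumptions)
q <- paste("rstats", as_phrase("data science"))

# With a valid token: rds <- rtweet::search_tweets(q)
```

Building the quoted part with a helper avoids having to escape the inner quotes by hand.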
# A tibble: 100 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 143562… 13550123… 2021-01-29 04:38:12 silentseaw… "Lin… "Twit…
2 134739… 13550106… 2021-01-29 04:31:30 CoderRetwe… "Inf… ""
3 134739… 13549967… 2021-01-29 03:36:17 CoderRetwe… "Dat… ""
4 134739… 13549166… 2021-01-28 22:18:01 CoderRetwe… "Lin… ""
5 134739… 13549228… 2021-01-28 22:42:27 CoderRetwe… "The… ""
6 134739… 13549681… 2021-01-29 01:42:42 CoderRetwe… "Top… ""
7 126705… 13550082… 2021-01-29 04:22:00 _codenewbi… "Inf… "Code…
8 126705… 13549174… 2021-01-28 22:21:00 _codenewbi… "Lin… "Code…
9 126705… 13549778… 2021-01-29 02:21:00 _codenewbi… "#Da… "Code…
10 105164… 13549780… 2021-01-29 02:21:52 Fabriciosx "⭕ I… "twit…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
increasing number of results
search_tweets() returns the 100 most recent matching tweets by default. Increase n to return more (tip: use multiples of 100).
# A tibble: 500 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 727469… 13550189… 2021-01-29 05:04:30 frontasal20 "@be… Twitt…
2 110307… 13550189… 2021-01-29 05:04:30 92089204a "【残り… Twitt…
3 757239… 13550189… 2021-01-29 05:04:28 beer_naabo "建売新… Twitt…
4 757239… 13550176… 2021-01-29 04:59:29 beer_naabo "じいち… Twitt…
5 757239… 13550179… 2021-01-29 05:00:32 beer_naabo "晴れて… Twitt…
6 757239… 13550182… 2021-01-29 05:01:50 beer_naabo "ブーブ… Twitt…
7 517559… 13550182… 2021-01-29 05:01:40 Kirin_Brew… "@ma… Belug…
8 517559… 13550172… 2021-01-29 04:57:34 Kirin_Brew… "@ar… Belug…
9 517559… 13550174… 2021-01-29 04:58:21 Kirin_Brew… "@ma… Belug…
10 517559… 13550188… 2021-01-29 05:04:10 Kirin_Brew… "@ya… Belug…
# … with 490 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
Please be mindful that you have a rate limit of 18,000 tweets per fifteen minutes. This means a single search can pull at most that many, and further requests will return errors until the fifteen-minute window resets.
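The 18,000 figure follows from 180 search requests per fifteen-minute window at up to 100 tweets per request. The sketch below works through that arithmetic; windows_needed() and search_many() are hypothetical helpers, and search_many() relies on the retryonratelimit argument of search_tweets() in rtweet 0.7.x.

```r
tweets_per_request  <- 100   # maximum tweets returned by one API request
requests_per_window <- 180   # search requests allowed per 15-minute window
window_cap <- tweets_per_request * requests_per_window   # 18000

# Hypothetical helper: how many 15-minute windows a request for n tweets spans
windows_needed <- function(n) ceiling(n / window_cap)
windows_needed(50000)   # 3

# Hypothetical wrapper: only wait out the rate limit when n exceeds one window
search_many <- function(q, n) {
  rtweet::search_tweets(q, n = n, retryonratelimit = n > window_cap)
}
```

With retryonratelimit = TRUE the call pauses until the window resets, so a request for 50,000 tweets spans three windows.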
getting a lot more tweets
PRO TIP #1: Get the firehose for free by searching for tweets by verified or non-verified users
# A tibble: 2,889 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 131795… 13550189… 2021-01-29 05:04:36 z902pWuSsc… "@nu… Twitt…
2 488373… 13550189… 2021-01-29 05:04:36 Rumefeller "@2a… Twitt…
3 127828… 13550189… 2021-01-29 05:04:36 34gr_ "أخر… Twitt…
4 956079… 13550189… 2021-01-29 05:04:36 LesNadines "@BF… Twitt…
5 969528… 13550189… 2021-01-29 05:04:36 gurnd_blue "2C6… グランブル…
6 436210… 13550189… 2021-01-29 05:04:36 StahlAmy "On … Goodr…
7 788805… 13550189… 2021-01-29 05:04:36 GNSGRadio "Now… GNSG …
8 128852… 13550189… 2021-01-29 05:04:36 waengnamja "@JI… Twitt…
9 128863… 13550189… 2021-01-29 05:04:36 Seokjinnie… ".@B… Twitt…
10 132976… 13550189… 2021-01-29 05:04:36 shoakunoko… "トイレ… Twitt…
# … with 2,879 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
plotting tweets
Visualize second-by-second frequency
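A query that matches both verified and non-verified accounts matches every tweet, and ts_plot() can then bin the results by second. The following is a sketch under those assumptions; plot_firehose() is a hypothetical helper and needs a valid token when called.

```r
# Matches tweets from verified accounts OR from non-verified accounts
firehose_q <- "filter:verified OR -filter:verified"

# Hypothetical helper: pull up to n tweets and plot how many arrive per second
plot_firehose <- function(n = 18000) {
  fff <- rtweet::search_tweets(firehose_q, n = n)
  rtweet::ts_plot(fff, by = "secs")
}
```

Calling plot_firehose() with a smaller n is a quick way to check the query before using a whole rate-limit window.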
twitter search operators
You can combine any of the above commands to extract what you are searching for.
PRO TIP #2: Use search operators provided by Twitter, e.g.,
- filter by language and exclude retweets and replies
- filter only tweets linking to news articles
filtering in search_tweets
- filter only tweets that contain links
# A tibble: 100 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 816818… 13550190… 2021-01-29 05:05:02 MyYedammm "สรุ… Twitt…
2 129630… 13550190… 2021-01-29 05:05:02 s125osanm "@Ke… Twitt…
3 290155… 13550190… 2021-01-29 05:05:02 XO_jellyDO "최수종… Twitt…
4 124856… 13550190… 2021-01-29 05:05:02 fallingflo… "190… Twitt…
5 101297… 13550190… 2021-01-29 05:05:02 NorthAjith… "ஆகஸ… Twitt…
6 881903… 13550190… 2021-01-29 05:05:02 A_ightK "칠흑만… Twitt…
7 774234… 13550190… 2021-01-29 05:05:02 kmlsantos_ "#BT… Twitt…
8 855390… 13550190… 2021-01-29 05:05:02 Yeriel_hei "CIA… Twitt…
9 112664… 13550190… 2021-01-29 05:05:02 tongkonnee "อี้… Twitt…
10 109945… 13550190… 2021-01-29 05:05:02 Cerinn_n "อยา… Twitt…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
- filter only tweets that contain video
# A tibble: 82 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 118258… 13549601… 2021-01-29 01:10:40 xiaohuangs "Mir… Twitt…
2 366374… 13548703… 2021-01-28 19:13:58 RedoxRidwan "Tem… Twitt…
3 133365… 13548694… 2021-01-28 19:10:22 SlashyTiger "@re… Twitt…
4 129144… 13548674… 2021-01-28 19:02:17 Darasim984… "Tem… Twitt…
5 126710… 13548518… 2021-01-28 18:00:34 SBB_2K "New… Twitt…
6 126710… 13524191… 2021-01-22 00:53:48 SBB_2K "*IM… Twitt…
7 149048… 13548304… 2021-01-28 16:35:18 MrBtheNige… "Tem… Twitt…
8 287782… 13548170… 2021-01-28 15:42:11 mackeankan… "Tem… Twitt…
9 462344… 13548141… 2021-01-28 15:30:37 Kloinsoffi… "Tem… Twitt…
10 373163… 13548035… 2021-01-28 14:48:33 AbdulsamodA "Tem… Twitt…
# … with 72 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
tweets sent by screennames
- filter only tweets sent from:{screen_name} or to:{screen_name} certain users
## vector of screen names
users <- c("cnnbrk", "AP", "nytimes",
"foxnews", "msnbc", "seanhannity", "maddow")
## then use search_tweets
tousers <- search_tweets(paste0("from:", users, collapse = " OR "))
tousers
# A tibble: 100 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 2836421 13550183… 2021-01-29 05:02:07 MSNBC "\"T… Socia…
2 2836421 13549700… 2021-01-29 01:50:05 MSNBC ".@F… Socia…
3 2836421 13549899… 2021-01-29 03:09:14 MSNBC "TUN… Tweet…
4 2836421 13550149… 2021-01-29 04:48:41 MSNBC "Gam… Tweet…
5 2836421 13549765… 2021-01-29 02:16:04 MSNBC "Dem… Socia…
6 2836421 13549685… 2021-01-29 01:44:05 MSNBC "Som… Socia…
7 2836421 13550138… 2021-01-29 04:44:05 MSNBC "WAT… Wildm…
8 2836421 13549730… 2021-01-29 02:02:03 MSNBC "Dr.… Socia…
9 2836421 13550108… 2021-01-29 04:32:04 MSNBC "\"I… Socia…
10 2836421 13549830… 2021-01-29 02:41:43 MSNBC "Liv… Wildm…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
searching only verified accounts
- filter only tweets with at least 100 favorites or 100 retweets
- filter by the type of device that posted the tweet.
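Several of the operators listed above are plain text fragments that go into the query string. The sketch below collects some of them in a named vector; with_ops() is a hypothetical helper, the keyword is an assumption, and the min_faves: and min_retweets: operators should be checked against Twitter's current documentation before you rely on them.

```r
# Named query fragments; combine them with a keyword as needed.
ops <- c(
  english     = "lang:en",
  no_retweets = "-filter:retweets",
  no_replies  = "-filter:replies",
  news        = "filter:news",
  links       = "filter:links",
  verified    = "filter:verified",
  popular     = "(min_faves:100 OR min_retweets:100)"
)

# Hypothetical helper: join a keyword with the chosen operators
with_ops <- function(keyword, use) paste(c(keyword, ops[use]), collapse = " ")

q <- with_ops("rstats", c("english", "no_retweets", "no_replies"))
# q is "rstats lang:en -filter:retweets -filter:replies"

# With a valid token: rt <- rtweet::search_tweets(q)
```

Keeping the fragments in one place makes it easier to combine filters without retyping the operator syntax.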
search_tweets() with location
Search by geolocation (ex: tweets within 25 miles of Durham University)
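A radius search passes search_tweets() a geocode string of the form "latitude,longitude,radius". The sketch below builds that string from the Durham coordinates used later in this document; the "lang:en" query and the search_near_durham() helper are assumptions.

```r
# Coordinates of Durham (decimal degrees) and the search radius
ylat   <- 54.7649859
xlong  <- -1.5803916
radius <- "25mi"   # miles ("mi") or kilometres ("km")

# search_tweets() expects geocode as "latitude,longitude,radius"
durham_geocode <- paste(ylat, xlong, radius, sep = ",")

# Hypothetical helper (needs a valid token)
search_near_durham <- function(q = "lang:en", n = 500) {
  rtweet::search_tweets(q, geocode = durham_geocode, n = n)
}

# durham25 <- search_near_durham()
```

Note that latitude comes first in the geocode string, even though plotting functions usually take longitude (x) first.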
# A tibble: 500 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 745272… 13550190… 2021-01-29 05:04:49 PoetDeanwi… "@BT… Twitt…
2 108111… 13550190… 2021-01-29 05:04:46 layzhangbe… "htt… Twitt…
3 189566… 13550189… 2021-01-29 05:04:36 workidabbz "@Ce… Twitt…
4 189566… 13550078… 2021-01-29 04:20:23 workidabbz "@hi… Twitt…
5 189566… 13550136… 2021-01-29 04:43:19 workidabbz "@Ce… Twitt…
6 189566… 13550185… 2021-01-29 05:03:00 workidabbz "@_e… Twitt…
7 102308… 13550189… 2021-01-29 05:04:26 IRISHPACER_ "Can… Twitt…
8 102308… 13550189… 2021-01-29 05:04:26 IRISHPACER_ "Ims… Twitt…
9 134901… 13550188… 2021-01-29 05:04:06 Nichola620… "@ho… Twitt…
10 134243… 13550188… 2021-01-29 05:03:58 bibenson2 "Dav… Twitt…
# … with 490 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
mapping geotagged tweets
Use lat_lng() to convert geographical data into lat and lng variables (single point)
setting up some basic parameters
I used Google Maps to get the Lat/Long of Durham. I set these as variables so that I could later just pull from them.
# lat and long of Durham
xlong <- -1.5803916
ylat <- 54.7649859
# Where in the maps database is this lat and long? (create a variable for this)
region <- map.where(database = "world", xlong, ylat)
Mapping the geotagged tweets
#create lat/lng variables using all available tweet and profile geo-location data
durham25 <- lat_lng(durham25)
# notice how I use the region variable I created above and add to the xlong/ylat variables to set my extents?
maps::map("world", regions = region, fill = TRUE, col = "#ffffff", lwd = .25, mar = c(0, 0, 0, 0), xlim = c(xlong - 5, xlong + 5), ylim = c(ylat - 5, ylat + 5))
with(durham25, points(lng, lat, pch = 20, col = "red"))
This code plots geotagged tweets within 25 miles of Durham on a map of the UK.
Please note that if you were making a map of the United States, maps::map() has 3 databases for the USA and only one for “world”; see help(package='maps') for more details.
searching in an entire country
Search by geo-location; for example, find 10,000 tweets in the English language sent from the United States. Note: some countries and cities are hardcoded in lookup_coords(), while for other places lookup_coords() requires that you have a Google API key.
Search for 5,000 tweets in English, sent from the US
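A country-wide search can reuse the geocode argument, this time with coordinates from lookup_coords(). This is a minimal sketch, assuming rtweet 0.7.x, where "usa" is one of the built-in places; search_country() is a hypothetical helper and needs a valid token when called.

```r
# Hypothetical helper: n English-language tweets located in a given place
search_country <- function(place = "usa", n = 5000) {
  coords <- rtweet::lookup_coords(place)   # "usa" is built in, no Google key
  rtweet::search_tweets("lang:en", geocode = coords, n = n)
}

# usa <- search_country()
```

For a place that is not built in, lookup_coords() would also need a Google API key passed through its apikey argument.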
These tweets are all geotagged. We’ll discuss more about geographic identifiers later.
Week 29: more on the Twitter API
Other things we can collect
Last week we discussed ways to use the search_tweets() function to query the Twitter API. search_tweets() only allows you to go backward in time, and it can only collect data from Twitter’s Tweet object model.
Depending on what you are trying to collect, you may need to try one of these other functions.
Plotting those
Group by screen_name and plot hourly frequencies of tweets.
Remember that %>% is a pipe and may not work if you have not loaded the dplyr library.
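The plotting chunk itself is missing here. As a rough sketch, assuming rtweet 0.7.x and a data frame of tweets with screen_name and created_at columns (for example, the result of get_timelines()), it could look like the following. The function name plot_hourly() is made up for this example.

```r
library(dplyr)   # provides %>% and group_by()

# Hypothetical helper: one line per account, tweets counted per hour.
# `tweets` is assumed to be a data frame returned by an rtweet function.
plot_hourly <- function(tweets) {
  tweets %>%
    group_by(screen_name) %>%
    rtweet::ts_plot(by = "hours")
}

# Usage, once you have some timelines in memory:
# plot_hourly(unis)
```

ts_plot() returns a ggplot2 object, so you can add ggplot2 layers (titles, themes) to the result.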
User favorites with get_favorites()
Get up to the most recent 3,000 tweets favorited/liked by a user. These are the posts a user has clicked the heart button on.
Please note that it has to be the US spelling of “favorites”.
Lookup statuses with lookup_tweets()
If you look in any of the dataframes in your ‘global environment’ (such as dugeog_fav or unis), you’ll notice that one of the attributes you’ve been pulling is the status_id of the Tweet.
This is like the identification number (or phone number) of that particular tweet.
Getting the users’ network
Friends/followers
Twitter’s API documentation distinguishes between friends and followers.
- Friend refers to an account a given user follows
- Follower refers to an account following a given user
Pulling a user’s friends with get_friends()
Get user IDs of accounts followed by (AKA friends) [@jack](https://twitter.com/jack), the co-founder and CEO of Twitter.
# A tibble: 4,538 x 2
user user_id
<chr> <chr>
1 jack 1354898820400877571
2 jack 27058194
3 jack 257436924
4 jack 1282418324228337665
5 jack 2190757022
6 jack 3291691
7 jack 14918591
8 jack 14115083
9 jack 38113183
10 jack 536429909
# … with 4,528 more rows
multiple users’ friends
Get friends of multiple users in a single call.
# A tibble: 4,433 x 2
user user_id
<chr> <chr>
1 durham_uni 270004438
2 durham_uni 34918353
3 durham_uni 4861601645
4 durham_uni 2330017078
5 durham_uni 14494181
6 durham_uni 1122968208389099520
7 durham_uni 1016236315845824512
8 durham_uni 57645871
9 durham_uni 186104486
10 durham_uni 20324317
# … with 4,423 more rows
get_followers()
Get user IDs of accounts following (AKA followers) [@GeogDurham](https://twitter.com/GeogDurham).
# A tibble: 3,554 x 1
user_id
<chr>
1 3165094085
2 1354923680992792587
3 1296509996092588034
4 275015329
5 238203281
6 732612019283709954
7 316210273
8 1354503938628837376
9 835404726
10 1354381953848512515
# … with 3,544 more rows
large numbers of followers
get_followers()
Unlike friends (limited by Twitter to 5,000), there is no limit on the number of followers.
To get user IDs of all 64(ish) million followers of Justin Timberlake (@jtimberlake), you need two things:
- A stable internet connection
- Time: roughly a week or more, because of Twitter’s rate limits
It’s probably not a good idea to harvest an account like @jtimberlake unless you really need it for research.
But here is how you would do it anyway.
Get all of Justin Timberlake’s followers.
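The chunk for this is not shown, so here is a hedged sketch for rtweet 0.7.x. The time estimate rests on an assumption about the rate limit (5,000 IDs per request and 15 requests per 15-minute window); under that assumption the harvest takes a bit under nine days, the same ballpark as the figure above. The wrapper name get_all_followers() is hypothetical.

```r
# Assumption: the followers endpoint hands back up to 5,000 IDs per request
# and allows 15 requests per 15-minute window, i.e. 75,000 IDs per window.
ids_per_window  <- 5000 * 15
followers_total <- 64e6          # "64(ish) million", from the text above

windows <- ceiling(followers_total / ids_per_window)  # number of 15-minute windows
days    <- windows * 15 / 60 / 24                     # minutes -> hours -> days
round(days, 1)

# Hypothetical wrapper so nothing runs until you call it.
# retryonratelimit = TRUE tells rtweet to wait out each rate-limit window.
get_all_followers <- function(user, n) {
  rtweet::get_followers(user, n = n, retryonratelimit = TRUE)
}

# Usage (token and stable connection required; expect this to run for days):
# jt_followers <- get_all_followers("jtimberlake", n = followers_total)
```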
Lookup users
The lookup_users() function of rtweet looks at the users’ profile location by pulling from the user object of the Twitter API.
Look up user-level (and most recent tweet) information associated with a vector of user_id or screen_name values (you can use either).
## vector of users
users <- c("durham_uni", "NCL_Geography", "oiioxford")
## lookup users twitter data
usr <- lookup_users(users)
usr
# A tibble: 3 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 277391… 13548366… 2021-01-28 17:00:07 durham_uni "📢 W… "Falc…
2 222989… 13548268… 2021-01-28 16:21:07 NCL_Geogra… "Has… "Twit…
3 499764… 13549006… 2021-01-28 21:14:18 oiioxford "Eth… "Twit…
# … with 84 more variables: display_text_width <int>, reply_to_status_id <lgl>,
# reply_to_user_id <lgl>, reply_to_screen_name <lgl>, is_quote <lgl>,
# is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
# quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
# urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <chr>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
# quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
If you look through the output of usr, you’ll note that it pulls 90 different variables for the user, including their profile location.
search for users
Just as we searched for tweets at the beginning with the search_tweets() function, we can use the search_users() function in much the same way. Twitter will look for matches in user names, screen names, and profile bios.
# A tibble: 100 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 6017542 13549545… 2021-01-29 00:48:35 BreakingNe… "Leg… "Twee…
2 5402612 13549135… 2021-01-28 22:05:42 BBCBreaking "Cov… "Soci…
3 428333 13549553… 2021-01-29 00:51:43 cnnbrk "Cic… "Soci…
4 173187… 13548627… 2021-01-28 18:43:52 NationBrea… "LAI… "Twee…
5 923263… 13017168… 2020-09-04 03:00:50 ftbreaking… "Chi… "Soci…
6 189155… 13549854… 2021-01-29 02:51:27 ChicagoBre… "Bla… ""
7 981100… 13549575… 2021-01-29 01:00:38 MirrorBrea… "Pio… "Twee…
8 141387… 13547825… 2021-01-28 13:24:56 TelegraphN… "🚨 I… "Echo…
9 874167… 13549048… 2021-01-28 21:30:53 SkyNewsBre… "Pri… "Twee…
10 386230… 13550030… 2021-01-29 04:01:23 gmanewsbre… "Vac… "Twee…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, quote_count <int>,
# reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
# media_t.co <list>, media_expanded_url <list>, media_type <list>,
# ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
# ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
# lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
# quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
# quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
# quoted_name <chr>, quoted_followers_count <int>,
# quoted_friends_count <int>, quoted_statuses_count <int>,
# quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
# retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
# retweet_source <chr>, retweet_favorite_count <int>,
# retweet_retweet_count <int>, retweet_user_id <chr>,
# retweet_screen_name <chr>, retweet_name <chr>,
# retweet_followers_count <int>, retweet_friends_count <int>,
# retweet_statuses_count <int>, retweet_location <chr>,
# retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
# place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
# country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
# description <chr>, url <chr>, protected <lgl>, followers_count <int>,
# friends_count <int>, listed_count <int>, statuses_count <int>,
# favourites_count <int>, account_created_at <dttm>, verified <lgl>,
# profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
# profile_banner_url <chr>, profile_background_url <chr>,
# profile_image_url <chr>
Lists
In my opinion, Lists are not used all that often, and you don’t necessarily need this, so feel free to skip this one…
using lists_memberships()
- Get an account’s list memberships (lists that include an account). Sorry I can’t think of a British account to put here… You all probably don’t read Nate Silver’s blog, fivethirtyeight.com, but it really is excellent at using statistics to explain and predict politics, sports, and science.
# A tibble: 200 x 11
list_id name uri subscriber_count member_count mode description slug
<chr> <chr> <chr> <int> <int> <chr> <chr> <chr>
1 135494… US E… /zmi… 0 13 publ… "" us-e…
2 135493… Poli… /Dis… 0 91 publ… "" poli…
3 135488… News… /sco… 0 40 publ… "" news…
4 135486… inte… /Lut… 0 7 publ… "Interesti… inte…
5 135484… Poli… /tim… 0 48 publ… "" poli…
6 135482… Poli… /Jon… 0 29 publ… "All thing… poli…
7 135482… Poli… /eva… 0 1 publ… "" poli…
8 135445… covi… /pis… 0 34 publ… "" covi…
9 135435… USA … /DMC… 0 12 publ… "" usa-…
10 135426… 538 /par… 0 3 publ… "" 538-…
# … with 190 more rows, and 3 more variables: full_name <chr>,
# created_at <dttm>, following <lgl>
lists_members()
- Get all list members (accounts on a list)
You can refer to a list either by its name or by its list_id, but you have to have at least one of those.
slug is the name that displays on Twitter. You can identify a list by its slug instead of its numerical id. If you decide to do so, note that you’ll also have to specify the list owner using the owner_id or owner_user parameters.
## all members of a human geography list
cng <- lists_members(owner_user = "criticalens", slug = "Human-Geo")
cng
# A tibble: 50 x 40
user_id name screen_name location description url protected
<chr> <chr> <chr> <chr> <chr> <chr> <lgl>
1 128167… GSAG… gsag_aag "" "Graduate … http… FALSE
2 119332… The … theblackgeo "" "A project… http… FALSE
3 988284… Dial… DialoguesHG "" "Dialogues… http… FALSE
4 959102… Urba… GenUrbNetw… "Toront… "A SSHRC-f… http… FALSE
5 931286… ani … ani_Landau… "Naarm … "Researchi… <NA> FALSE
6 890963… Lati… LatinxGeog "" "A group o… http… FALSE
7 869604… AAG … QTGAAG "The Sp… "Twitter a… http… FALSE
8 867451… Plac… _pscollect… "" "The Place… http… FALSE
9 861906… Mara… mara_ferre… "London… "Urban & h… http… FALSE
10 846396… Arri… Arrianna_P… "" "Medical G… http… FALSE
# … with 40 more rows, and 33 more variables: followers_count <int>,
# friends_count <int>, listed_count <int>, created_at <dttm>,
# favourites_count <int>, utc_offset <lgl>, time_zone <lgl>,
# geo_enabled <lgl>, verified <lgl>, statuses_count <int>, lang <lgl>,
# contributors_enabled <lgl>, is_translator <lgl>,
# is_translation_enabled <lgl>, profile_background_color <chr>,
# profile_background_image_url <chr>,
# profile_background_image_url_https <chr>, profile_background_tile <lgl>,
# profile_image_url <chr>, profile_image_url_https <chr>,
# profile_banner_url <chr>, profile_link_color <chr>,
# profile_sidebar_border_color <chr>, profile_sidebar_fill_color <chr>,
# profile_text_color <chr>, profile_use_background_image <lgl>,
# has_extended_profile <lgl>, default_profile <lgl>,
# default_profile_image <lgl>, following <lgl>, follow_request_sent <lgl>,
# notifications <lgl>, translator_type <chr>
Streaming tweets
This one is really important. This is likely what you will use most to capture an emerging issue.
This is what I use the most. It uses Twitter’s streaming API to just listen and harvest tweets as they happen.
Please note: it cannot go back in time and retrieve tweets that have already happened; it can only go forward. But it’s very powerful, as you can target it specifically and set it up to capture your data for you.
using stream_tweets()
for a small random sample
Sampling: small random sample (~ 1%) of all publicly available tweets
The default is that this will time out after 30 seconds of streaming. We’ll adjust that later.
Filtering: search-like query (up to 400 keywords)
Keep in mind that this works like an implicit OR between the terms: it will harvest anything containing any one of them. These could be hashtags or words in the tweet itself.
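As a small sketch of the filtering case (assuming rtweet 0.7.x, where the terms go into a single comma-separated string): the three terms below are purely illustrative, and stream_keywords() is a hypothetical wrapper so that nothing streams until you call it.

```r
# Illustrative terms only; swap in your own hashtags or keywords.
terms <- c("durham", "newcastle", "#geography")

# The text above gives 400 as the upper limit on keywords.
stopifnot(length(terms) <= 400)

# stream_tweets() expects the terms as one comma-separated string;
# the commas act as OR.
q <- paste(terms, collapse = ",")
q

# Hypothetical wrapper around rtweet::stream_tweets().
stream_keywords <- function(q, timeout = 30) {
  rtweet::stream_tweets(q, timeout = timeout)
}

# Usage (needs a token and a network connection):
# kw <- stream_keywords(q)
```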
streaming from users
You could stream from specific users. You could list out all the users that you want to stream from as a vector.
Tracking: vector of user ids (you could have up to 5,000 user_ids)
As the default is 30 seconds, this doesn’t really work unless they tweet within the 30 seconds. Unless this is for someone like Donald Trump, this is pretty useless.
streaming tweets from a geographic area
Location: geographical coordinates (1-360 degree location boxes)
Let’s say you wanted to limit your stream to a specific city or geographic area. Perhaps you were interested in a local event that was going to happen. You would draw a bounding box on the earth and stream anything within that geographic window.
From OSM’s wiki: A bounding box (usually shortened to bbox) is an area defined by two longitudes and two latitudes, where:
Latitude is a decimal number between -90.0 and 90.0. Longitude is a decimal number between -180.0 and 180.0.
They usually follow the standard format of:
left,bottom,right,top = min Longitude, min Latitude, max Longitude, max Latitude
For example, Greater London is enclosed by: -0.489,51.28,0.236,51.686
Let’s start by streaming all the tweets for the whole world
Please note that this returns a smaller number of tweets than the open stream you did before, because many people have privacy settings that prevent you from identifying their location.
Some folks made a handy tool to draw bounding boxes on earth and get the coordinates. Visit https://boundingbox.klokantech.com/ and put “csv” in the drop down. Draw and move the rectangle that you would want to harvest. Copy the line of coordinates it shows in the ‘CSV:’ window.
For the random square I drew over the UK, Europe, and parts of North Africa: -12.1,17.8,42.0,58.2
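A minimal sketch of how that copied CSV line could be turned into something stream_tweets() accepts (assuming rtweet 0.7.x, which takes a numeric vector of length 4 in left, bottom, right, top order). parse_bbox() is a hypothetical helper written for this example, not an rtweet function, and it does not handle boxes that cross the antimeridian.

```r
# Hypothetical helper: turn "left,bottom,right,top" text into a checked
# numeric vector. Stops with an error if the box is malformed.
parse_bbox <- function(csv) {
  bbox <- as.numeric(strsplit(csv, ",", fixed = TRUE)[[1]])
  stopifnot(
    length(bbox) == 4, !anyNA(bbox),
    bbox[1] >= -180, bbox[3] <= 180, bbox[1] < bbox[3],  # longitudes
    bbox[2] >= -90,  bbox[4] <= 90,  bbox[2] < bbox[4]   # latitudes
  )
  names(bbox) <- c("left", "bottom", "right", "top")
  bbox
}

# The box drawn over the UK, Europe, and parts of North Africa
bbox <- parse_bbox("-12.1,17.8,42.0,58.2")
bbox

# Usage (needs a token and a network connection):
# euro <- rtweet::stream_tweets(unname(bbox), timeout = 30)
```

The order check is worth having: a box whose left edge is east of its right edge (an easy mistake when a minus sign goes missing) fails here instead of silently returning nothing.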
Please note that if this area is too small and nobody is tweeting during the time you are streaming, there will be nothing returned.
Again, note that this is only streaming for 30 seconds, so unless the area is huge or there are some very active users that geotag their tweets, this isn’t going to pull up much information.
(Side note: I’m also creating this in the middle of the night in Europe/UK/North Africa… when people are more likely to be sleeping than tweeting.)
finding cities with lookup_coords()
A useful convenience function–though it now requires a Google Maps API key–for quickly looking up coordinates. You have to store a credit card number for most Google Maps API key functions, so we’re not going to do that.
To enable basic uses of ‘lookup_coords()’ without requiring a Google Maps API key, a number of major cities throughout the world, plus ‘world’ and ‘usa’, are baked into this function. If ‘world’ is supplied, then a bounding box of maximum latitude/longitude values, i.e., c(-180, -90, 180, 90), and a center point c(0, 0) are returned. If ‘usa’ is supplied, then estimates of the United States’ bounding box and mid-point are returned.
To specify a city, provide the city name followed by a space and then the US state abbreviation or country name. To see a list of all included cities, enter rtweet:::citycoords in the R console to see coordinates data.
Let’s try that:
# A tibble: 747 x 3
city lat lng
<chr> <dbl> <dbl>
1 aberdeen scotland 57.2 -2.15
2 aberdeen 57.2 -2.15
3 aberdeen scotland 57.2 -2.15
4 adelaide australia -34.9 139.
5 adelaide -34.9 139.
6 adelaide australia -34.9 139.
7 algiers algeria 36.8 3
8 algiers 36.8 3
9 algiers algeria 36.8 3
10 amsterdam netherlands 52.4 4.88
# … with 737 more rows
For example, if you wanted to just stream tweets from London, it would look like this:
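The chunk is missing here, so this is only a sketch, assuming rtweet 0.7.x and assuming that "london" is one of the baked-in cities (check rtweet:::citycoords to confirm the exact spelling). stream_city() is a hypothetical wrapper name.

```r
# Hypothetical wrapper: look up the baked-in bounding box for a city and
# stream geotagged tweets from inside it.
# Assumption: `city` matches an entry in rtweet:::citycoords; otherwise
# lookup_coords() will ask for a Google Maps API key.
stream_city <- function(city = "london", timeout = 30) {
  rtweet::stream_tweets(rtweet::lookup_coords(city), timeout = timeout)
}

# Usage (needs a token and a network connection):
# london <- stream_city("london")
```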
Increasing the time of stream_tweets()
The default duration for streams is thirty seconds (timeout = 30).
We can increase this by setting the timeout argument.
Please note that it is measured in seconds, so:
- Specify the stream duration in seconds
- 60 seconds = 1 minute
- 3600 seconds = 1 hour
- Use math to figure out how long you want to stream for
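The “use math” step can be written out in R so that the value passed to timeout is readable. The variable names below are just for illustration.

```r
# Build up durations in seconds from named units
minute <- 60
hour   <- 60 * minute
day    <- 24 * hour

timeout_90min <- 90 * minute   # an hour and a half
timeout_week  <- 7 * day       # one full week

timeout_90min
timeout_week

# Usage (needs a token and a network connection):
# st <- rtweet::stream_tweets(q = "", timeout = timeout_90min)
```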
indefinitely streaming tweets
Be careful with this because the file quickly becomes unwieldy. We’ll talk more about how to have stream_tweets() run indefinitely but save a different file every X minutes in future sessions.
Stream tweets indefinitely.
stream_tweets()
Keep in mind that the results of all of these stream_tweets() operations are just stored in the memory of your computer. You have not written anything to the hard disk to save it. If your computer crashes (or restarts) while streaming tweets, you’ve likely lost whatever your stream was capturing.
At the same time, if you are getting tons of errors (or RStudio is crashing) in this process it might be because you are trying to hold too large of a file in memory and your computer can’t handle this.
So, instead you can stream the tweets (as JSON data) directly to a text file.
You need to make sure you have set the working directory for your R session as this will be writing a file to your hard disk.
I keep a folder called “r” in my root directory, so that I can find these files when I need to.
As a side note, when I’m pulling a serious amount of data for research I try not to have this write to a folder that is in cloud storage (Dropbox, Box, SharePoint, iCloud, etc.).
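The chunk that writes the stream to disk is not shown. Here is a minimal sketch, assuming rtweet 0.7.x; the file name and the three-minute duration are taken from later in this section, and stream_to_file() is a hypothetical wrapper name.

```r
# Hypothetical wrapper: stream the random sample straight to a JSON file
# in the working directory instead of holding it in memory.
# parse = FALSE skips converting to a data frame while streaming.
stream_to_file <- function(file_name = "randomtweets3min.json",
                           timeout = 60 * 3) {
  rtweet::stream_tweets(
    q = "",
    timeout = timeout,
    file_name = file_name,
    parse = FALSE
  )
}

# Check where the file will land before you start:
getwd()

# Usage (needs a token and a network connection):
# stream_to_file()
```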
Once this finishes (and your file is saved) you’ve got yourself a nice sample. It’s saved and it’s safe. Good work!
Pat yourself on the back, take a break, have a snack, drink some water. The hard part of this is over.
Let’s look at what we streamed
You can navigate to the file you saved above in your computer’s file system and open it if you’d like. I use Brackets (download at brackets.io) as a text editor, and it can’t open a file larger than 16MB. These streaming files get big quickly, and 3 minutes of streaming (in the middle of the night) produced a file of 50.4MB, so I can’t open it in a normal text editor. If you are on a Windows computer, and have Notepad++ installed, that should be able to open it.
Either way, it’s not a big deal if you can’t open it and look at it because you’ll be processing it in R.
Read-in a streamed JSON file
You will be using the parse_stream() command to import the .json file.
This command is really finicky and I often have problems if I’ve created an extra large file and one of the keys is corrupted or something.
What R has to do here is convert a .json file to a ‘flat’ file. The .json file has entire tables within individual cells (e.g. the user-attributes) and the flat file puts them as columns adjacent to each other in a table.
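To make the nested-versus-flat idea concrete, here is a small sketch using jsonlite on two made-up records (not real tweets, and far simpler than a real one). It illustrates the idea only; it is not how parse_stream() works internally.

```r
library(jsonlite)

# Two made-up records, one JSON object per line, like a stream file.
ndjson <- c(
  '{"id_str":"1","text":"hello","user":{"screen_name":"a","followers_count":10}}',
  '{"id_str":"2","text":"world","user":{"screen_name":"b","followers_count":20}}'
)

# stream_in() reads line-delimited JSON; "user" comes back as a nested
# data frame stored inside a single column.
con <- textConnection(ndjson)
nested <- stream_in(con, verbose = FALSE)
close(con)
names(nested)

# flatten() spreads that nested data frame out into ordinary columns.
flat <- jsonlite::flatten(nested)
names(flat)
```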
Troubleshooting the read-in problem
If you get an error that looks like:
Error: lexical error: invalid char in json text. :"aoi_fj_hiraoka","name":",{"created_at":"Fri Jan 29 03:32:2 (right here) ------^
Do not panic.
You just need a different way to read in the tweets. There are many different ways to do this.
Try installing the jsonlite package
# load the jsonlite library (you may need to install it)
library(jsonlite)
# a stream file holds one JSON object per line, so read it with stream_in() rather than fromJSON()
r_object <- stream_in(file("randomtweets3min.json"))
Did you get another error? If so, we need to run a special function. Copy and paste the recover_stream.R script from GitHub user JBGRuber and then call it.
Here is my copy-paste of this function:
#' Recovers Twitter damaged stream data (JSON file) into parsed data frame.
#'
#' @param path Character, name of JSON file with data collected by
#' \code{\link{stream_tweets}}.
#' @param dir Character, name of a directory where intermediate files are
#' stored.
#' @param verbose Logical, should progress be displayed?
#'
#' @family stream tweets
recover_stream <- function(path, dir = NULL, verbose = TRUE) {
  # read file and split to tweets
  lines <- readChar(path, file.info(path)$size, useBytes = TRUE)
  tweets <- stringi::stri_split_fixed(lines, "\n{")[[1]]
  tweets[-1] <- paste0("{", tweets[-1])
  tweets <- tweets[!(tweets == "" | tweets == "{")]
  # remove misbehaving characters
  tweets <- gsub("\r", "", tweets, fixed = TRUE)
  tweets <- gsub("\n", "", tweets, fixed = TRUE)
  # write tweets to disk and try to read them in individually
  if (is.null(dir)) {
    dir <- paste0(tempdir(), "/tweets/")
    dir.create(dir, showWarnings = FALSE)
  }
  if (verbose) {
    pb <- progress::progress_bar$new(
      format = "Processing tweets [:bar] :percent, :eta remaining",
      total = length(tweets), clear = FALSE
    )
    pb$tick(0)
  }
  tweets_l <- lapply(tweets, function(t) {
    # only advance the progress bar if one was created above
    if (verbose) pb$tick()
    id <- unlist(stringi::stri_extract_first_regex(t, "(?<=id\":)\\d+(?=,)"))[1]
    f <- paste0(dir, id, ".json")
    writeLines(t, f, useBytes = TRUE)
    out <- tryCatch(rtweet::parse_stream(f),
                    error = function(e) {})
    if ("tbl_df" %in% class(out)) {
      return(out)
    } else {
      return(id)
    }
  })
  # test which ones failed
  test <- vapply(tweets_l, is.character, FUN.VALUE = logical(1L))
  bad_files <- unlist(tweets_l[test])
  # Let user decide what to do
  if (length(bad_files) > 0) {
    message("There were ", length(bad_files),
            " tweets with problems. Should they be copied to your working directory?")
    sel <- menu(c("no", "yes", "copy a list with status_ids"))
    if (sel == 2) {
      dir.create(paste0(getwd(), "/broken_tweets/"), showWarnings = FALSE)
      file.copy(
        from = paste0(dir, bad_files, ".json"),
        to = paste0(getwd(), "/broken_tweets/", bad_files, ".json")
      )
    } else if (sel == 3) {
      writeLines(bad_files, "broken_tweets.txt")
    }
  }
  # clean up
  unlink(dir, recursive = TRUE)
  # return good tweets
  return(dplyr::bind_rows(tweets_l[!test]))
}
Once you’ve run that, you’ll see over in your Global Environment that there is now a new function listed under “functions”.
Now we can call that function
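The call itself is not shown, so here is a minimal sketch. It assumes the recover_stream() function pasted above has already been run, that the stringi, progress, rtweet, and dplyr packages are installed, and that the stream file from earlier sits in your working directory. The object name recovered is just an example.

```r
# File written by the earlier streaming step
path <- "randomtweets3min.json"

# Only try the recovery if the file is actually there
if (file.exists(path)) {
  # recover_stream() is the function pasted above; it parses the tweets
  # one by one and returns the ones that could be read.
  recovered <- recover_stream(path)
  nrow(recovered)
}
```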
This process is quite slow, and it will stop to ask you what you want to do with the problematic tweets. It’s difficult to use these and I find it’s not worth trying.