Week 28 code

Getting Started

Most of this document is based on code written by Michael Kearney demonstrating how to use rtweet.

Authentication/authorization of API keys

Learn more about authentication and loading your API keys (with images) in CRAN’s rtweet auth vignette.

There are multiple ways to authenticate, but I tend to get errors with the browser-based method, so this guide uses the token-based authentication method.

  1. Navigate to developer.twitter.com/en/apps and select your Twitter app
  2. Click the tab labeled Keys and tokens to retrieve your keys.
  3. Locate the Consumer API keys (the API key and API secret key) and the Access token and secret.

store api keys (use this code, but replace the placeholder text with your own keys)

api_key <- "texthere"
api_secret_key <- "texthere"
access_token <- "texthere"
access_token_secret <- "texthere"

access token/secret method

Replace the app name (New2021proj) with whatever you named your app when you created it to get your API keys.

token <- create_token(
  app = "New2021proj",
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret)

Installing/loading packages & auth

If you haven’t already installed the rtweet package, do so now by installing {rtweet} from CRAN:

install.packages("rtweet")

Otherwise, just load {rtweet} and any other packages you need:

library(rtweet)
## load any other packages you may need
library(dplyr)
library(maps)
library(ggplot2)

Make sure your authentication with the API keys is loaded. If you completed the authentication above, you can simply run get_token() and it should confirm that your keys are authenticated by printing the token details.

get_token()
<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> New2021proj
  key:    rsSoV8bRT29xvOJR2k95gJ50t
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret
---

Searching for tweets with search_tweets

search_tweets()

Search for one or more keyword(s)

tacos <- search_tweets("tacos")
tacos
# A tibble: 100 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 319413… 13550189… 2021-01-29 05:04:24 eldanirive… "Ya … Twitt…
 2 437533… 13550189… 2021-01-29 05:04:18 youlovebri… "jus… Twitt…
 3 132092… 13550188… 2021-01-29 05:04:08 nuggetdepi… "qui… Twitt…
 4 121522… 13550188… 2021-01-29 05:04:03 Usura_Tacos "中国、… Twitt…
 5 164182… 13550187… 2021-01-29 05:03:50 Myrieth     "Ya … Twitt…
 6 265200… 13550187… 2021-01-29 05:03:50 kitsuruo    "qui… Twitt…
 7 193587… 13550187… 2021-01-29 05:03:43 AKs_tacos   "この暴… Twitt…
 8 118105… 13550187… 2021-01-29 05:03:42 MLAISSAG    "De … Twitt…
 9 217690… 13550187… 2021-01-29 05:03:41 carolvanes… "i w… Twitt…
10 309598… 13550187… 2021-01-29 05:03:34 carlosmayo… "Los… Twitt…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>


If you search for multiple words, there is an implicit AND between them:

cb <- search_tweets("cheap beer")
cb
# A tibble: 100 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 650513… 13550177… 2021-01-29 04:59:35 JohnMLatim… "@UR… Twitt…
 2 870504… 13550175… 2021-01-29 04:58:41 AllisonFar… "You… Twitt…
 3 127716… 13550135… 2021-01-29 04:42:50 hekticamer… "Onc… Twitt…
 4 153680… 13550125… 2021-01-29 04:38:57 DennisMuir… "@Na… Twitt…
 5 969688… 13550108… 2021-01-29 04:32:13 thwacked    "@Co… Twitt…
 6 476502… 13550099… 2021-01-29 04:28:47 Ryanlamber… "@Sa… Twitt…
 7 112678… 13550094… 2021-01-29 04:26:50 trixasis2   "You… Twitt…
 8 121632… 13550091… 2021-01-29 04:25:32 HornChick75 "@mi… Twitt…
 9 200172… 13550039… 2021-01-29 04:04:40 TheGreatDa… "@Bu… Twitt…
10 987353… 13549967… 2021-01-29 03:36:24 QuoteTomCr… "@ta… Cheap…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>

searching for an exact phrase

## single quotes around double quotes
ds <- search_tweets('"data science"')

## or escape the quotes
ds <- search_tweets("\"data science\"")
ds
# A tibble: 100 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 106198… 13550189… 2021-01-29 05:04:21 VeritasArd… "Dem… Twitt…
 2 353265… 13550189… 2021-01-29 05:04:19 bexxmodd    "Put… BexxP…
 3 108248… 13550189… 2021-01-29 05:04:15 epuujee     "Put… Puuje…
 4 108248… 13550132… 2021-01-29 04:41:43 epuujee     "Eng… Puuje…
 5 108248… 13550122… 2021-01-29 04:37:56 epuujee     "[10… Puuje…
 6 108248… 13550150… 2021-01-29 04:48:59 epuujee     "Dem… Puuje…
 7 108248… 13550106… 2021-01-29 04:31:19 epuujee     "Hap… Puuje…
 8 108248… 13550187… 2021-01-29 05:03:36 epuujee     "Her… Puuje…
 9 108248… 13550137… 2021-01-29 04:43:37 epuujee     "AIC… Puuje…
10 108248… 13550089… 2021-01-29 04:24:42 epuujee     "Can… Puuje…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>
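Both quoting styles above produce the identical R string, which you can verify without touching the API:

```r
## the single-quoted version and the escape-character version
## are the same string, so either works with search_tweets()
q1 <- '"data science"'
q2 <- "\"data science\""
identical(q1, q2)
#> [1] TRUE
```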

keywords and phrases

Search for keyword(s) and phrases

rpds <- search_tweets("rstats python \"data science\"")
rpds
# A tibble: 100 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 143562… 13550123… 2021-01-29 04:38:12 silentseaw… "Lin… "Twit…
 2 134739… 13550106… 2021-01-29 04:31:30 CoderRetwe… "Inf… ""    
 3 134739… 13549967… 2021-01-29 03:36:17 CoderRetwe… "Dat… ""    
 4 134739… 13549166… 2021-01-28 22:18:01 CoderRetwe… "Lin… ""    
 5 134739… 13549228… 2021-01-28 22:42:27 CoderRetwe… "The… ""    
 6 134739… 13549681… 2021-01-29 01:42:42 CoderRetwe… "Top… ""    
 7 126705… 13550082… 2021-01-29 04:22:00 _codenewbi… "Inf… "Code…
 8 126705… 13549174… 2021-01-28 22:21:00 _codenewbi… "Lin… "Code…
 9 126705… 13549778… 2021-01-29 02:21:00 _codenewbi… "#Da… "Code…
10 105164… 13549780… 2021-01-29 02:21:52 Fabriciosx  "⭕ I… "twit…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>

increasing number of results

  • search_tweets() returns the 100 most recent matching tweets by default

  • Increase n to return more (tip: use multiples of 100)

rbeer <- search_tweets("beer", n = 500)
#n can be up to 18000 (the 15-minute rate limit)
rbeer
# A tibble: 500 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 727469… 13550189… 2021-01-29 05:04:30 frontasal20 "@be… Twitt…
 2 110307… 13550189… 2021-01-29 05:04:30 92089204a   "【残り… Twitt…
 3 757239… 13550189… 2021-01-29 05:04:28 beer_naabo  "建売新… Twitt…
 4 757239… 13550176… 2021-01-29 04:59:29 beer_naabo  "じいち… Twitt…
 5 757239… 13550179… 2021-01-29 05:00:32 beer_naabo  "晴れて… Twitt…
 6 757239… 13550182… 2021-01-29 05:01:50 beer_naabo  "ブーブ… Twitt…
 7 517559… 13550182… 2021-01-29 05:01:40 Kirin_Brew… "@ma… Belug…
 8 517559… 13550172… 2021-01-29 04:57:34 Kirin_Brew… "@ar… Belug…
 9 517559… 13550174… 2021-01-29 04:58:21 Kirin_Brew… "@ma… Belug…
10 517559… 13550188… 2021-01-29 05:04:10 Kirin_Brew… "@ya… Belug…
# … with 490 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>

Please be mindful that the rate limit is 18,000 tweets per fifteen minutes: that is the most you can pull in one search, and further requests will return errors until the 15-minute window resets.
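If you need more than one window’s worth of tweets, rtweet can wait out the limit for you via the retryonratelimit argument, and rate_limit() reports how much quota you have left. A minimal sketch (these calls hit the live API and assume your token from above is loaded):

```r
library(rtweet)

## check remaining quota for the search endpoint in the current window
rl <- rate_limit(query = "search/tweets")
rl$remaining  # requests left in this window
rl$reset      # time until the window resets

## for pulls larger than 18,000 tweets, let rtweet pause at the limit
## and resume automatically
lots <- search_tweets("beer", n = 50000, retryonratelimit = TRUE)
```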

getting a lot more tweets

PRO TIP #1: Approximate the firehose for free by searching for tweets from either verified or non-verified accounts (together, that matches everything)

fff <- search_tweets("filter:verified OR -filter:verified", n = 3000) #could be n = 18000
fff
# A tibble: 2,889 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 131795… 13550189… 2021-01-29 05:04:36 z902pWuSsc… "@nu… Twitt…
 2 488373… 13550189… 2021-01-29 05:04:36 Rumefeller  "@2a… Twitt…
 3 127828… 13550189… 2021-01-29 05:04:36 34gr_       "أخر… Twitt…
 4 956079… 13550189… 2021-01-29 05:04:36 LesNadines  "@BF… Twitt…
 5 969528… 13550189… 2021-01-29 05:04:36 gurnd_blue  "2C6… グランブル…
 6 436210… 13550189… 2021-01-29 05:04:36 StahlAmy    "On … Goodr…
 7 788805… 13550189… 2021-01-29 05:04:36 GNSGRadio   "Now… GNSG …
 8 128852… 13550189… 2021-01-29 05:04:36 waengnamja  "@JI… Twitt…
 9 128863… 13550189… 2021-01-29 05:04:36 Seokjinnie… ".@B… Twitt…
10 132976… 13550189… 2021-01-29 05:04:36 shoakunoko… "トイレ… Twitt…
# … with 2,879 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>

plotting tweets

Visualize second-by-second frequency

ts_plot(fff, "secs")

ts_plot(dplyr::group_by(fff, is_retweet), "secs")
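ts_plot() returns a regular ggplot object, so you can style it with ordinary ggplot2 layers. A sketch, assuming the fff tweets pulled above are still in memory:

```r
library(ggplot2)

## ts_plot() output is a ggplot, so normal layers apply
ts_plot(fff, "secs") +
  theme_minimal() +
  labs(x = NULL, y = NULL,
       title = "Frequency of tweets, second by second")
```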

twitter search operators

You can combine any of the above commands to extract what you are searching for.

PRO TIP #2: Use search operators provided by Twitter, e.g.,

  • filter by language and exclude retweets and replies
rt <- search_tweets("tacos", lang = "en", 
  include_rts = FALSE, `-filter` = "replies")
  • filter only tweets linking to news articles
nws <- search_tweets("filter:news")

filtering in search_tweets

  • filter only tweets that contain links
links <- search_tweets("filter:links")
links
# A tibble: 100 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 816818… 13550190… 2021-01-29 05:05:02 MyYedammm   "สรุ…  Twitt…
 2 129630… 13550190… 2021-01-29 05:05:02 s125osanm   "@Ke… Twitt…
 3 290155… 13550190… 2021-01-29 05:05:02 XO_jellyDO  "최수종… Twitt…
 4 124856… 13550190… 2021-01-29 05:05:02 fallingflo… "190… Twitt…
 5 101297… 13550190… 2021-01-29 05:05:02 NorthAjith… "ஆகஸ… Twitt…
 6 881903… 13550190… 2021-01-29 05:05:02 A_ightK     "칠흑만… Twitt…
 7 774234… 13550190… 2021-01-29 05:05:02 kmlsantos_  "#BT… Twitt…
 8 855390… 13550190… 2021-01-29 05:05:02 Yeriel_hei  "CIA… Twitt…
 9 112664… 13550190… 2021-01-29 05:05:02 tongkonnee  "อี้…   Twitt…
10 109945… 13550190… 2021-01-29 05:05:02 Cerinn_n    "อยา… Twitt…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>
  • filter only tweets that contain video
vids <- search_tweets("filter:video")
vids
# A tibble: 82 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 118258… 13549601… 2021-01-29 01:10:40 xiaohuangs  "Mir… Twitt…
 2 366374… 13548703… 2021-01-28 19:13:58 RedoxRidwan "Tem… Twitt…
 3 133365… 13548694… 2021-01-28 19:10:22 SlashyTiger "@re… Twitt…
 4 129144… 13548674… 2021-01-28 19:02:17 Darasim984… "Tem… Twitt…
 5 126710… 13548518… 2021-01-28 18:00:34 SBB_2K      "New… Twitt…
 6 126710… 13524191… 2021-01-22 00:53:48 SBB_2K      "*IM… Twitt…
 7 149048… 13548304… 2021-01-28 16:35:18 MrBtheNige… "Tem… Twitt…
 8 287782… 13548170… 2021-01-28 15:42:11 mackeankan… "Tem… Twitt…
 9 462344… 13548141… 2021-01-28 15:30:37 Kloinsoffi… "Tem… Twitt…
10 373163… 13548035… 2021-01-28 14:48:33 AbdulsamodA "Tem… Twitt…
# … with 72 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>

tweets sent by screen names

  • filter only tweets sent from (from:{screen_name}) or to (to:{screen_name}) certain users
## vector of screen names
users <- c("cnnbrk", "AP", "nytimes", 
  "foxnews", "msnbc", "seanhannity", "maddow")
## then use search_tweets
tousers <- search_tweets(paste0("from:", users, collapse = " OR "))
tousers
# A tibble: 100 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 2836421 13550183… 2021-01-29 05:02:07 MSNBC       "\"T… Socia…
 2 2836421 13549700… 2021-01-29 01:50:05 MSNBC       ".@F… Socia…
 3 2836421 13549899… 2021-01-29 03:09:14 MSNBC       "TUN… Tweet…
 4 2836421 13550149… 2021-01-29 04:48:41 MSNBC       "Gam… Tweet…
 5 2836421 13549765… 2021-01-29 02:16:04 MSNBC       "Dem… Socia…
 6 2836421 13549685… 2021-01-29 01:44:05 MSNBC       "Som… Socia…
 7 2836421 13550138… 2021-01-29 04:44:05 MSNBC       "WAT… Wildm…
 8 2836421 13549730… 2021-01-29 02:02:03 MSNBC       "Dr.… Socia…
 9 2836421 13550108… 2021-01-29 04:32:04 MSNBC       "\"I… Socia…
10 2836421 13549830… 2021-01-29 02:41:43 MSNBC       "Liv… Wildm…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>
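If you are curious what paste0() actually hands to search_tweets(), the collapsed query is a single OR-joined string:

```r
users <- c("cnnbrk", "AP", "nytimes",
  "foxnews", "msnbc", "seanhannity", "maddow")

## "from:" is prefixed to each name, then everything is joined with " OR "
query <- paste0("from:", users, collapse = " OR ")
query
#> [1] "from:cnnbrk OR from:AP OR from:nytimes OR from:foxnews OR from:msnbc OR from:seanhannity OR from:maddow"
```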

filtering by popularity and source

  • filter only tweets with at least 100 favorites or 100 retweets
pop <- search_tweets(
  "(filter:verified OR -filter:verified) (min_faves:100 OR min_retweets:100)")
  • filter by the type of device that posted the tweet.
rt <- search_tweets("lang:en", source = '"Twitter for iPhone"')

search_tweets() with location

Search by geolocation (ex: tweets within 25 miles of Durham University)

durham25 <- search_tweets(
  geocode = "54.7649859,-1.5803916,25mi", n = 500
)
durham25
# A tibble: 500 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 745272… 13550190… 2021-01-29 05:04:49 PoetDeanwi… "@BT… Twitt…
 2 108111… 13550190… 2021-01-29 05:04:46 layzhangbe… "htt… Twitt…
 3 189566… 13550189… 2021-01-29 05:04:36 workidabbz  "@Ce… Twitt…
 4 189566… 13550078… 2021-01-29 04:20:23 workidabbz  "@hi… Twitt…
 5 189566… 13550136… 2021-01-29 04:43:19 workidabbz  "@Ce… Twitt…
 6 189566… 13550185… 2021-01-29 05:03:00 workidabbz  "@_e… Twitt…
 7 102308… 13550189… 2021-01-29 05:04:26 IRISHPACER_ "Can… Twitt…
 8 102308… 13550189… 2021-01-29 05:04:26 IRISHPACER_ "Ims… Twitt…
 9 134901… 13550188… 2021-01-29 05:04:06 Nichola620… "@ho… Twitt…
10 134243… 13550188… 2021-01-29 05:03:58 bibenson2   "Dav… Twitt…
# … with 490 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>

mapping geotagged tweets

Use lat_lng() to convert geographical data into lat and lng variables (single point)

setting up some basic parameters

I used Google Maps to get the lat/long of Durham. I set these as variables so that I can refer to them later.

# lat and long of Durham
xlong <- -1.5803916
ylat <- 54.7649859

# Where in the maps database is this lat and long? (create a variable for this)
region <- map.where(database = "world", xlong, ylat)

Mapping the geotagged tweets

#create lat/lng variables using all available tweet and profile geo-location data
durham25 <- lat_lng(durham25)

#notice how I use the region variable I created above and add to the xlong/ylat variables to set my extents?

maps::map("world", regions = region, fill = TRUE, col = "#ffffff",
          lwd = .25, mar = c(0, 0, 0, 0),
          xlim = c(xlong - 5, xlong + 5), ylim = c(ylat - 5, ylat + 5))
with(durham25, points(lng, lat, pch = 20, col = "red"))

This code plots geotagged tweets within 25 miles of Durham on a map of the UK.

Please note that if you were making a map of the United States, maps::map() has three databases for the USA ("usa", "state", and "county") but only one for "world"; see help(package='maps') for more details.
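
If you're curious what those databases contain, you can load each one without opening a plot window (a quick sketch; map(..., plot = FALSE) returns the polygon data rather than drawing it):

```r
library(maps)

# the three USA-specific databases, loaded without plotting
usa    <- map("usa",    plot = FALSE)  # mainland outline
state  <- map("state",  plot = FALSE)  # state boundaries
county <- map("county", plot = FALSE)  # county boundaries

# each is a list with x/y coordinates and region names
head(county$names)
```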

searching in an entire country

Search by geo-location. For example, find 5,000 tweets in the English language sent from the United States. Note: some countries and cities are hardcoded in the API, while lookup_coords() sometimes requires users to have a Google Maps API key.

Search for 5,000 tweets in English, sent from the US:

usa <- search_tweets(
  "lang:en", geocode = lookup_coords("usa"), n = 5000
)

These tweets are all geotagged. We’ll discuss more about geographic identifiers later.

Week 29: more on the Twitter API

Other things we can collect

Last week we discussed ways to use the search_tweets() function of the Twitter API. search_tweets() only allows you to go backward in time, and it can only collect data from Twitter’s Tweet object model.

Depending on what you are trying to collect, you may need to try one of these other functions.

User timelines with get_timeline()

Get the 100 most recent tweets posted by an individual user.

du <- get_timeline("durham_uni")

Get up to the most recent 3,200 tweets (endpoint max) posted by multiple users.

unis <- get_timeline(c("durham_uni", "NCL_Geography", "oiioxford"), n = 3200)

Plotting those

Group by screen_name and plot hourly frequencies of tweets.

Remember that %>% is a pipe and will not work if you have not loaded the dplyr library.

unis %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("hours")

User favorites with get_favorites()

Get up to the most recent 3,000 tweets favorited/liked by a user. These are the posts a user has clicked the heart button on.

Please note that it has to be the US spelling of “favorites”

dugeog_fav <- get_favorites("GeogDurham", n = 3000)

Lookup statuses with lookup_tweets()

If you look in any of the data frames in your Global Environment (such as dugeog_fav or unis), you’ll notice that one of the attributes you’ve been pulling is the status_id of the tweet.

This is like the identification number (or phone number) of that particular tweet.

## `lookup_tweets()`
status_ids <- c("1259377636146585600", "1190195226972962816",
  "1329132279264923650", "1111268201780989952")
twt <- lookup_tweets(status_ids)

Getting the users’ network

Friends/followers

Twitter’s API documentation distinguishes between friends and followers.

  • Friend refers to an account a given user follows
  • Follower refers to an account following a given user

Pulling a user’s friends with get_friends()

Get user IDs of accounts followed by (AKA friends) [@jack](https://twitter.com/jack), the co-founder and CEO of Twitter.

fds <- get_friends("jack")
fds
# A tibble: 4,538 x 2
   user  user_id            
   <chr> <chr>              
 1 jack  1354898820400877571
 2 jack  27058194           
 3 jack  257436924          
 4 jack  1282418324228337665
 5 jack  2190757022         
 6 jack  3291691            
 7 jack  14918591           
 8 jack  14115083           
 9 jack  38113183           
10 jack  536429909          
# … with 4,528 more rows

multiple users’ friends

Get friends of multiple users in a single call.

fds <- get_friends(
  c("durham_uni", "NCL_Geography", "oiioxford")
)
fds
# A tibble: 4,433 x 2
   user       user_id            
   <chr>      <chr>              
 1 durham_uni 270004438          
 2 durham_uni 34918353           
 3 durham_uni 4861601645         
 4 durham_uni 2330017078         
 5 durham_uni 14494181           
 6 durham_uni 1122968208389099520
 7 durham_uni 1016236315845824512
 8 durham_uni 57645871           
 9 durham_uni 186104486          
10 durham_uni 20324317           
# … with 4,423 more rows

get_followers()

Get user IDs of accounts following (AKA followers) [@GeogDurham](https://twitter.com/GeogDurham).

dudept_follow <- get_followers("GeogDurham")
dudept_follow
# A tibble: 3,554 x 1
   user_id            
   <chr>              
 1 3165094085         
 2 1354923680992792587
 3 1296509996092588034
 4 275015329          
 5 238203281          
 6 732612019283709954 
 7 316210273          
 8 1354503938628837376
 9 835404726          
10 1354381953848512515
# … with 3,544 more rows

large numbers of followers

get_followers()

Unlike friends (limited by Twitter to 5,000), there is no limit on the number of followers.

To get user IDs of all 64(ish) million followers of Justin Timberlake (@jtimberlake), you need two things:

  1. A stable internet connection
  2. Time – approximately seven days

It’s probably not a good idea to harvest an account like @jtimberlake unless you really need it for research.

But here is how you would do it anyway.

Get all of Justin Timberlake’s followers.

## seriously don't try this for fun
rdt <- get_followers(
  "jtimberlake", 
  n = 64100000, 
  retryonratelimit = TRUE
)

Lookup users

The lookup_users() function of rtweet pulls users’ profile information (including profile location) from the user object of the Twitter API.

Look up user-level (and most recent tweet) information associated with a vector of user_id or screen_name values (you can use either).

## vector of users
users <- c("durham_uni", "NCL_Geography", "oiioxford")

## lookup users twitter data
usr <- lookup_users(users)
usr
# A tibble: 3 x 90
  user_id status_id created_at          screen_name text  source
  <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
1 277391… 13548366… 2021-01-28 17:00:07 durham_uni  "📢 W… "Falc…
2 222989… 13548268… 2021-01-28 16:21:07 NCL_Geogra… "Has… "Twit…
3 499764… 13549006… 2021-01-28 21:14:18 oiioxford   "Eth… "Twit…
# … with 84 more variables: display_text_width <int>, reply_to_status_id <lgl>,
#   reply_to_user_id <lgl>, reply_to_screen_name <lgl>, is_quote <lgl>,
#   is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
#   quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
#   urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <chr>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
#   quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>

If you look through the output of usr, you’ll note that it pulls 90 different variables for the user, including their profile location.

search for users

Just as we searched for tweets at the beginning with the search_tweets() function, we can use search_users() in much the same way. Twitter will look for matches in user names, screen names, and profile bios.

## search for breaking news accounts
bkn <- search_users("breaking news")
bkn
# A tibble: 100 x 90
   user_id status_id created_at          screen_name text  source
   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
 1 6017542 13549545… 2021-01-29 00:48:35 BreakingNe… "Leg… "Twee…
 2 5402612 13549135… 2021-01-28 22:05:42 BBCBreaking "Cov… "Soci…
 3 428333  13549553… 2021-01-29 00:51:43 cnnbrk      "Cic… "Soci…
 4 173187… 13548627… 2021-01-28 18:43:52 NationBrea… "LAI… "Twee…
 5 923263… 13017168… 2020-09-04 03:00:50 ftbreaking… "Chi… "Soci…
 6 189155… 13549854… 2021-01-29 02:51:27 ChicagoBre… "Bla… ""    
 7 981100… 13549575… 2021-01-29 01:00:38 MirrorBrea… "Pio… "Twee…
 8 141387… 13547825… 2021-01-28 13:24:56 TelegraphN… "🚨 I… "Echo…
 9 874167… 13549048… 2021-01-28 21:30:53 SkyNewsBre… "Pri… "Twee…
10 386230… 13550030… 2021-01-29 04:01:23 gmanewsbre… "Vac… "Twee…
# … with 90 more rows, and 84 more variables: display_text_width <dbl>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
#   media_t.co <list>, media_expanded_url <list>, media_type <list>,
#   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
#   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
#   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
#   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
#   quoted_name <chr>, quoted_followers_count <int>,
#   quoted_friends_count <int>, quoted_statuses_count <int>,
#   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
#   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
#   retweet_source <chr>, retweet_favorite_count <int>,
#   retweet_retweet_count <int>, retweet_user_id <chr>,
#   retweet_screen_name <chr>, retweet_name <chr>,
#   retweet_followers_count <int>, retweet_friends_count <int>,
#   retweet_statuses_count <int>, retweet_location <chr>,
#   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
#   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
#   country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
#   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#   friends_count <int>, listed_count <int>, statuses_count <int>,
#   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
#   profile_banner_url <chr>, profile_background_url <chr>,
#   profile_image_url <chr>

Lists

In my opinion, lists are not used all that often, and you don’t necessarily need this, so feel free to skip this one…

using lists_memberships()

  • Get an account’s list memberships (lists that include an account). Sorry, I can’t think of a British account to put here… You all probably don’t read Nate Silver’s blog, fivethirtyeight.com, but it really is excellent at using statistics to explain and predict politics, sports, and science.
## lists that include Nate Silver
nsl <- lists_memberships("NateSilver538")
nsl
# A tibble: 200 x 11
   list_id name  uri   subscriber_count member_count mode  description slug 
   <chr>   <chr> <chr>            <int>        <int> <chr> <chr>       <chr>
 1 135494… US E… /zmi…                0           13 publ… ""          us-e…
 2 135493… Poli… /Dis…                0           91 publ… ""          poli…
 3 135488… News… /sco…                0           40 publ… ""          news…
 4 135486… inte… /Lut…                0            7 publ… "Interesti… inte…
 5 135484… Poli… /tim…                0           48 publ… ""          poli…
 6 135482… Poli… /Jon…                0           29 publ… "All thing… poli…
 7 135482… Poli… /eva…                0            1 publ… ""          poli…
 8 135445… covi… /pis…                0           34 publ… ""          covi…
 9 135435… USA … /DMC…                0           12 publ… ""          usa-…
10 135426… 538   /par…                0            3 publ… ""          538-…
# … with 190 more rows, and 3 more variables: full_name <chr>,
#   created_at <dttm>, following <lgl>

lists_members()

  • Get all list members (accounts on a list)

You can refer to a list either by its name or by its list_id, but you need at least one of the two.

slug is the name that displays on Twitter. You can identify a list by its slug instead of its numerical id. If you decide to do so, note that you’ll also have to specify the list owner using the owner_id or owner_user parameters.

## all members of a human geography list
cng <- lists_members(owner_user = "criticalens", slug = "Human-Geo")
cng
# A tibble: 50 x 40
   user_id name  screen_name location description url   protected
   <chr>   <chr> <chr>       <chr>    <chr>       <chr> <lgl>    
 1 128167… GSAG… gsag_aag    ""       "Graduate … http… FALSE    
 2 119332… The … theblackgeo ""       "A project… http… FALSE    
 3 988284… Dial… DialoguesHG ""       "Dialogues… http… FALSE    
 4 959102… Urba… GenUrbNetw… "Toront… "A SSHRC-f… http… FALSE    
 5 931286… ani … ani_Landau… "Naarm … "Researchi… <NA>  FALSE    
 6 890963… Lati… LatinxGeog  ""       "A group o… http… FALSE    
 7 869604… AAG … QTGAAG      "The Sp… "Twitter a… http… FALSE    
 8 867451… Plac… _pscollect… ""       "The Place… http… FALSE    
 9 861906… Mara… mara_ferre… "London… "Urban & h… http… FALSE    
10 846396… Arri… Arrianna_P… ""       "Medical G… http… FALSE    
# … with 40 more rows, and 33 more variables: followers_count <int>,
#   friends_count <int>, listed_count <int>, created_at <dttm>,
#   favourites_count <int>, utc_offset <lgl>, time_zone <lgl>,
#   geo_enabled <lgl>, verified <lgl>, statuses_count <int>, lang <lgl>,
#   contributors_enabled <lgl>, is_translator <lgl>,
#   is_translation_enabled <lgl>, profile_background_color <chr>,
#   profile_background_image_url <chr>,
#   profile_background_image_url_https <chr>, profile_background_tile <lgl>,
#   profile_image_url <chr>, profile_image_url_https <chr>,
#   profile_banner_url <chr>, profile_link_color <chr>,
#   profile_sidebar_border_color <chr>, profile_sidebar_fill_color <chr>,
#   profile_text_color <chr>, profile_use_background_image <lgl>,
#   has_extended_profile <lgl>, default_profile <lgl>,
#   default_profile_image <lgl>, following <lgl>, follow_request_sent <lgl>,
#   notifications <lgl>, translator_type <chr>

Exporting any one of these data frames

Let’s say you wanted a .csv of any of the data frames you’ve created (the ones listed in your Global Environment).

At any time you can write one to a .csv:

write_as_csv(dugeog_fav, "favs.csv")

Streaming tweets

This one is really important. This is likely what you will use most to capture an emerging issue.

This is what I use the most. It uses Twitter’s streaming API to just listen and harvest tweets as they happen.

Please note: it cannot go back in time and retrieve tweets that have already happened; it can only go forward. But it’s very powerful, as you can target it specifically and set it up to capture your data for you.

using stream_tweets() for a small random sample

Sampling: small random sample (~ 1%) of all publicly available tweets

The default is that this will timeout after 30 seconds of streaming. We’ll adjust that later.

ss <- stream_tweets("")

Filtering: search-like query (up to 400 keywords)

Keep in mind this is like an implicit OR statement between the terms; it will harvest anything containing any one of them. These could be hashtags or words in the tweet itself.

sf <- stream_tweets("durham,tacos,geography,EatTheRich")

streaming from users

You could stream from specific users. You could list out all the users that you want to stream from as a vector.

Tracking: vector of user ids (you could have up to 5000 user_ids)

As the default is 30 seconds, this doesn’t really work unless the accounts tweet within those 30 seconds. Unless this is for someone who tweets constantly (like Donald Trump), it’s pretty useless.

## user IDs from accounts with "breaking news" from above
st <- stream_tweets(bkn$user_id)

streaming tweets from a geographic area

Location: geographical coordinates (1-360 degree location boxes)

Let’s say you wanted to limit your stream to a specific city or geographic area. Perhaps you were interested in a local event that was going to happen. You would draw a bounding box on the earth and stream anything within that geographic window.

from OSM’s wiki: A bounding box (usually shortened to bbox) is an area defined by two longitudes and two latitudes, where:

Latitude is a decimal number between -90.0 and 90.0. Longitude is a decimal number between -180.0 and 180.0.

They usually follow the standard format of:

left, bottom, right, top (i.e., min longitude, min latitude, max longitude, max latitude)

For example, Greater London is enclosed by: -0.489,51.28,0.236,51.686
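
To make the left/bottom/right/top ordering concrete, here is a tiny helper (my own, not part of rtweet) that checks whether a longitude/latitude point falls inside a bbox written that way. The Greater London box is the OSM wiki example; the test points are illustrative coordinates for central London and Durham:

```r
# hypothetical helper: is a lng/lat point inside a bbox given as
# c(min_lng, min_lat, max_lng, max_lat)?
in_bbox <- function(lng, lat, bbox) {
  lng >= bbox[1] & lng <= bbox[3] &
  lat >= bbox[2] & lat <= bbox[4]
}

london <- c(-0.489, 51.28, 0.236, 51.686)
in_bbox(-0.1276, 51.5072, london)  # central London: TRUE
in_bbox(-1.5804, 54.7650, london)  # Durham: FALSE
```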

Let’s start by streaming all the tweets for the whole world

## world-wide bounding box
sl <- stream_tweets(c(-180, -90, 180, 90))

Please note that this returns a smaller number of tweets than the open stream you ran before, because many people have privacy settings that prevent you from identifying their location.

Some folks made a handy tool to draw bounding boxes on the earth and get the coordinates. Visit https://boundingbox.klokantech.com/ and select “csv” in the drop-down. Draw and move the rectangle over the area you want to harvest, then copy the line of coordinates shown in the ‘CSV:’ window.

for the random square I drew over the UK, Europe, and parts of North Africa: -12.1,17.8,42.0,58.2

Please note that if this area is too small and nobody is tweeting during the time you are streaming, there will be nothing returned.

## that random bounding box I drew above
randoUK <- stream_tweets(c(-12.1, 17.8, 42.0, 58.2))

Again, note that this is only streaming for 30 seconds, so unless the area is huge or there are some very active users who geotag their tweets, this isn’t going to pull up much information.

(Side note: I’m also creating this in the middle of the night in Europe/UK/North Africa, when people are more likely to be sleeping than tweeting.)

finding cities with lookup_coords()

A useful convenience function–though it now requires a Google Maps API key–for quickly looking up coordinates. You have to store a credit card number for most Google Maps API key functions, so we’re not going to do that.

To enable basic use of lookup_coords() without requiring a Google Maps API key, a number of major cities throughout the world, plus ‘world’ and ‘usa’, are baked into the function. If ‘world’ is supplied, a bounding box of maximum latitude/longitude values, i.e., c(-180, -90, 180, 90), and a center point c(0, 0) are returned. If ‘usa’ is supplied, estimates of the United States’ bounding box and mid-point are returned.

To specify a city, provide the city name followed by a space and then the US state abbreviation or country name. To see a list of all included cities, enter rtweet:::citycoords in the R console to see coordinates data.

Let’s try that:

rtweet:::citycoords
# A tibble: 747 x 3
   city                    lat    lng
   <chr>                 <dbl>  <dbl>
 1 aberdeen scotland      57.2  -2.15
 2 aberdeen               57.2  -2.15
 3 aberdeen scotland      57.2  -2.15
 4 adelaide australia    -34.9 139.  
 5 adelaide              -34.9 139.  
 6 adelaide australia    -34.9 139.  
 7 algiers algeria        36.8   3   
 8 algiers                36.8   3   
 9 algiers algeria        36.8   3   
10 amsterdam netherlands  52.4   4.88
# … with 737 more rows

For example, if you wanted to just stream tweets from London, it would look like this:

## stream tweets sent from london
luk1 <- stream_tweets(q = lookup_coords("London, UK"), timeout = 60)

## search tweets sent from london
luk2 <- search_tweets(geocode = lookup_coords("London, UK"), n = 1000)

Increasing the time of stream_tweets()

The default duration for streams is thirty seconds (timeout = 30). We can increase this with the timeout argument. Please note that it is measured in seconds, so:

  • Specify specific stream duration in seconds
  • 60 seconds = 1 minute
  • 3600 seconds = 1 hour
  • Use math to figure out how long you want to stream for
## stream for 3 minutes
stm <- stream_tweets(timeout = 60 * 3)

indefinitely streaming tweets

Be careful with this because the file quickly becomes unwieldy. We’ll talk more about how to have stream_tweets() run indefinitely but save a different file every X minutes in future sessions.

Stream tweets indefinitely.

stream_tweets(timeout = Inf, 
  file_name = "myfilename.json",
  parse = FALSE)

stream_tweets()

Keep in mind that the results of all of these stream_tweets() operations are stored only in your computer’s memory. You have not written anything to the hard disk to save them. If your computer crashes (or restarts) while streaming tweets, you’ve likely lost whatever your stream was capturing.

At the same time, if you are getting tons of errors (or RStudio is crashing) in this process it might be because you are trying to hold too large of a file in memory and your computer can’t handle this.

So, instead you can stream the tweets (as JSON data) directly to a text file.

You need to make sure you have set the working directory for your R session as this will be writing a file to your hard disk.

I keep a folder called “r” in my root directory, so that I can find these files when I need to.

As a side note, when I’m pulling a serious amount of data for research I try not to have this write to a folder that is in cloud storage (Dropbox, Box, Sharepoint, icloud, etc…)

setwd("~/folder/folder")
stream_tweets(timeout = 60 * 3, 
  file_name = "randomtweets3min.json",
  parse = FALSE)

Once this finishes (and your file is saved) you’ve got yourself a nice sample. It’s saved and it’s safe. Good work!

Pat yourself on the back, take a break, have a snack, drink some water. The hard part of this is over.

Let’s look at what we streamed

You can navigate to the file you saved above in your computer’s file system and open it if you’d like. I use Brackets (download at brackets.io) as a text editor, and it can’t open a file larger than 16MB. These streaming files get big quickly: 3 minutes of streaming (in the middle of the night) produced a file of 50.4MB, so I can’t open it in a normal text editor. If you are on a Windows computer and have Notepad++ installed, that should be able to open it.

Either way, it’s not a big deal if you can’t open it and look at it because you’ll be processing it in R.

Read-in a streamed JSON file

You will be using the parse_stream() function to import the .json file.

This function is really finicky, and I often have problems if I’ve created an extra-large file and one of the keys is corrupted or something.

What R has to do here is convert a .json file to a ‘flat’ file. The .json file has entire tables within individual cells (e.g. the user-attributes) and the flat file puts them as columns adjacent to each other in a table.
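
To see what that flattening looks like on a tiny scale, here is a sketch using the jsonlite package with a made-up two-tweet JSON string (the field names mimic, but are not taken from, real Twitter data):

```r
library(jsonlite)

# a made-up, two-record sample with a nested "user" object per tweet
txt <- '[
  {"text": "hello", "user": {"screen_name": "a", "followers_count": 10}},
  {"text": "world", "user": {"screen_name": "b", "followers_count": 20}}
]'

# fromJSON() reads the nested structure; flatten() turns the nested
# user fields into adjacent columns like "user.screen_name"
df <- flatten(fromJSON(txt))
names(df)
```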

rj <- parse_stream("randomtweets3min.json")

Troubleshooting the read-in problem

If you get an error that looks like:

Error: lexical error: invalid char in json text. :"aoi_fj_hiraoka","name":",{"created_at":"Fri Jan 29 03:32:2 (right here) ------^

Do not panic.

You just need a different way to read in the tweets. There are many different ways to do this.

Try installing the jsonlite package

# load the jsonlite library (you may need to install it)
library(jsonlite)

r_object <- fromJSON(readLines("randomtweets3min.json"))

Did you get another error? If so, we need to run a special function. Copy and paste the recover_stream.R script from GitHub user JBGRuber and then call it.

Here is my copy-paste of this function:

#' Recovers Twitter damaged stream data (JSON file) into parsed data frame.
#'
#' @param path Character, name of JSON file with data collected by
#'   \code{\link{stream_tweets}}.
#' @param dir Character, name of a directory where intermediate files are
#'   stored.
#' @param verbose Logical, should progress be displayed?
#'
#' @family stream tweets
recover_stream <- function(path, dir = NULL, verbose = TRUE) {

  # read file and split to tweets
  lines <- readChar(path, file.info(path)$size, useBytes = TRUE)
  tweets <- stringi::stri_split_fixed(lines, "\n{")[[1]]
  tweets[-1] <- paste0("{", tweets[-1])
  tweets <- tweets[!(tweets == "" | tweets == "{")]
  
  # remove misbehaving characters
  tweets <- gsub("\r", "", tweets, fixed = TRUE)
  tweets <- gsub("\n", "", tweets, fixed = TRUE)
  
  # write tweets to disk and try to read them in individually
  if (is.null(dir)) {
    dir <- paste0(tempdir(), "/tweets/")
    dir.create(dir, showWarnings = FALSE)
  }

  if (verbose) {
    pb <- progress::progress_bar$new(
      format = "Processing tweets [:bar] :percent, :eta remaining",
      total = length(tweets), clear = FALSE
    )
    pb$tick(0)
  }

  tweets_l <- lapply(tweets, function(t) {
    pb$tick()
    id <- unlist(stringi::stri_extract_first_regex(t, "(?<=id\":)\\d+(?=,)"))[1]
    f <- paste0(dir, id, ".json")
    writeLines(t, f, useBytes = TRUE)
    out <- tryCatch(rtweet::parse_stream(f),
                    error = function(e) {})
    if ("tbl_df" %in% class(out)) {
      return(out)
    } else {
      return(id)
    }
  })

  # test which ones failed
  test <- vapply(tweets_l, is.character, FUN.VALUE = logical(1L))
  bad_files <- unlist(tweets_l[test])

  # Let user decide what to do
  if (length(bad_files) > 0) {
    message("There were ", length(bad_files),
            " tweets with problems. Should they be copied to your working directory?")
    sel <- menu(c("no", "yes", "copy a list with status_ids"))
    if (sel == 2) {
      dir.create(paste0(getwd(), "/broken_tweets/"), showWarnings = FALSE)
      file.copy(
        from = paste0(dir, bad_files, ".json"),
        to = paste0(getwd(), "/broken_tweets/", bad_files, ".json")
      )
    } else if (sel == 3) {
      writeLines(bad_files, "broken_tweets.txt")
    }
  }

  # clean up
  unlink(dir, recursive = TRUE)

  # return good tweets
  return(dplyr::bind_rows(tweets_l[!test]))
}

Once you’ve run that, you’ll see in your Global Environment that there is now a new function listed under “Functions”.

Now we can call that function

rs <- recover_stream("randomtweets3min.json")

This process is quite slow, and it will stop to ask you what you want to do with the problematic tweets. It’s difficult to use these and I find it’s not worth trying.