Summative Code Guide

Monica Stephens

2021-02-08

Social Research Methods Summative Assignment

Practical I–Crowdsourcing & Social Media

Spring, 2021

Staff

Summative Deadline: Thursday 25th Feb 2021 by 4:00pm

Length: 1 website/.html file (equivalent of 6-pages of mixed media) and a brief readme file

Deliverables:

  • A ‘readme’ document with your name, the title of your project, a brief description of your project along with a brief reflection on your experience producing it and any challenges you faced. This document needs to be between 20 and 200 words and needs to be uploaded to the Turnitin assignment for the practical summative.

  • A webpage (html file) knitted from RMarkdown that houses your text, R code, maps, results of your analysis, and citations. This needs to be uploaded to the dropbox for our practical on Duo.

Summative Details

new Website: Assessment Criteria

The assignment and marking criteria can be found here and on Duo.

Please note that this guide is simply an educational supplement. I often take a more colloquial tone than you should in your submitted work. I also include iterative results and show results/information that you should suppress. I include these so that you can check your work and make sure you are on the right track.

Preparing your website

RMarkdown

As with the formative, you’ll be submitting a .html file. You can refer to the formative guide for more information on using a template or getting set up with RMarkdown.

The .html file is a web page that was knitted from your RMarkdown document. The advantage of this is that it automatically does an excellent job addressing some of the criteria in the new “webpage assessment criteria”

When you click on the knit button to knit your HTML file, you convert the .Rmd file (R Markdown) to an elegant .html file that includes all of your code and results. This will also tell you if you have any errors in your code.

The html and rmd files will be saved in your working directory.

It’s generally a good idea to start any project by setting your working directory with setwd("~/Path/to/working/folder").

If your global environment is crowded with objects you don’t need, clear them out before getting started with rm(list=ls()).

A few tips/notes with RMarkdown:

  • Including formulas (like I do in the LQ section) is a little complicated as it uses LaTeX, which is separate software that you may not have installed. Do not feel obligated to include fancy formulas.
  • Even when using a template (like this one), there are lots of ways you can make your markdown a little slicker (see this link for some fun suggestions).
  • I mention this in the bibliography section, but you need to add a reference to where you will store references in the header of your document. The header is called the YAML header (I refer to it as the YML).
  • It’s generally a good idea to load all the packages you’ll need at the very beginning of a document, rather than when you are using them. In this guide I load them as I go so that you can consume it one bit per week, but that is still not good practice.

Add citations

Install the knitcitations package (if necessary):

cleanbib() prevents duplicates from being in the bibliography

library("knitcitations")
library("bibtex")
cleanbib()
options(citation_format = "pandoc", hyperlink = "to.bib", max.names = 4, cite.style = "authoryear")

While knitcitations is a wonderful package, its documentation is not great. The package lets you create unique tags for in-text citations (as you would when writing a paper), and have R automatically save these references for you!

# generate a tag to reference a website by just pasting in the URL
knitcitations::citep("http://www.floatingsheep.org")
[1] "[@greycite414460]"
citep("https://www.gq.com/story/eat-the-rich-digital-generation")
[1] "[@greycite414450]"
citep("https://github.com/VictimOfMaths")
[1] "[@greycite415039]"
# or use the DOI of an academic paper or blog
citep("10.1080/17549175.2015.1080752")
[1] "[@Anselin_2015]"
# you can even generate an in-text citation
citet("https://towardsdatascience.com/breaking-down-geocoding-in-r-a-complete-guide-1d0f8acd0d4b")
[1] "@greycite414532"
# write these to a new bibtex file:
write.bibtex(entry = NULL, file = "bibliography.bib", append = FALSE)

# also write one file to cite all of the R packages you use
write.bib(c(.packages()), file = "packages.bib", append = FALSE)

The output for each of these is a unique key for that reference. Now when I discuss (Anselin & Williams, 2015) I can just type the key that it generated, [@Anselin_2015], and it will be replaced with an in-text citation.

It will also do the same thing for packages…

If you are unsure what the unique key is for a reference,

  • you can just find the file (in the same folder with your .Rmd and your .html file)
  • called bibliography.bib in my example
  • open it, the references are listed, and the key is listed at the top of each reference
    • floatingsheep.org is now @greycite414460

Setting up markdown for citations

Download a .csl for whatever reference style you prefer:

  • Download a csl from (https://www.zotero.org/styles?format=author-date).
  • I chose the “American Geophysical Union” style, so I downloaded a file called american-geophysical-union.csl.
  • This .csl file has to be stored in the same directory/working folder as your .Rmd file and your .html, otherwise your file will NOT knit.

We need to call the .csl style in the YML at the top of the page. We also need to call the .bib files that hold all of the bibliographic references.

I called mine “bibliography.bib” and “packages.bib”

The YML of this document (lines 1-10 at the very top), looks like:

---
title: "Summative Draft Guide"
author: "Monica Stephens"
date: "2021-02-08"
bibliography: [bibliography.bib, packages.bib]
csl: american-geophysical-union.csl
output:
  rmdformats::downcute:
    self_contained: true
---

I’ve also used the code below to automatically generate a reference list at the end of the document to cite the packages I used. I did this with the following code chunk.

{r generateBibliography, results="asis", echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}
cleanbib()
knitcitations::read.bibtex(file = "bibliography.bib", check = FALSE)
knitcitations::read.bibtex(file = "packages.bib")

The code chunk options (the area between the “{curly brackets}” that begins with “r”) are really important, but they are not visible on the output of the Markdown document

This is what each of these means:

  • r generateBibliography - the ‘r’ indicates it is in R’s language/syntax. The rest is the label of the chunk.
    • When it knits, you’ll see the name and the progress towards it running in the RMarkdown console
    • While the chunk names can be anything (Susan, Chad, Kittens, etc), they have to be unique, you can’t have two chunks called “kitten”
  • results="asis" - What to do with the results of the code block. “asis” means the bibliography will be printed as text, as if I had typed the results into the markdown document.
    • This is really important for the bibliography because otherwise it will be in a “results chunk” (e.g. ## Author, title)
  • echo=FALSE - whether to print the code chunk (the source code) in the document or hide the code
    • echo=FALSE means you do NOT see the code
    • echo=TRUE is the default, this means you do see the code.
      • There are other areas of the document where I have also suppressed the code that I wrote to make the document more elegant.
  • eval=TRUE - this tells R whether to evaluate (run) the code chunk: TRUE runs it, and eval = FALSE skips it.
    • the geocode section (#geocode) is set to eval = FALSE
    • You should never have a section where eval = FALSE and also echo = FALSE
  • message=FALSE - Whether to print messages (like what you see in the console when this code runs) in the output of the document.
    • Because these should almost always be set to FALSE, I have this set as a global option in the r setup chunk (the very first chunk of the document, where I load the knitr library and the template (rmdformats) library).
  • warning=FALSE - Just like messages, this is whether to print warnings; it is also set to FALSE in the global options.
  • cache=FALSE - Whether to reuse the saved results of a chunk instead of re-running it. I have this set to TRUE as the default for this document, but you probably want FALSE for any chunk that needs to run every time the document knits.
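
A minimal sketch of what a setup chunk with global options might look like. The values are assumptions based on the description above, not the actual chunk from this guide; `opts_chunk$set()` is the knitr function for setting defaults that individual chunks can still override.

```r
# Placed in the very first chunk of the .Rmd (a sketch; adjust to taste)
library(knitr)
opts_chunk$set(
  message = FALSE,  # hide messages such as package start-up notes
  warning = FALSE,  # hide warnings
  cache   = TRUE    # reuse saved results for chunks that have not changed
)
```

Any chunk can then override a default in its own header, for example `{r harvest, cache=FALSE}` (the chunk name here is hypothetical).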

Read about these options and more here

Summative sections

Harvesting the data

Develop the code necessary to harvest a sufficient sample of tweets for each of these topics.

Depending on the event you are studying and the time you have to collect data (with a stable internet connection), decide if you want to stream the data or use the search_tweets function.

You most likely want to search the data, but just explain this decision in the text.

See the script from weeks 28 and 29 for more information on pulling phrases.

Make sure that you have your authentication with the API keys loaded. If you did the authentication previously, you can just enter get_token() and it should ensure your keys are authenticated.

Identifying key phrases

Check the news. Look through tweets that you find interesting. You can pick a hashtag, a word or a phrase. Ideally you do not want a phrase that is so large it is everywhere (e.g. COVID), or so small that it barely appears (e.g. a phrase used only among people in your college). Each phrase is basically a demographic proxy for something.

You need a second phrase as it gives a way to normalize the first one… Be sure to adequately describe these events/phenomena, why they are related or juxtaposed, why it is interesting to look at, and a brief justification for using Twitter to trace this.

Use appropriate filters and explain why you used them. For example, a proxy for rich people might be iPhone users tweeting about “wine”:

library(rtweet)
rich_wine <- search_tweets("wine", source = "\"Twitter for iPhone\"", lang = "en", 
    geocode = "54.00366,-2.547855,300mi", n = 10000, token = bearer_token())

While the opposing term might be people who want to eat the rich.

etr <- search_tweets("eattherich OR \"eat the rich\"", geocode = "54.00366,-2.547855,300mi", 
    n = 10000)

There are fewer search terms limiting this because not many people are tweeting the phrase “eattherich” in the UK. This includes people tweeting or with profiles including “eattherich” and “eat the rich”

These terms could be contextualized with a short summary of the “eat the rich” digital trend (from @gqmagazine, 2019)

Learn more about limiting your search with the rtweet documentation

Identifying a geographic area

In order to get a large enough sample size, use an entire country or continent (US, UK, Europe).

In the week 28/29 code we did this with either a bounding box or a search radius, depending on whether you are searching or streaming the tweets.
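
The two ways of describing an area can be built from the same centre point. This is a sketch under assumptions: the "lat,lng,radius" string is the format used in the search calls in this guide, while the west, south, east, north ordering of the bounding box is my understanding of what stream_tweets() expects, so check ?stream_tweets before relying on it. The 5-degree half-width is arbitrary.

```r
# Centre of the study area (the same UK point used elsewhere in this guide)
centre_lat <- 54.00366
centre_lng <- -2.547855

# Search radius: a single "lat,lng,radius" string
search_geocode <- paste0(centre_lat, ",", centre_lng, ",300mi")

# Bounding box: four numbers around the same centre
# (assumed order: west, south, east, north)
half_width <- 5
stream_bbox <- c(centre_lng - half_width, centre_lat - half_width,
                 centre_lng + half_width, centre_lat + half_width)

search_geocode
stream_bbox
```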

Mapping the geotagged tweets

# make sure your mappy packages are loaded
library(dplyr)
library(maps)
library(ggplot2)

# a lat and long in the UK
xlong <- -2.547855
ylat <- 54.00366

# Where in maps database is this lat and long? (create a variable for this)
region <- map.where(database = "world", xlong, ylat)

# create lat/lng variables using all available tweet and profile geo-location
# data
etr <- lat_lng(etr)


# notice how I use the region variable I created above and add to the xlong/ylat
# variables to set my extents?

maps::map("world", regions = region, fill = TRUE, col = "#ffffff", lwd = 0.25, mar = c(0, 
    0, 0, 0), xlim = c((xlong - 5), (xlong + 5)), ylim = c(ylat - 5, ylat + 5))
with(etr, points(lng, lat, pch = 20, col = "blue"))

Let’s do that again for our rich wine tweets

# create lat/lng variables using all available tweet and profile geo-location
# data
rich_wine <- lat_lng(rich_wine)

# notice how I use the region variable I created above and add to the xlong/ylat
# variables to set my extents?

maps::map("world", regions = region, fill = TRUE, col = "#ffffff", lwd = 0.25, mar = c(0, 
    0, 0, 0), xlim = c((xlong - 5), (xlong + 5)), ylim = c(ylat - 5, ylat + 5))
with(rich_wine, points(lng, lat, pch = 20, col = "red"))

Please note: if you were making a map of the United States, maps::map() has 3 databases for the USA and only one for “world”; see help(package='maps') for more details.

Cleaning the data

You likely already did this to get the map above, but be sure you have added the lat/lng variables before continuing

# create lat/lng variables using all available tweet and profile geo-location
# data
etr <- lat_lng(etr)
rich_wine <- lat_lng(rich_wine)

We have way too many attributes that we’re never going to use. It’s difficult to see how much geographic information we have about these tweets when there are 92 attributes.

Let’s look at what we have:

library(dplyr)
allatt <- as_tibble(names(rich_wine))
allatt
# A tibble: 92 x 1
   value               
   <chr>               
 1 user_id             
 2 status_id           
 3 created_at          
 4 screen_name         
 5 text                
 6 source              
 7 display_text_width  
 8 reply_to_status_id  
 9 reply_to_user_id    
10 reply_to_screen_name
# … with 82 more rows

Let’s reduce these to the ones we are using (you may choose different ones)

### which data set are we reducing?

## here is a list of just the attributes I want to keep (change as you wish)
locattributes <- c("status_id", "screen_name", "quoted_location", "retweet_location", 
    "place_name", "place_full_name", "place_type", "country", "country_code", "location", 
    "lat", "lng")

## This selects just the column names in rich_wine that are also in locattributes
## to be in our new dataset called wineattr
wineattr <- rich_wine[, (colnames(rich_wine) %in% locattributes)]

etrattr <- etr[, (colnames(etr) %in% locattributes)]

Notice that many of the tweets are missing geographic coordinates (the lat and lng columns). Let’s calculate what percentage have geographic coordinates:

# what share of the values are NOT blank in the lng column?
count_etr_lng <- length(which(!is.na(etrattr$lng)))
count_etr_all <- nrow(etrattr)
count_etr_lng/count_etr_all
[1] 0.008695652
count_wine_lng <- length(which(!is.na(wineattr$lng)))
count_wine_all <- nrow(wineattr)
count_wine_lng/count_wine_all
[1] 0.05322129

This means that just under 1% of the “eat the rich” tweets and a little over 5% of the wine tweets have geotags. Of course, each time you run this code, you’ll get different numbers for the percentages.
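
A quick way to get the share directly is to take the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. The sketch below uses a made-up ten-row data frame to show how the share of geotagged tweets differs from the ratio of geotagged to non-geotagged tweets; the two are close when geotags are rare but drift apart as they become common.

```r
# Toy data: 2 of 10 tweets have a longitude (values are made up)
toy <- data.frame(lng = c(-1.5, NA, NA, NA, -0.1, NA, NA, NA, NA, NA))

# Share of all tweets that are geotagged
share_geotagged <- mean(!is.na(toy$lng))

# Ratio of geotagged to non-geotagged tweets (not the same thing)
ratio_geo_to_nongeo <- sum(!is.na(toy$lng)) / sum(is.na(toy$lng))

share_geotagged
ratio_geo_to_nongeo
```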

If this were my “actual” summative, I’d want to explain/hypothesize as to why there are so many more tweets for wine that are geocoded than eat the rich. What do you think the impact of this disparity might be on the results of the study? Be sure to keep this explanation under 200 words.

As a side note, this is a good time to save the data you have downloaded as a csv just in case R crashes. You wouldn’t include this code in the summative (or even mention this).

You should hide a code chunk like this with echo=FALSE, I left it here in case you needed a reminder of how to do it.

write.csv(wineattr, "wineattr.csv")
write.csv(etrattr, "etrattr.csv")

Geocoding

You may want to increase the sample size you are pulling by geocoding the profile information.

Despite what many online instructions tell you, unless you have a credit card, you cannot use the Google Maps API for geocoding, which is the geocoder in many mapping packages. Sadly, this is the default in the ggmap library.

Read more about geocoding (in Titorchuk, 2020)

Let’s find which data we need to geocode first

# keeping only the rows that have no lat/long yet (lng is NA)
df <- etrattr[is.na(etrattr$lng), ]

# removing rows where the profile location is blank or missing
dfhasloc <- df[!is.na(df$location) & df$location != "", ]

# removing duplicated locations
new_df <- dfhasloc[!duplicated(dfhasloc$location), ]

# repeating the same for the wine dataframe
wdf <- wineattr[is.na(wineattr$lng), ]
wdfhasloc <- wdf[!is.na(wdf$location) & wdf$location != "", ]
wnew_df <- wdfhasloc[!duplicated(wdfhasloc$location), ]
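
A note on row filtering: dropping rows with a negative which(), as in df[-which(condition), ], silently returns zero rows when nothing matches the condition, because -integer(0) selects nothing. Filtering with a logical condition does not have this problem. A small sketch with made-up locations:

```r
# Toy data in which no location is blank
toy <- data.frame(location = c("Leeds", "York", "Durham"),
                  stringsAsFactors = FALSE)

# which() finds no blank rows, so it returns integer(0)
blank <- which(toy$location == "")

# Negative indexing with integer(0) drops EVERYTHING
n_with_which <- nrow(toy[-blank, , drop = FALSE])

# Logical indexing keeps the rows we expect
n_with_logical <- nrow(toy[toy$location != "", , drop = FALSE])

n_with_which
n_with_logical
```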

Take a look through these locations in your new dataframe… (you can open it from the Global Environment)

Locations from Profiles
x
West Sussex, England
Little Venice, Westminster, UK
Liverpool
Co Armagh
Taunton, England
Essex UK
Salisbury, UK
Iver, UK
Edinburgh, Scotland
United Kingdom
Suffolk, UK
European Funguy Metroplex
Lancaster, England
hackney
The Cloud

Do you think these will geocode? Possibly not successfully…

OSM allows up to 1 request per second (see the usage policy). That means if you send it a ton of locations that will never geocode, this will take forever!

Tidygeocoder is great for geocoding because you can use OSM for free!

If necessary, install the package (install.packages('tidygeocoder'))

This part will take 1 second per row/entry, so don’t try this if you need to use R in the next few hours…
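
It is worth estimating the wait before you start. The sketch below assumes exactly one request per second and uses made-up locations; it also shows why removing duplicate locations first saves time.

```r
# Made-up profile locations with repeats
locations <- c("Liverpool", "hackney", "Liverpool", "Essex UK",
               "Liverpool", "hackney")

seconds_per_request <- 1  # assumed rate of one request per second

# Time if every row is geocoded versus only the unique locations
seconds_all    <- length(locations) * seconds_per_request
seconds_unique <- length(unique(locations)) * seconds_per_request

# The same arithmetic for a hypothetical 2,500 unique locations, in minutes
minutes_for_2500 <- 2500 * seconds_per_request / 60

seconds_all
seconds_unique
minutes_for_2500
```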

In full disclosure, I have this set to eval=FALSE in my Markdown document because this would take excessively long to run each time. I have some other hidden code (read.csv) that imports the saved .csv files, because otherwise the document wouldn’t knit.
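
One way to arrange this is to run the slow step only when a saved copy does not exist yet. This is a sketch of the general pattern, not the code used in this guide: `slow_geocode()` is a hypothetical stand-in for the real geocoding call, and the file goes in a temporary folder here only so the example does not clutter your working directory.

```r
# Hypothetical stand-in for a slow geocoder; returns empty coordinates
slow_geocode <- function(addresses) {
  data.frame(address = addresses,
             latitude = NA_real_,
             longitude = NA_real_,
             stringsAsFactors = FALSE)
}

# Where the saved results live (use your working directory in practice)
cache_file <- file.path(tempdir(), "etr_geocoded.csv")

if (file.exists(cache_file)) {
  # Fast path: reuse the results saved by an earlier run
  geocoded <- read.csv(cache_file, stringsAsFactors = FALSE)
} else {
  # Slow path: geocode once, then save the results for next time
  geocoded <- slow_geocode(c("Liverpool", "Edinburgh, Scotland"))
  write.csv(geocoded, cache_file, row.names = FALSE)
}

nrow(geocoded)
```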

library("tidygeocoder")
etr2geo2 <- geo(address = new_df$location, method = "osm", lat = latitude, long = longitude)

wine2geo2 <- geo(address = wnew_df$location, method = "osm", lat = latitude, long = longitude)

I’ve never tried it, but if you have postcodes or are looking for postcodes, the PostcodesioR package may work for the UK.

Let’s merge our original table with our new one that has locations for the profile

etrgeoall <- merge(etr2geo2, etrattr, by.x = "address", by.y = "location")
### removing duplicate statuses
etr_stid <- etrgeoall[!duplicated(etrgeoall$status_id), ]

### merging our latitude and longitude
etr_stid$lat2 <- coalesce(etr_stid$lat, etr_stid$latitude)
etr_stid$long <- coalesce(etr_stid$lng, etr_stid$longitude)
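
If coalesce() is new to you: it takes the first value that is not NA, position by position. For two vectors it behaves like the base R ifelse() below. The coordinates are made up, and the variable names are only for illustration.

```r
# Latitude from the tweet's own geotag (often missing)
lat_from_tweet <- c(53.4, NA, NA)

# Latitude from geocoding the profile location
lat_from_profile <- c(53.5, 55.9, NA)

# Prefer the tweet's own value, fall back to the profile's value
lat_combined <- ifelse(is.na(lat_from_tweet), lat_from_profile, lat_from_tweet)

lat_combined
```

The first element keeps the tweet's geotag, the second falls back to the profile, and the third stays NA because neither source has a value.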

We now have 2796 locations for “eat the rich”. Look through the locations listed for the profile locations. Some of them are clearly not in the UK…

Let’s see what those look like on the map:

maps::map("world", fill = TRUE, col = "#ffffff", lwd = 0.25, mar = c(0, 0, 0, 0))
with(etr_stid, points(long, lat2, pch = 20, col = "blue"))

Wow, this is super messy!!! We know they were supposed to be in the UK… that’s where we were pulling from.

Let’s install.packages("CoordinateCleaner"), as it can help clean this dataset in lots of ways (read about coordinate cleaner here).

library("CoordinateCleaner")

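## removing NAs from our new longitude (places that had no hope)
geoetr <- etr_stid[!is.na(etr_stid$long), ]

## adding a column for the iso3 country code that these points should fall in...
geoetr$GBR <- "GBR"

## testing to see if the points fall within GBR
cc_etr <- cc_coun(geoetr, lon = "long", lat = "lat2", iso3 = "GBR")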

Now let’s do the same operations for the wine data set

winegeoall <- merge(wine2geo2, wineattr, by.x = "address", by.y = "location")
### removing duplicate statuses
wine_stid <- winegeoall[!duplicated(winegeoall$status_id), ]

### merging our latitude and longitude
wine_stid$lat2 <- coalesce(wine_stid$lat, wine_stid$latitude)
wine_stid$long <- coalesce(wine_stid$lng, wine_stid$longitude)

## removing NAs from our new longitude (places that had no hope)
geowine <- wine_stid[!is.na(wine_stid$long), ]

## adding a column for the iso3 country code that these points should fall in...
geowine$GBR <- "GBR"

## testing to see if the points fall within GBR
cc_wine <- cc_coun(geowine, lon = "long", lat = "lat2", iso3 = "GBR")

Now in the end we have 5228 in our proxy for “rich people” and we have 1610 who want to eat the rich (these numbers vary based on how many were in the initial pull).

Normalization

Be sure that your data is normalized in some way.

Make sense of the data by converting raw numbers into ratio numbers:

  • People live in cities
  • People tweet where they have internet access
  • There are more tweets about ANYTHING in cities with more population

So, saying “there is a cluster of rich people in London” is pretty meaningless. We have to normalize by some measure–ideally something related to Twitter to get a sense of where there is more of a tweet/topic.

The advantage of looking at two phenomena is that you can normalize in the comparison between these

There are several different ways to normalize, each with caveats:

  • You could divide the number of tweets by some measure of population to get an estimate of how many people are tweeting in any one region.
  • This will under-estimate the frequency of tweets among populations unlikely to tweet (e.g. the elderly, children, areas without internet).
  • We do not have an estimate of how much Twitter activity there is altogether in each geographic area, so we cannot normalize by total tweets.
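
To see why raw counts mislead, here is a small sketch of per-capita normalization. The regions, tweet counts and populations are all made up.

```r
# Made-up counts: region A has far more tweets, but also far more people
regions <- data.frame(
  region     = c("A", "B"),
  tweets     = c(900, 90),
  population = c(9000000, 300000)
)

# Tweets per 100,000 residents
regions$tweets_per_100k <- regions$tweets / regions$population * 1e5

regions
```

Region A has ten times the tweets but only a third of region B's rate (10 versus 30 per 100,000).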

Try creating a grid

library(mapplots)
library("RColorBrewer")


cc_wine2 <- cc_wine
cc_wine2$one <- 1
byx = 0.5/2
byy = 0.25/2

ymin <- (min(cc_wine2$lat2)) - 1
ymax <- (max(cc_wine2$lat2)) + 1
xmin <- (min(cc_wine2$long)) - 0.5
xmax <- (max(cc_wine2$long)) + 0.5

xlim <- c(xmin, xmax)
ylim <- c(ymin, ymax)

grd <- make.grid(cc_wine2$long, cc_wine2$lat2, cc_wine2$one, byx, byy, xlim, ylim)
breaks <- breaks.grid(grd, quantile = 0.975, ncol = 6, zero = TRUE)
basemap(xlim, ylim, main = "iphone owners tweeting for wine", bg = "white")
draw.grid(grd, breaks, col = brewer.pal(n = 6, name = "PuRd"))
legend.grid("topright", breaks = breaks, col = brewer.pal(n = 6, name = "PuRd"), 
    type = 1, bg = NULL, inset = 0.02, title = "tweets")

Let’s do the same thing for eat the rich

cc_etr2 <- cc_etr
cc_etr2$one <- 1

# keep the minimums and the maximums, and the byx and byy from the wine map

grdetr <- make.grid(cc_etr2$long, cc_etr2$lat2, cc_etr2$one, byx, byy, xlim, ylim)
breaks <- breaks.grid(grdetr, quantile = 0.975, ncol = 6, zero = TRUE)
basemap(xlim, ylim, main = "tweets for 'eat the rich'", bg = "white")
draw.grid(grdetr, breaks, col = brewer.pal(n = 6, name = "Oranges"))
legend.grid("topright", breaks = breaks, col = brewer.pal(n = 6, name = "Oranges"), 
    type = 1, bg = NULL, inset = 0.02, title = "tweets")

Comparing the data

Compare the spatial distribution of the two phenomena.

You can use statistical measures (like a quadrat analysis as we did in practical A), or you could use a location quotient (LQ).

Simple normalization

etr_per_wine_grid <- grdetr/grd

breaks <- breaks.grid(etr_per_wine_grid, quantile = 0.975, ncol = 6, zero = TRUE)
basemap(xlim, ylim, main = "eat the rich per wino", bg = "white")
draw.grid(etr_per_wine_grid, breaks, col = brewer.pal(n = 6, name = "Greens"))
legend.grid("topright", breaks = breaks, col = brewer.pal(n = 6, name = "Greens"), 
    type = 1, bg = NULL, inset = 0.02, title = "tweets")
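
One thing to watch when dividing one grid by another is what happens in cells where the denominator is zero. The sketch below uses two made-up 2 x 2 count matrices in place of the real grids. Whether an empty cell is stored as 0 or NA depends on how the grid was built, so treat this as an illustration of the arithmetic only.

```r
# Made-up counts per cell (filled column by column)
etr_counts  <- matrix(c(4, 0, 2, 0), nrow = 2)
wine_counts <- matrix(c(8, 0, 0, 5), nrow = 2)

# Cell-by-cell ratio: 0/0 gives NaN and 2/0 gives Inf
ratio <- etr_counts / wine_counts

# Treat cells without a usable ratio as missing so they are not mapped
ratio_clean <- ratio
ratio_clean[!is.finite(ratio_clean)] <- NA

ratio
ratio_clean
```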

Try a Location quotient

Normally given as

\[LQ = \frac{e_i / \sum e}{(e_i + E_i) / (\sum e + \sum E)}\]

Because we don’t have an idea of what the total number of tweets for any one area is in this case, we will use the sum of wine+etr:

\[LQ = \frac{\text{etrcell} / \text{etrtotal}}{(\text{winecell} + \text{etrcell}) / (\text{winetotal} + \text{etrtotal})}\]

We’ll use our grid for each enumeration unit.

## Total number of points in the whole study area (the 'global' count). Note that
## this creates a constant (just one number) that we'll use a lot
etrtotal <- sum(cc_etr2$one)
winetotal <- sum(cc_wine2$one)
## A sum of the sums
totalbig <- etrtotal + winetotal


## A grid of ETR/etrtotal gives the proportion of ETR tweets out of the whole for
## any cell
propetr_grid <- grdetr/etrtotal

# A grid of wine + etr gives an idea of how many tweets per cell total
etr_plus_wine_grid <- grdetr + grd

# dividing etr+wine grid by the total big number gives our denominator
denominator <- etr_plus_wine_grid/totalbig

# putting it all together
lq <- propetr_grid/denominator



breaks <- breaks.grid(lq, quantile = 0.975, ncol = 6, zero = FALSE)
basemap(xlim, ylim, main = "LQ of eat the rich vs wino", bg = "white")
draw.grid(lq, breaks, col = brewer.pal(n = 6, name = "RdYlGn"))
legend.grid("topright", breaks = breaks, col = brewer.pal(n = 6, name = "RdYlGn"), 
    type = 1, bg = NULL, inset = 0.02, title = "LQ")

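Basically, the darkest green areas are where “eat the rich” is most over-represented. An LQ of 2 means that a cell’s share of the “eat the rich” tweets is twice its share of all the tweets (wine + “eat the rich”), while an LQ below 1 means “eat the rich” is under-represented there.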
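
A worked example of the same calculation with made-up counts for three cells may make the numbers easier to read. Plain vectors stand in for the grids, and the variable names are only for illustration.

```r
# Made-up tweet counts in three grid cells
etr_cell  <- c(10, 5, 5)
wine_cell <- c(10, 45, 25)

# Global totals (constants)
etr_total  <- sum(etr_cell)    # 20
wine_total <- sum(wine_cell)   # 80

# Numerator: each cell's share of all 'eat the rich' tweets
prop_etr <- etr_cell / etr_total

# Denominator: each cell's share of all tweets (wine + etr)
prop_all <- (etr_cell + wine_cell) / (etr_total + wine_total)

lq_toy <- prop_etr / prop_all
lq_toy
```

The first cell holds half of the “eat the rich” tweets but only a fifth of all tweets, so its LQ is 2.5. The second holds a quarter of the “eat the rich” tweets but half of all tweets, so its LQ is 0.5.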

For more information on measuring social media, see (Anselin & Williams, 2015): Anselin, L., & Williams, S. (2016). Digital neighborhoods. Journal of Urbanism: International Research on Placemaking and Urban Sustainability, 9(4), 305-328.

Mapping the data

Interactive/Leaflet map

If you have a lot of points (and I mean a lot), it might take a lot of time to load these on a leaflet map (as we did in practical A).

library(leaflet)
library("sf")

point_geo <- st_as_sf(cc_wine, coords = c(x = "long", y = "lat2"), remove = FALSE, 
    crs = 4326)

# leaflet(data = point_geo) %>% addTiles() %>% addMarkers()

leaflet(point_geo) %>% addProviderTiles(providers$CartoDB.DarkMatter, options = providerTileOptions(minZoom = 5, 
    maxZoom = 10)) %>% setView(xlong, ylat, zoom = 5) %>% addCircles(~long, ~lat2, 
    weight = 3, radius = 40, color = "#ffa500", stroke = TRUE, fillOpacity = 0.8)

Please note that if you are using a leaflet map, it is very important that you do not accidentally disclose someone’s location, because a user of your website/map can zoom in on any tweet and see the street address of where the tweet took place.
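
One simple way to reduce this risk is to coarsen the coordinates before mapping them. This is a sketch of one option, not a guarantee of anonymity. Rounding to one decimal place moves each point onto a grid of roughly 11 km in latitude. The coordinates below are made up, and the column names simply follow the ones used in this guide.

```r
# Made-up points using the same column names as the cleaned tweet data
pts <- data.frame(long = c(-1.548567, -0.127758),
                  lat2 = c(53.801277, 51.507351))

# Round to one decimal place so no point marks an exact address
pts$long_coarse <- round(pts$long, 1)
pts$lat_coarse  <- round(pts$lat2, 1)

pts
```

You would then map long_coarse and lat_coarse instead of the original columns. The minZoom/maxZoom options already used above also limit how far a reader can zoom in.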

Choropleth map

To make a choropleth map, we need a detailed basemap (shapefile) of all the counties of the UK. There are many available sources for basemaps, be sure that you cite whatever you use. You can use something from a package, or you can import a shape from a variety of sources.

Ideally you loaded these packages at the beginning of your document

library(curl)
library(sf)
library(rgdal)
library(tidyverse)
library(maps)
library(maptools)
library(ggplot2)

Read in a shapefile of all the counties in the UK from ArcGIS’s open server

# create a space for the file
temp <- tempfile()
# create a space to unzip the file
unzipped <- tempfile()
# this is where you'll get the file from
source <- "https://opendata.arcgis.com/datasets/43b324dc1da74f418261378a9a73227f_0.zip?outSR=%7B%22latestWkid%22%3A27700%2C%22wkid%22%3A27700%7D"
# now use curl to download the shapefile to the temp file
temp <- curl_download(url = source, destfile = temp, quiet = FALSE, mode = "wb")
# unzip the file into your new 'unzipped directory'
unzip(zipfile = temp, exdir = unzipped)

I particularly like this map from Colin Angus’ (“VictimOfMaths - overview,” 2021) 30 Day Map challenge.

  • Here is his github code.
  • I followed parts of his code and processed my file the same way.

# The actual shapefile has a different name each time you download it, so we need
# to fish the name out of the unzipped folder (only files ending in .shp)
name <- list.files(unzipped, pattern = "\\.shp$")
shapefile <- st_read(file.path(unzipped, name)) %>% rename(code = ctyua19cd) %>% 
    # Remove Northern Ireland
filter(substr(code, 1, 1) != "N")
Reading layer `Counties_and_Unitary_Authorities_(December_2019)_Boundaries_UK_BFC' from data source `/private/var/folders/0g/zl3lc9g551zdm0g4t1bnj_2w0000gq/T/Rtmp3XGvIu/file45925c14f3eb/Counties_and_Unitary_Authorities_(December_2019)_Boundaries_UK_BFC.shp' using driver `ESRI Shapefile'
Simple feature collection with 216 features and 10 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -116.1928 ymin: 5337.901 xmax: 655653.8 ymax: 1220302
projected CRS:  OSGB 1936 / British National Grid

If your data is relevant to Northern Ireland, you may want to keep it in…

Let’s check the contents of this simple feature object:

shapefile
Simple feature collection with 205 features and 10 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 5512.998 ymin: 5337.901 xmax: 655653.8 ymax: 1220302
projected CRS:  OSGB 1936 / British National Grid
First 10 features:
  objectid      code            ctyua19nm ctyua19nmw  bng_e  bng_n     long
1        1 E06000001           Hartlepool       <NA> 447160 531474 -1.27018
2        2 E06000002        Middlesbrough       <NA> 451141 516887 -1.21099
3        3 E06000003 Redcar and Cleveland       <NA> 464361 519597 -1.00608
4        4 E06000004     Stockton-on-Tees       <NA> 444940 518183 -1.30664
5        5 E06000005           Darlington       <NA> 428029 515648 -1.56835
6        6 E06000006               Halton       <NA> 354246 382146 -2.68853
       lat st_areasha st_lengths                       geometry
1 54.67614   93712620   71011.93 MULTIPOLYGON (((447213.9 53...
2 54.54467   53881564   44481.69 MULTIPOLYGON (((448609.9 52...
3 54.56752  245069509   96703.99 MULTIPOLYGON (((455932.3 52...
4 54.55691  204932954  123408.99 MULTIPOLYGON (((444157 5279...
5 54.53534  197475689  107206.40 MULTIPOLYGON (((423496.6 52...
6 53.33424   79084035   77771.10 MULTIPOLYGON (((358374.7 38...
 [ reached 'max' / getOption("max.print") -- omitted 4 rows ]

The short report printed gives the geometry type, mentions that there are 205 features (records, represented as rows) and 10 fields (attributes, represented as columns). Each row is a simple feature: a single record, or data.frame row, consisting of attributes and geometry.

Note: there would be 216, but I removed Northern Ireland.

For each row a single simple feature geometry is given. The above printed output shows that geometries are printed in abbreviated form, but we can view a complete geometry by selecting it, e.g. the first one by:

ukgeom <- st_geometry(shapefile)
ukgeom[[1]]

Please note, you would not include these sorts of checks in your summative document. I’m just showing it here so that you have an idea of what this file should look like in case there is a problem with yours.

The way this is printed is called well-known text, and is part of the standards. The word MULTIPOLYGON is followed by three parentheses, because it can consist of multiple polygons, in the form of MULTIPOLYGON(POL1,POL2), where POL1 might consist of an exterior ring and zero or more interior rings, as in (EXT1,HOLE1,HOLE2).

(for more information, see the Simple Features vignette)

We need to convert the coordinates into WGS84 to match the Twitter coordinates.

ukgeom_wgs84 <- shapefile %>%
  st_buffer(0) %>% # Make invalid geometries valid
  st_transform(crs = 4326) %>% # Convert coordinates to WGS84
  mutate(id = 1:nrow(.)) # Add column with id to each county

Modifying the Twitter dataset

Just as we did when making a grid with our data, we’re adding a column of ones, and a column (tw) to indicate what it was a tweet for. Then we combine the twitter keywords into one spatial feature.

## Add columns to our twitter files to indicate what they are
cc_wineB <- cc_wine
cc_wineB$tw <- "wine"
cc_wineB$one <- 1

cc_etrB <- cc_etr
cc_etrB$tw <- "etr"
cc_etrB$one <- 1

## combine our twitter points into one file with a column (tw) that shows what it
## was a tweet for
alltws <- rbind(cc_etrB, cc_wineB)

# Convert to simple feature object with the WGS84 projection
alltws_sf <- st_as_sf(alltws, coords = c("long", "lat2"), crs = 4326)

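Merge SF objects

In the next step we merge the simple feature objects. The function sf::st_join can be used to do a spatial left or inner join (see the documentation, ?st_join, on how to carry out an inner join). Here I use st_within() instead, which returns a matrix recording which points fall within which polygons.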

# this one takes a while as it creates a large matrix of these results
result_all <- st_within(alltws_sf, ukgeom_wgs84, sparse = FALSE)

# filter for the points with 'etr' in the tw column
point_etr_sf <- alltws_sf %>% filter(tw %in% "etr")

result_etr <- st_within(point_etr_sf, ukgeom_wgs84, sparse = FALSE)

You will get the warning “although coordinates are longitude/latitude, st_within assumes that they are planar.” You can ignore this. It would matter if you were mapping the whole world and expecting the arctic regions to matter…

Now we need to count how many of the points (from Twitter) in the matrix we created above fall within each of the counties in the shapefile.

The mutate command will count how many tweets are in each polygon that have etr (the result_etr from above) and total, and calculate this as a percent

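ukgeom_wgs84 <- ukgeom_wgs84 %>%
  mutate(Count_all = apply(result_all, 2, sum), # tweets of either keyword per county
         Count_etr = apply(result_etr, 2, sum)) %>% # 'etr' tweets per county
  # Calculate the percentage; counties with no tweets get 0 instead of dividing by zero
  mutate(Percent = ifelse(Count_all == 0, 0, Count_etr / Count_all * 100))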
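The counting logic is easier to follow on a tiny matrix. This is a sketch with made-up values, using only base R; colSums() gives the same result as apply(x, 2, sum), and the guard on count_all is one possible way to avoid dividing by zero in counties without any tweets.

```r
# Hypothetical logical matrix: rows are four tweets, columns are three counties
toy_all <- matrix(c(TRUE,  FALSE, FALSE,
                    TRUE,  FALSE, FALSE,
                    FALSE, TRUE,  FALSE,
                    FALSE, TRUE,  FALSE),
                  ncol = 3, byrow = TRUE)

# Pretend only the first tweet matched "etr"
toy_etr <- toy_all[1, , drop = FALSE]

count_all <- colSums(toy_all)  # same result as apply(toy_all, 2, sum)
count_etr <- colSums(toy_etr)

# Counties with no tweets at all get 0 instead of 0 / 0
percent <- ifelse(count_all == 0, 0, count_etr / count_all * 100)
percent
```

If you would rather show counties with no tweets as missing, swap the 0 for NA_real_; the maps further down draw NA values as transparent.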


# Copy the counts and drop the geometry column so uk_df is a plain data frame
uk_df <- ukgeom_wgs84
st_geometry(uk_df) <- NULL


# a full join will combine our original shapefile with this new dataframe we've
# created
map.data <- full_join(shapefile, uk_df, by = "code")
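As a quick illustration of what a full join does, here is a sketch with two made-up tables. The names toy_shapes and toy_counts and their values are hypothetical; only the key column code matches the code above.

```r
library(dplyr)

# Hypothetical stand-ins: toy_shapes plays the shapefile, toy_counts plays uk_df
toy_shapes <- data.frame(code = c("A", "B", "C"),
                         name = c("Alpha", "Beta", "Gamma"))
toy_counts <- data.frame(code = c("A", "B"),
                         Percent = c(50, 0))

# full_join keeps every row from both tables; rows without a match get NA
toy_joined <- full_join(toy_shapes, toy_counts, by = "code")
toy_joined
```

County C has no row in toy_counts, so its Percent is NA. It is worth checking for unexpected NA values after a join, as they usually mean the key columns do not match exactly.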

The shapefile that we’ve been using is very large (67.4 MB), and very detailed. It has more detail than we need (e.g. the shape of the coastline and such). Mapping this file would take a very long time, so let’s simplify the lines and reduce the complexity of the polygons.

If you load library(pryr), you can calculate the size of any object, which is very helpful if something seems to be taking an insanely long time. For example, I used object_size(map.data) to discover that our map.data file is huge!
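If you do not have pryr installed, base R offers object.size(), which reports a similar figure. The sketch below uses a made-up vector called big rather than map.data.

```r
# A hypothetical large-ish object: one million random numbers
big <- runif(1e6)

# object.size() ships with base R (utils package), so nothing needs installing
size_bytes <- object.size(big)

# Print the size in megabytes instead of bytes
format(size_bytes, units = "MB")
```

The two functions can report slightly different numbers for the same object, so treat the result as a rough guide.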

We can simplify the lines used in map.data so that the map renders in a reasonable amount of time (the difference will hardly be visible at the scale we’re using).

If we were doing spatial analysis that depended on the spatial relationships between polygons, it would be important to preserve the topology, but we are not. So I’ve set preserveTopology to FALSE, which lets the simplification be more aggressive and reduces the file size further. The trade-off is that the simplified polygons are not guaranteed to remain valid.

# newfile <- st_simplify(bigfile, preserveTopology, dTolerance = distance_in_metres)
mdsimple <- st_simplify(map.data, preserveTopology = FALSE, dTolerance = 100)
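To get a feel for how much a 100 metre tolerance removes, here is a sketch on a made-up shape: a 10 km circle with far more vertices than it needs. The objects centre, detailed and simple are hypothetical and stand in for map.data and mdsimple.

```r
library(sf)

# A hypothetical detailed polygon: a 10 km circle in British National Grid (metres)
centre <- st_sfc(st_point(c(400000, 500000)), crs = 27700)
detailed <- st_buffer(centre, dist = 10000, nQuadSegs = 90)

# dTolerance is in the units of the CRS, so 100 means 100 metres here
simple <- st_simplify(detailed, preserveTopology = FALSE, dTolerance = 100)

# Compare the number of vertices before and after
nrow(st_coordinates(detailed))
nrow(st_coordinates(simple))
```

Naming dTolerance explicitly makes it clear which number is the distance. If the simplified map looks too blocky, try a smaller tolerance.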

As this file is still in OSGB 1936 / British National Grid, we do not need to reproject it or worry about the projection. We can always check a vector object’s CRS with st_crs().
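A minimal sketch of that check, using a single made-up point (pt) in British National Grid rather than the full map data:

```r
library(sf)

# A hypothetical point in British National Grid (EPSG:27700)
pt <- st_sfc(st_point(c(400000, 500000)), crs = 27700)

st_crs(pt)$epsg    # the EPSG code of the object's CRS
st_is_longlat(pt)  # FALSE for a projected CRS such as British National Grid

# Reproject only when you actually need longitude/latitude
pt_wgs84 <- st_transform(pt, crs = 4326)
st_crs(pt_wgs84)$epsg
```

Running st_crs() on two layers before combining them is a quick way to confirm that they share a CRS.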

Finally, let’s make a simple map

m1 <- ggplot() + geom_sf(data = mdsimple, aes(fill = Percent))
m1

If I wanted to save this map to my hard disk and export it as a .jpg (or .png, or .pdf) at this point I would add ggsave("m1.jpg", m1) (or ggsave("m1.pdf",m1) or whatever format I wanted it in)
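ggsave() also lets you set the size and resolution of the exported file. This sketch uses a made-up scatterplot (p) in place of m1 and writes to a temporary folder; the width, height and dpi values are only examples.

```r
library(ggplot2)

# A hypothetical stand-in plot; in the guide this would be m1
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Written to a temporary folder here; use a path in your working directory instead
out_file <- file.path(tempdir(), "m1.png")

# width and height are in the given units; dpi controls the resolution
ggsave(out_file, plot = p, width = 6, height = 8, units = "in", dpi = 300)

file.exists(out_file)
```

The file format is picked from the extension, so changing "m1.png" to "m1.pdf" is enough to export a PDF.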

I want to improve the aesthetics of this map though…

In particular, I want to get rid of the background with graticules.

m2 <- ggplot() +
  geom_sf(data = mdsimple, aes(fill = Percent), colour=NA)+ # colour=NA removes the county outlines
  theme_classic()+ #this theme has a blank background
  theme(axis.line=element_blank(), axis.ticks=element_blank(), #gets rid of the axis
        axis.text=element_blank(), axis.title=element_blank())

#and then I can call the map so it prints here in rmarkdown
m2
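A shorter route to a similar result is ggplot2's built-in theme_void(), which removes the axes, gridlines and background in one go. This is an alternative to the theme code above, not what the guide uses, and the sketch relies on a made-up scatterplot in place of the map layer.

```r
library(ggplot2)

# A hypothetical stand-in plot; in the guide the layer would be geom_sf()
p <- ggplot(mtcars, aes(wt, mpg, colour = hp)) +
  geom_point() +
  theme_void()  # drops axes, gridlines and background but keeps the legend

p
```

Spelling the elements out with theme() is longer, but it makes it easier to restyle individual pieces later, which is what the following maps do.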

Well, that’s much cleaner, but I don’t love the colours. I’d also like to give my legend a better name, to make it clear what this is a percentage of…

You can pick out a better colour gradient/scheme from the paletteer package.

If there is one you like, be sure to install the paletteer package, along with the package that supplies the palette (gameofthrones in this example).

The “direction=-1” is commented out because it would invert the colour scale (so that red is high and yellow is low)

library(paletteer)
library(gameofthrones)
m3 <- ggplot() +
  geom_sf(data = mdsimple, aes(fill = Percent), colour=NA)+ 
  paletteer::scale_fill_paletteer_c("gameofthrones::lannister", #direction=-1,
                         name="Eat the rich\nPercent",
                         na.value="transparent")+
  theme_classic()+ #this theme has a blank background
  theme(axis.line=element_blank(), axis.ticks=element_blank(), #gets rid of the axis
        axis.text=element_blank(), axis.title=element_blank())

#and then I can call the map so it prints here in rmarkdown
m3

Ok, now I’m happy with the colours, but I’d like to change the background to be dark gray, and then I need to also make the text on the legend a lighter colour of gray.

m4 <- ggplot() +
  geom_sf(data = mdsimple, aes(fill = Percent), colour=NA)+ 
  paletteer::scale_fill_paletteer_c("gameofthrones::lannister", #direction=-1,
                         name="Eat the rich\nPercent",
                         na.value="transparent")+
  theme_classic()+ 
  theme(axis.line=element_blank(), axis.ticks=element_blank(), 
        axis.text=element_blank(), axis.title=element_blank(),
        plot.background=element_rect(fill="#252525", colour = NA),
        panel.background=element_rect(fill="#252525", colour = NA),
        legend.background=element_rect(fill="#252525"),
        legend.text=element_text(colour="#c6dbef"),
        legend.title=element_text(colour="#c6dbef"))

#and then I can call the map so it prints here in rmarkdown
m4

This is almost perfect. I just want to add in my title and move the legend…

m4 <- ggplot() +
  geom_sf(data = mdsimple, aes(fill = Percent), colour=NA)+ 
  paletteer::scale_fill_paletteer_c("gameofthrones::lannister", #direction=-1,
                         name="Eat the rich\nPercent",
                         na.value="transparent")+
  theme_classic()+ 
  theme(axis.line=element_blank(), axis.ticks=element_blank(), 
        axis.text=element_blank(), axis.title=element_blank(),
        plot.background=element_rect(fill="#252525", colour = NA),
        panel.background=element_rect(fill="#252525", colour = NA),
        legend.background=element_rect(fill=NA),
        legend.text=element_text(colour="#c6dbef", size = 8),
        legend.title=element_text(colour="#c6dbef", size = 8),
        legend.key.width= unit(8, 'pt'),
        plot.title=element_text(colour="#c6dbef", face="bold", size=18),
        plot.subtitle=element_text(colour="#c6dbef", size=10),
        plot.caption=element_text(colour="#c6dbef", size=8, hjust=1))+
  labs(title="Eat The Rich",
       subtitle="A demographic proxy for class angst\nand wealth across Great Britain",
       caption="Map by M.Stephens")

#preview the map
m4

I would not recommend showing the iterative steps towards developing a map in your summative. I’ve included them here to illustrate how these lines of code work together.

Conclusions

In your summative it’s important that you conclude your research.
This is a good place to:

  • Describe any patterns, clusters, or trends you identified.
  • Summarize your findings.
    • Reflect on your hypothesis.
    • Explain why these distributions occurred as they did.
  • Recommend further research that would help to better understand this phenomenon.
  • Evaluate this process with a brief reflection on using Twitter in human geography research.

References List

[1] VictimOfMaths - Overview. 2021. <URL: https://github.com/VictimOfMaths>.

[2] How “Eat the Rich” Became the Rallying Cry for the Digital Generation. 2019. <URL: https://www.gq.com/story/eat-the-rich-digital-generation>.

[3] L. Anselin and S. Williams. “Digital neighborhoods”. In: Journal of Urbanism: International Research on Placemaking and Urban Sustainability 9.4 (Sep. 2015), pp. 305-328. DOI: 10.1080/17549175.2015.1080752. <URL: https://doi.org/10.1080/17549175.2015.1080752>.

[4] M. Graham. floatingsheep. 2015. <URL: http://www.floatingsheep.org/>.

[5] O. Titorchuk. Breaking Down Geocoding in R: A Complete Guide. 2020. <URL: https://towardsdatascience.com/breaking-down-geocoding-in-r-a-complete-guide-1d0f8acd0d4b>.

[1] J. Barnier. rmdformats: HTML Output Formats and Templates for ‘rmarkdown’ Documents. R package version 1.0.1. 2021. <URL: https://CRAN.R-project.org/package=rmdformats>.

[2] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.12. 2021. <URL: https://CRAN.R-project.org/package=knitcitations>.

[3] R. Francois. bibtex: Bibtex Parser. R package version 0.4.2.3. 2020. <URL: https://CRAN.R-project.org/package=bibtex>.

[4] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2020. <URL: https://www.R-project.org/>.

[5] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. <URL: https://yihui.org/knitr/>.

[6] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. <URL: http://www.crcpress.com/product/isbn/9781466561595>.

[7] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.30. 2020. <URL: https://yihui.org/knitr/>.

Submitting your website

The ‘Readme.doc’

One of the deliverables is the ‘readme’, which is a plain text document. You can prepare it in Microsoft Word or whatever text editor you prefer (you can write it in markdown if you want). This document should include the following details:

  • Your name
  • The title of your project
  • The name of the .html file you uploaded to the Duo dropbox
    • This should be givenname_surname.html
  • A brief description of your project
  • A brief reflection on your experience producing it and any challenges you faced.

This document needs to be between 20 and 200 words and needs to be uploaded to the Turnitin summative assignment submission point on Duo.

The .html submission

As you are producing your document in Rmarkdown, be sure to knit often, and check the resulting .html file.

You will only be uploading the .html file and not the .rmd file. Save that file as it may be helpful for you in the future.

Once you have finished your project and knitted a final time, rename the .html file to givenname_surname.html before you submit it, where:

  • givenname is your first name
  • surname is your last name
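If you prefer to do the renaming from R, here is a minimal sketch. The file name summative.html and the temporary folder are made up so the example runs on its own; swap in the name of your own knitted file, your working directory, and your own name.

```r
# Hypothetical folder and file names; replace them with your own
knit_dir <- tempdir()  # stands in for your working directory
knitted <- file.path(knit_dir, "summative.html")
writeLines("<html></html>", knitted)  # placeholder file so this sketch runs on its own

final <- file.path(knit_dir, "givenname_surname.html")

# Copy rather than rename, so the original knitted file is left untouched
file.copy(knitted, final, overwrite = TRUE)
file.exists(final)
```

Open the copied file in a browser before uploading to make sure it is the version you expect.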

Upload this .html document to the special dropbox on Duo for our practical. Do not submit it to Turnitin; only the readme is submitted there.