This is an exploration of the MapChi tools for geocoding data. The goal is to get ZIP code and census tract values for the addresses in the data.
The result: we can get a ZIP code for about half of the addresses, but the geocoder does not return a census tract value at all.
# install.packages("devtools")
# library(devtools)
# install_github("dmwelgus/MapChi")
library(tidyverse)
library(MapChi)
I downloaded test data from the Census Bureau that shows how their geocoder works. Note that the original file does not have a header row, so I added column names on import so I could see what they are. You do NOT want a header row in the data you send through census_geo().
census_example <- read_csv(
"data-raw/test-addresses.csv",
col_names = c("id", "address", "city", "state", "zip"),
col_types = cols(zip = col_character())
)
census_example
IIRC, at some point I tested the geocoder without a zip column (because we don’t have ZIP codes in the gun violence data) and it would not process properly. You have to at least have a blank column there.
This runs the MapChi census_geo function on the census bureau test file to see the expected return. Note that it does NOT return a census tract, so this won’t be able to give us that.
test_geocoded <- census_geo("data-raw/test-addresses.csv")
test_geocoded %>% names()
## [1] "id" "o_address" "o_city" "o_state" "o_zip"
## [6] "status" "quality" "m_address" "m_city" "m_state"
## [11] "m_zip" "long" "lat" "not_sure" "L_R"
test_geocoded
Note the first column in that output, which holds all of the values run together. I can’t remove it through select(). I ended up removing it later by writing the data frame to csv and then reimporting it, which I do below for the TX gun violence data.
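One possible explanation, not confirmed here, is that the odd first column is really a set of row names rather than a true column, which would be why select() cannot touch it. The sketch below uses a made-up data frame (the name `toy` and its values are illustrative, not MapChi output) to show how resetting row names would remove such a "column" without the csv round trip.

```r
# Toy stand-in for geocoder output; the long strings are row names, not a column.
toy <- data.frame(
  id = c(1, 2),
  m_zip = c("78701", "77002"),
  stringsAsFactors = FALSE
)
rownames(toy) <- c("1,100 Main St,Austin,TX", "2,200 Elm St,Houston,TX")

# Printing toy now shows the concatenated strings on the left,
# but names(toy) lists only the real columns.
real_columns <- names(toy)

# Resetting the row names restores the default 1, 2, ... labels.
rownames(toy) <- NULL
reset_rownames <- rownames(toy)
```

If `names(test_geocoded)` lists only the 15 fields while the printout shows an extra leading column, this is worth trying before falling back to the write-and-reimport workaround.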
Imports the cleaned data.
tx <- read_rds("data-out/01_tx.rds")
The census_geo() function expects a csv file with specific columns, so we create that here and write it out.
tx %>%
mutate(
zip = "",
state = "TX"
) %>%
select(
id,
address,
city,
state,
zip
) %>%
write_csv("data-out/02_tx_addresses.csv", col_names = FALSE)
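Since the geocoder is picky about the file layout, a quick sanity check on the written file can save a failed run. This is a minimal sketch on made-up addresses written to a temporary file (the `toy` data and `path` are illustrative, not the real tx data); it reads the file back without column names and confirms there are five fields and no header line.

```r
library(readr)

# Made-up rows in the same five-column layout: id, address, city, state, zip.
toy <- data.frame(
  id = c(1, 2),
  address = c("100 Main St", "200 Elm St"),
  city = c("Austin", "Houston"),
  state = "TX",
  zip = "",
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")
write_csv(toy, path, col_names = FALSE)

# Read it back as plain text columns with no header expected.
back <- read_csv(
  path,
  col_names = FALSE,
  col_types = cols(.default = col_character())
)

n_fields <- ncol(back)   # should be 5, including the blank zip column
n_rows <- nrow(back)     # should equal the number of addresses (no header line)
```

If `n_rows` comes back one larger than the number of addresses, a header line was written by mistake.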
Runs the geocoder.
tx_addresses <- census_geo("data-out/02_tx_addresses.csv")
The resulting data frame has a weird first column, a concatenation of all the fields, that I can’t remove through select(), so I’m writing it out to csv and then reimporting it.
# export
tx_addresses %>%
write_csv("data-out/02_tx_addresses_geo.csv")
# import
tx_geo <- read_csv("data-out/02_tx_addresses_geo.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## o_address = col_character(),
## o_city = col_character(),
## o_state = col_character(),
## o_zip = col_logical(),
## status = col_character(),
## quality = col_character(),
## m_address = col_character(),
## m_city = col_character(),
## m_state = col_character(),
## m_zip = col_double(),
## long = col_double(),
## lat = col_double(),
## not_sure = col_double(),
## L_R = col_character()
## )
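The column specification above shows m_zip guessed as a double and o_zip as a logical, which is why m_zip gets converted to character later on. An alternative is to pin those types at import. This is a minimal sketch on a small made-up file (the file contents and the `geo` name are illustrative), using the same `col_types` approach already used for the census example.

```r
library(readr)

# Made-up stand-in for the geocoded csv: o_zip is blank, m_zip is sometimes missing.
path <- tempfile(fileext = ".csv")
writeLines(c("id,o_zip,m_zip", "1,,78701", "2,,"), path)

# Declare the ZIP columns as character so they are not guessed as numbers or logicals.
geo <- read_csv(
  path,
  col_types = cols(
    id = col_double(),
    o_zip = col_character(),
    m_zip = col_character()
  )
)
```

Reading ZIP codes as text also protects leading zeros, which does not matter for Texas but would for other states.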
How did the geocoder fare?
tx_geo %>%
count(status, quality)
We got 92 great records and 40 good ones. 138 were not geocoded.
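To put those counts in terms of a match rate, here is the arithmetic using only the three numbers reported above (the variable names are my own).

```r
# Counts reported above
great <- 92
good <- 40
not_geocoded <- 138

matched <- great + good              # 132
total <- matched + not_geocoded      # 270
match_rate <- matched / total
round(match_rate, 3)                 # 0.489
```

That is just under half of the records, which lines up with the "about half" figure for ZIP codes.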
At least the zips we did get start with 7 as they should. 138 records do not have zip codes.
tx_geo %>%
count(m_zip)
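Rather than eyeballing the counts, the "starts with 7" check can be made explicit. This is a minimal sketch on a made-up vector (the values are illustrative); it assumes m_zip was parsed as a number, as in the column specification, so it is converted to text first.

```r
# Made-up ZIP values, including one missing, stored as numbers.
m_zip <- c(78701, 77002, NA, 75201)

# Convert to text and test the first character; NA values come back FALSE.
zip_chr <- as.character(m_zip)
starts_with_7 <- grepl("^7", zip_chr)

n_missing <- sum(is.na(m_zip))
all_present_start_with_7 <- all(starts_with_7[!is.na(m_zip)])
```

On the real data, something like `tx_geo %>% count(starts_7 = grepl("^7", as.character(m_zip)))` should give the same breakdown in one step.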
This lets us compare the address in our data with the matched address the geocoder used.
tx_geo %>%
filter(!is.na(quality)) %>%
arrange(desc(quality)) %>%
select(quality, o_address, m_address)
We join our geocoded fields back to the original data in case we want to use them later.
Prepare the geocoded data frame so it has only the columns we need for the join.
tx_geo_2join <- tx_geo %>%
arrange(id) %>%
select(
id, m_address, m_city, m_zip, lat, long
) %>%
mutate(m_zip = as.character(m_zip))
Join them.
tx_joined <- left_join(tx, tx_geo_2join)
## Joining, by = "id"
tx_joined %>% write_rds("data-out/02_tx_joined.rds")
tx_joined %>% write_csv("data-out/02_tx_joined.csv")
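A left join can silently add rows if the right-hand table has duplicate ids, so it is worth checking the result. This is a minimal sketch on made-up tables (`incidents` and `geo` are illustrative stand-ins for tx and tx_geo_2join) that names the join key explicitly and confirms the row count is unchanged.

```r
library(dplyr)

# Made-up stand-ins: three incidents, two of which were geocoded.
incidents <- tibble(id = c(1, 2, 3), city = c("Austin", "Houston", "Dallas"))
geo <- tibble(id = c(1, 3), m_zip = c("78701", "75201"))

# Naming the key avoids relying on dplyr guessing the shared column.
joined <- left_join(incidents, geo, by = "id")

same_row_count <- nrow(joined) == nrow(incidents)   # TRUE when geo has no duplicate ids
n_without_zip <- sum(is.na(joined$m_zip))           # incidents with no geocoded match
```

Applied to the real data, `n_without_zip` should agree with the number of records the geocoder could not match.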