I'm working on a class project using a Chicago crime data set and R. One attribute in the data set is Block, which contains the partial address where each incident occurred. For example:

+--------------------------+
|           Block          |
+--------------------------+
|  45xx N Locust Grove St  |
|   65xx Hawthorne Ave     |
+--------------------------+

The values in Block vary in length, but I want to create a new variable holding the street type (St, Ave, Blvd, etc.). I tried the separate function from tidyr:

df <- df %>%
  separate(Block, into = c("partial.address", "type"),
           sep = " ", extra = "merge", fill = "left")

However, this puts the number, 45xx, in partial.address and the rest of the string in type. How can I extract the street type from the address?

I'm hoping to get something like this as output:

+--------------------------+-------------+
|     partial.address      |     type    |
+--------------------------+-------------+
|  45xx N Locust Grove     |      St     |
|   65xx Hawthorne         |     Ave     |
+--------------------------+-------------+

1 Answer

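You can use extract:

tidyr::extract(df, Block, c("partial.address", "type"), "(.*)\\s+(St|Ave)$")

#      partial.address type
#1 45xx N Locust Grove   St
#2      65xx Hawthorne  Ave

The \\s+ keeps the separating space out of partial.address, and $ ties the street type to the end of the string.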

Or using stringr:

library(dplyr)
library(stringr)

df %>%
  mutate(type = str_extract(Block, '(St|Ave)$'),
         # drop the type together with the space in front of it
         partial.address = str_remove(Block, '\\s+(St|Ave)$'))

If your data has other street types, add them as alternatives in (St|Ave), for example (St|Ave|Blvd|Rd).
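With many street types it can be easier to build the pattern from a vector. The following is only a sketch: the `types` vector and the third example address are made up for illustration, and `ignore_case = TRUE` is an assumption in case the real data mixes upper and lower case.

```r
library(dplyr)
library(stringr)

# Sample data; the third row is a hypothetical address added for illustration
df <- data.frame(Block = c("45xx N Locust Grove St", "65xx Hawthorne Ave",
                           "12xx W State Blvd"))

# Hypothetical list of street types; extend it to match the data
types <- c("St", "Ave", "Blvd", "Rd")

# \\b(...)$ matches a type only as a whole word at the end of the string,
# so the "St" inside "State" is not picked up
pattern <- regex(paste0("\\b(", paste(types, collapse = "|"), ")$"),
                 ignore_case = TRUE)

out <- df %>%
  mutate(type = str_extract(Block, pattern),
         # remove the matched type, then trim the space left behind
         partial.address = str_trim(str_remove(Block, pattern)))
```

Rows whose last word is not in `types` get NA in type, which makes it easy to spot street types that are still missing from the list.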


If we want to capture the last word of each Block, we can use:

df %>%
  mutate(type = str_extract(Block, '\\w+$'),
         # remove the last word and any space in front of it
         partial.address = str_remove(Block, '\\s*\\w+$'))
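The last word can also be taken by splitting on whitespace and picking the final piece. This is a minimal base R sketch, assuming every Block is a non-empty string; the helper variable `parts` is introduced here only for illustration.

```r
# Sample data matching the question
df <- data.frame(Block = c("45xx N Locust Grove St", "65xx Hawthorne Ave"))

# Split each address on runs of whitespace (after trimming the ends)
parts <- strsplit(trimws(df$Block), "\\s+")

# Last piece is the street type
df$type <- vapply(parts, function(p) p[length(p)], character(1))

# Everything before the last piece, glued back together with single spaces
df$partial.address <- vapply(parts,
                             function(p) paste(head(p, -1), collapse = " "),
                             character(1))
```

Unlike the regex versions, this treats whatever comes last as the type, so a Block that ends in something other than a street type will still fill the column.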

data

df <- structure(list(Block = c("45xx N Locust Grove St", "65xx Hawthorne Ave"
)), class = "data.frame", row.names = c(NA, -2L))

3 Comments

Is there not a way to split on the white space and select the index for the last position to fill the variable type? The data set is 15,000 records and there are several street types.
@miguelf88 See updated answer. It selects the last word in each Block.
Thanks @RonakShah. Using the str_extract with '\\w+$' worked. I tried using '(St|Ave|Blvd|Rd)' but the column was filled with NA values.
