I have a dataframe with over 20k observations. One of the columns holds city names (df$city). There are over 600 unique city names, and some of them are misspelled.
Example of my dataframe:
> df$city
[1] "BOSTN" "LOS ANGELOS" "NYC" "CHICAGOO"
[5] "SEATTLE" "BOSTON" "NEW YORK CITY"
I created a CSV file that lists every misspelled city name alongside its correct spelling.
> head(city)
           city city_incorrect
1        BOSTON          BOSTN
2   LOS ANGELES    LOS ANGELOS
3 NEW YORK CITY            NYC
4       CHICAGO       CHICAGOO
Ideally I would write code that replaces values in df$city based on the "city.csv" file.
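To make the goal concrete, here is a minimal base R sketch of the kind of in-place replacement I mean, using match() as a lookup. The small data frames are toy stand-ins built from the examples above (the real `city` would come from something like read.csv("city.csv")), and the helper names `idx` and `hit` are just for illustration.

```r
# Toy data mirroring the example in the question
df <- data.frame(
  id   = 1:7,
  city = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO",
           "SEATTLE", "BOSTON", "NEW YORK CITY"),
  stringsAsFactors = FALSE
)

# Stand-in for the lookup table read from "city.csv"
city <- data.frame(
  city           = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"),
  stringsAsFactors = FALSE
)

# Position of each df$city in the list of misspellings (NA if not listed)
idx <- match(df$city, city$city_incorrect)

# Overwrite only the rows that matched; all other rows are left untouched
hit <- !is.na(idx)
df$city[hit] <- city$city[idx[hit]]
```

This only needs the misspellings in the lookup file, so the 600 correctly spelled cities would not have to be listed.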
Note: When I originally posted this question, someone suggested I use merge. I don't think that is the most efficient way to solve my problem, because I would also have to include the 600 correctly spelled cities in my "city.csv" file, or I'd need an additional step that combines the two columns from the merged dataframe. So I think it's probably easier to just REPLACE values in df$city based on "city.csv".
EDIT: Here's a more detailed look at my dataframe
> df[1:5]
id  owner  city           state
1   AAAAA  BOSTN          MA
2   BBBBB  LOS ANGELOS    CA
3   CCCCC  NYC            NY
4   DDDDD  CHICAGOO       IL
5   EEEEE  BOSTON         MA
6   FFFFF  SEATTLE        WA
7   GGGGG  NEW YORK CITY  NY
8   HHHHH  LOS ANGELES    CA
If I use merge or cbind, won't it just create another column at the end of my dataframe, like this:
> merge()
id  owner  city           state  city_correct
1   AAAAA  BOSTN          MA     BOSTON
2   BBBBB  LOS ANGELOS    CA     LOS ANGELES
3   CCCCC  NYC            NY     NEW YORK CITY
4   DDDDD  CHICAGOO       IL     CHICAGO
5   EEEEE  BOSTON         MA
6   FFFFF  SEATTLE        WA
7   GGGGG  NEW YORK CITY  NY
8   HHHHH  LOS ANGELES    CA
So the misspelled cities get corrected in the new column, but that column is empty for the cities that were already spelled correctly. What I want in the end is one column that has all the corrected city names.
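For reference, the extra step after a merge might look something like the base R sketch below. It assumes a toy copy of the data above and a hypothetical `lookup` table whose columns are renamed so the misspelling is the join key; merge() with all.x = TRUE keeps every row of df, and ifelse() then collapses the two columns into one.

```r
# Toy copy of the dataframe shown above
df <- data.frame(
  id    = 1:8,
  owner = c("AAAAA", "BBBBB", "CCCCC", "DDDDD",
            "EEEEE", "FFFFF", "GGGGG", "HHHHH"),
  city  = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO",
            "BOSTON", "SEATTLE", "NEW YORK CITY", "LOS ANGELES"),
  state = c("MA", "CA", "NY", "IL", "MA", "WA", "NY", "CA"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: misspelling as the key, correct spelling as the value
lookup <- data.frame(
  city         = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"),
  city_correct = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps rows of df with no match; their city_correct is NA
merged <- merge(df, lookup, by = "city", all.x = TRUE)

# Take the correction where one exists, otherwise keep the original spelling
merged$city <- ifelse(is.na(merged$city_correct),
                      merged$city, merged$city_correct)
merged$city_correct <- NULL

# merge() sorts by the key, so restore the original row and column order
merged <- merged[order(merged$id), c("id", "owner", "city", "state")]
```

So it does work without listing the correctly spelled cities, but it takes the merge plus a collapsing step plus a re-sort.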
Use merge. Do a left_join or right_join (depending on how you arrange the datasets) followed by coalesce. Just don't do an inner join. Recall that a join is similar to merge.
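A minimal sketch of the left_join-then-coalesce idea, assuming the dplyr package is installed. The data frames are toy versions of the ones in the question, and `lookup` and `df_fixed` are hypothetical names introduced here; the lookup columns are rearranged so the misspelling is the join key.

```r
library(dplyr)

# Toy versions of the question's data
df <- data.frame(
  id   = 1:8,
  city = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO",
           "BOSTON", "SEATTLE", "NEW YORK CITY", "LOS ANGELES"),
  stringsAsFactors = FALSE
)
city <- data.frame(
  city           = c("BOSTON", "LOS ANGELES", "NEW YORK CITY", "CHICAGO"),
  city_incorrect = c("BOSTN", "LOS ANGELOS", "NYC", "CHICAGOO"),
  stringsAsFactors = FALSE
)

# Rearranged lookup (hypothetical name): key = misspelling, value = correction
lookup <- data.frame(
  city         = city$city_incorrect,
  city_correct = city$city,
  stringsAsFactors = FALSE
)

df_fixed <- df %>%
  left_join(lookup, by = "city") %>%            # keeps every row of df
  mutate(city = coalesce(city_correct, city)) %>%  # correction if present, else original
  select(-city_correct)                          # drop the helper column
```

The left join keeps unmatched rows with NA in city_correct, and coalesce() takes the first non-missing value, so correctly spelled cities pass through unchanged. This assumes each misspelling appears only once in the lookup; duplicates there would duplicate rows in the result.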