I'm currently working on generating synthetic data. I have 2 dataframes. The first dataframe has 150 records of occupation type and the associated degree:

df1

Occupation         Degree
Biologist          Masters   
Cleaner            High_School
Office Manager     Bachelor
Software Eng.      Bachelor
Data Scientist     Phd
....
Data Scientist     Masters

The other one is the main dataframe, with about 100K records:

main df:

Name         Degree
John         Masters   
Paul         High_School
Mary         Bachelor
Joseph       Bachelor
Moses        Phd
....
Helen        Masters

I want to use the first df to assign an occupation to each row of the main dataframe based on the degree the individual has, but the Degree column is not unique in either dataframe.

Is there a way in R to merge two dataframes without unique keys?

  • You can join, but no one here can tell you whether Joseph and Mary are Office Manager or Software Eng. The same goes for John: is he a Biologist or a Data Scientist? There must be some additional info that you haven't pulled. Commented Sep 8, 2019 at 14:06
  • For your sample data above, what would you give as the answer? Commented Sep 8, 2019 at 14:17

1 Answer

The data used here is given in reproducible form in the Note at the end. If a degree matches several occupations, we cannot know, in the absence of other information, which occupation to use, but we could either list them all or take one of them arbitrarily. We will list them all. Below, Occupation is a character column, but if we wished we could use c in place of toString, in which case it would be a list of character vectors.

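# Left join: pair each person with every occupation that shares their Degree.
m <- merge(main, df1, by = "Degree", all.x = TRUE)
# One row per person, with all matching occupations as a comma-separated string.
aggregate(Occupation ~ Name + Degree, m, toString)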

giving:

    Name      Degree                    Occupation
1 Joseph    Bachelor Office_Manager, Software_Eng.
2   Mary    Bachelor Office_Manager, Software_Eng.
3   Paul High_School                       Cleaner
4   John     Masters     Biologist, Data_Scientist
5  Moses         Phd                Data_Scientist
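The second option mentioned above, taking one of the matching occupations arbitrarily, could be sketched as follows. This is only one possible way to do it, not part of the original approach: the helper `pick_one`, the lookup list `by_degree`, and the seed value are names and choices introduced here for illustration. It draws one occupation at random per person from the occupations that share that person's degree, and it avoids building the full merged table, which may matter with about 100K rows.

```r
# Rebuild the sample data from the Note so this sketch is self-contained.
df1 <- data.frame(
  Occupation = c("Biologist", "Cleaner", "Office_Manager",
                 "Software_Eng.", "Data_Scientist", "Data_Scientist"),
  Degree = c("Masters", "High_School", "Bachelor",
             "Bachelor", "Phd", "Masters"),
  stringsAsFactors = FALSE
)
main <- data.frame(
  Name = c("John", "Paul", "Mary", "Joseph", "Moses"),
  Degree = c("Masters", "High_School", "Bachelor", "Bachelor", "Phd"),
  stringsAsFactors = FALSE
)

# Candidate occupations for each degree, as a named list of character vectors.
by_degree <- split(df1$Occupation, df1$Degree)

# Hypothetical helper: draw one occupation at random for a given degree,
# or return NA if the degree has no candidates in df1.
pick_one <- function(degree) {
  candidates <- by_degree[[degree]]
  if (is.null(candidates)) return(NA_character_)
  # Index with sample.int so that a single candidate is returned as-is.
  candidates[sample.int(length(candidates), 1)]
}

set.seed(1)  # arbitrary seed, only to make the draw repeatable
main$Occupation <- vapply(main$Degree, pick_one, character(1),
                          USE.NAMES = FALSE)
main
```

Each person gets exactly one occupation. Degrees with a single candidate (High_School and Phd in the sample data) always receive that candidate, while the others vary with the seed.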

Note

Lines1 <- "Occupation         Degree
Biologist          Masters   
Cleaner            High_School
Office_Manager     Bachelor
Software_Eng.      Bachelor
Data_Scientist     Phd
Data_Scientist     Masters"

Lines.main <- "Name         Degree
John         Masters   
Paul         High_School
Mary         Bachelor
Joseph       Bachelor
Moses        Phd"

df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE)
main <- read.table(text = Lines.main, header = TRUE, as.is = TRUE)

1 Comment

Yes, some information is missing and is not available to me. However, I am aggregating the occupations per degree type and then assigning one to each visa type using sample(N, splitstr(df$Occupation)). Thanks.
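One possible reading of the approach in this comment, as a rough sketch rather than the commenter's actual code: aggregate the occupations into one string per degree, split each string back into its parts with base R's `strsplit`, and sample one part per row. The names `lookup` and `choices` and the seed are assumptions made here, and the sketch assumes that every Degree in main also appears in df1.

```r
# Sample data, rebuilt as in the Note so this block runs on its own.
df1 <- read.table(text = "Occupation Degree
Biologist Masters
Cleaner High_School
Office_Manager Bachelor
Software_Eng. Bachelor
Data_Scientist Phd
Data_Scientist Masters", header = TRUE, as.is = TRUE)
main <- read.table(text = "Name Degree
John Masters
Paul High_School
Mary Bachelor
Joseph Bachelor
Moses Phd", header = TRUE, as.is = TRUE)

# One comma-separated string of occupations per degree.
lookup <- aggregate(Occupation ~ Degree, df1, toString)

# Split each string back into its individual occupations,
# and name the list elements by degree for lookup.
choices <- strsplit(lookup$Occupation, ", ", fixed = TRUE)
names(choices) <- lookup$Degree

set.seed(42)  # arbitrary seed, only for repeatability
# Assumes every Degree in main also appears in df1.
main$Occupation <- vapply(
  choices[main$Degree],
  function(x) x[sample.int(length(x), 1)],
  character(1),
  USE.NAMES = FALSE
)
main
```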
