I have the following DataFrame with string columns (a small example; the original has more columns and rows):

import numpy as np
import pandas as pd

p = pd.DataFrame([
                   {'ID': 1,'col_1': 'pluto', 'col_2':'saturn,neptune,uranus,saturn,eris,haumea', 'col_3':'saturn,neptune,uranus,haumea,makemake,ceres','col_4':'mars,venus,planet x,earth','col_5':'sun'}, 
                   {'ID': 2,'col_1': 'sun, earth', 'col_2':'earth,venus,,jupyter,bennu,apophis', 'col_3':'bennu,apophis,vesta,eros,didymos','col_4':'earth,venus,other,hale-bopp','col_5':'sun'}, 
                   {'ID': 3,'col_1': 'saturn', 'col_2':'oumuamua,g1,tempel', 'col_3':'saturn','col_4':'mars','col_5':"['saturn']"},
                   {'ID': 4,'col_1': 'mercury, itokawa, venus, earth', 'col_2':'mercury,venus,itokawa', 'col_3':'mercury,itokawa,saturn','col_4':'venus,other,mars,earth','col_5':'sun'},
                   {'ID': 5,'col_1': 'saturn', 'col_2':'saturn', 'col_3':'saturn','col_4':'mars,other','col_5':'sun'}
                  ])

If a value in col_1 matches a value in col_2 through col_5, I want to write that value from col_1 into a new column. If a value is found more than once, it should appear only once in the new column. How do I achieve this?

My attempt matches only when col_1 holds a single value, not when it holds multiple values:


mask = p[p.columns[2:6]].isin(p['col_1']).any(axis=1)
# if the value of col_1 is in col_2, col_3, col_4 or col_5, write the matching value to col_6, else 'xx'
p['col_6'] = np.where(mask, p['col_1'], 'xx')

Expected output in col_6:


p_new = pd.DataFrame([
                   {'ID': 1,'col_1': 'pluto', 'col_2':'saturn,neptune,uranus,saturn,eris,haumea', 'col_3':'saturn,neptune,uranus,haumea,makemake,ceres','col_4':'mars,venus,planet x,earth','col_5':'sun','col_6':'xx'}, 
                   {'ID': 2,'col_1': 'sun, earth', 'col_2':'earth,venus,,jupyter,bennu,apophis', 'col_3':'bennu,apophis,vesta,eros,didymos','col_4':'earth,venus,other,hale-bopp','col_5':'sun','col_6':'earth,sun'}, 
                   {'ID': 3,'col_1': 'saturn', 'col_2':'oumuamua,g1,tempel', 'col_3':'saturn','col_4':'mars','col_5':"['saturn']",'col_6':'saturn'},
                   {'ID': 4,'col_1': 'mercury, itokawa, venus, earth', 'col_2':'mercury,venus,itokawa', 'col_3':'mercury,itokawa,saturn','col_4':'venus,other,mars,earth','col_5':'sun','col_6':'mercury,itokawa,venus,earth', },
                   {'ID': 5,'col_1': 'saturn', 'col_2':'saturn', 'col_3':'saturn','col_4':'mars,other','col_5':'sun','col_6':'saturn'}
                  ])
  • what is the expected output? Commented May 18, 2022 at 12:35
  • I don't see an edit Commented May 18, 2022 at 12:43
  • @mozway, I was too slow. How about now? Commented May 18, 2022 at 12:44

1 Answer

You can convert the values in each row to sets: one set for col_1 and one for col_2 through col_5 combined.

df = pd.DataFrame({'col1': p['col_1'].str.split(', ?').apply(set),
                   'col2': p.filter(regex='col_[2-5]').agg(','.join, axis=1).str.split(',').apply(set)})
print(df)

                               col1  \
0                           {pluto}
1                      {sun, earth}
2                          {saturn}
3  {venus, mercury, earth, itokawa}
4                          {saturn}

                                                                                          col2
0  {makemake, uranus, ceres, saturn, mars, sun, planet x, venus, earth, eris, haumea, neptune}
1       {, jupyter, eros, sun, vesta, bennu, other, venus, apophis, earth, hale-bopp, didymos}
2                                             {oumuamua, mars, g1, saturn, ['saturn'], tempel}
3                                   {saturn, mars, sun, other, venus, earth, mercury, itokawa}
4                                                                   {sun, other, mars, saturn}

Then take the intersection of the two sets in each row and join it back into a string:

p['col_6'] = df.apply(lambda row: ','.join(row['col1'] & row['col2']), axis=1)
print(p)

   ID                           col_1  \
0   1                           pluto
1   2                      sun, earth
2   3                          saturn
3   4  mercury, itokawa, venus, earth
4   5                          saturn

                                      col_2  \
0  saturn,neptune,uranus,saturn,eris,haumea
1        earth,venus,,jupyter,bennu,apophis
2                        oumuamua,g1,tempel
3                     mercury,venus,itokawa
4                                    saturn

                                         col_3                        col_4  \
0  saturn,neptune,uranus,haumea,makemake,ceres    mars,venus,planet x,earth
1             bennu,apophis,vesta,eros,didymos  earth,venus,other,hale-bopp
2                                       saturn                         mars
3                       mercury,itokawa,saturn       venus,other,mars,earth
4                                       saturn                   mars,other

        col_5                        col_6
0         sun
1         sun                    sun,earth
2  ['saturn']                       saturn
3         sun  venus,mercury,earth,itokawa
4         sun                       saturn
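
The set-based result above leaves col_6 empty for rows without a match and returns the values in arbitrary set order. The following is a minimal sketch of a variant, assuming that the 'xx' placeholder from the question is wanted and that the values should keep the order in which they appear in col_1. The helper name match_in_order and the shortened sample data are hypothetical and only serve as illustration.

```python
import pandas as pd

p = pd.DataFrame({
    'ID': [1, 2, 3],
    'col_1': ['pluto', 'sun, earth', 'saturn'],
    'col_2': ['saturn,neptune', 'earth,venus,,jupyter', 'oumuamua,g1'],
    'col_3': ['haumea', 'sun', 'saturn'],
})

def match_in_order(row, target='col_1', fill='xx'):
    """Hypothetical helper: values of `target` that occur in the other columns."""
    # Pool every comma-separated token from the other columns into one set.
    pool = set()
    for col in row.index:
        if col != target:
            pool.update(tok.strip() for tok in row[col].split(','))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    wanted = dict.fromkeys(tok.strip() for tok in row[target].split(','))
    hits = [tok for tok in wanted if tok and tok in pool]
    return ','.join(hits) if hits else fill

# Only pass the string columns (labels containing 'col_'), so ID is skipped.
p['col_6'] = p.filter(like='col_').apply(match_in_order, axis=1)
print(p['col_6'].tolist())  # ['xx', 'sun,earth', 'saturn']
```

A row-wise apply is slower than vectorized string methods, so for a large frame this trades speed for control over ordering and the fallback value.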

7 Comments

I guess there is something wrong with the dtypes in my original df: I get the error TypeError: unsupported operand type(s) for &: 'str' and 'str'. Anyway, it works when I try it with the minimal example.
@iittala Using your original dataframe, after generating df, is the output of df['col1'].apply(type).eq(set) all True, for both col1 and col2?
Yes, both are all True. I got rid of this error, but there is still only one value in col_6, even when there are multiple matching values. It appears that only the first value from col_1 is written to col_6 if it matches.
@iittala If so, can you confirm that the sets in col1 and col2 contain the expected separated values from the original col_1 to col_5?
Yes, col1 holds the values of col_1 and col2 holds the values of col_2 to col_5. The columns in the original df are of <class 'list'>; would it maybe be easier to work with these? I changed the dtypes.
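
If the original cells hold lists rather than comma-separated strings, as the last comment suggests, one option is to normalize every cell to a set before intersecting. This is a minimal sketch, assuming each cell is either a list of strings or a comma-separated string; the helper name to_set and the sample data are hypothetical.

```python
import pandas as pd

def to_set(cell):
    """Hypothetical helper: turn a list or a comma-separated string into a set."""
    items = cell if isinstance(cell, (list, tuple, set)) else str(cell).split(',')
    # Strip whitespace and drop empty tokens such as the one in 'venus,,jupyter'.
    return {str(item).strip() for item in items} - {''}

df = pd.DataFrame({
    'col_1': [['sun', 'earth'], 'saturn'],
    'col_2': [['earth', 'venus'], 'oumuamua,g1'],
    'col_3': ['sun', ['mars']],
})

left = [to_set(cell) for cell in df['col_1']]
# Union of the sets from every other column, row by row.
right = [set().union(*(to_set(cell) for cell in row))
         for row in df[['col_2', 'col_3']].itertuples(index=False)]
# Sort the intersection so the result is deterministic.
matches = [sorted(a & b) for a, b in zip(left, right)]
print(matches)  # [['earth', 'sun'], []]
```

Normalizing first also avoids the TypeError mentioned above, because the & operator is then always applied to two sets and never to two strings.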