Pandas look for string values of one col in multiple columns, write value if in other col in new column

Question

I have the following df with string columns (small example; the original has more columns and rows):

import numpy as np
import pandas as pd

p = pd.DataFrame([
                   {'ID': 1,'col_1': 'pluto', 'col_2':'saturn,neptune,uranus,saturn,eris,haumea', 'col_3':'saturn,neptune,uranus,haumea,makemake,ceres','col_4':'mars,venus,planet x,earth','col_5':'sun'}, 
                   {'ID': 2,'col_1': 'sun, earth', 'col_2':'earth,venus,,jupyter,bennu,apophis', 'col_3':'bennu,apophis,vesta,eros,didymos','col_4':'earth,venus,other,hale-bopp','col_5':'sun'}, 
                   {'ID': 3,'col_1': 'saturn', 'col_2':'oumuamua,g1,tempel', 'col_3':'saturn','col_4':'mars','col_5':"['saturn']"},
                   {'ID': 4,'col_1': 'mercury, itokawa, venus, earth', 'col_2':'mercury,venus,itokawa', 'col_3':'mercury,itokawa,saturn','col_4':'venus,other,mars,earth','col_5':'sun'},
                   {'ID': 5,'col_1': 'saturn', 'col_2':'saturn', 'col_3':'saturn','col_4':'mars,other','col_5':'sun'}
                  ])

If a value in col_1 matches a value in col_2 to col_5, write that value of col_1 to a new column; if a value is found more than once, it should appear only once in the new column. How do I achieve this?

This matches only rows where col_1 holds a single value, but not rows with multiple values:


mask = p[p.columns[2:6]].isin(p['col_1']).any(axis=1)
# if the value of col_1 is in col_2, col_3, col_4 or col_5, write the matching value to col_6, else 'xx'
p['col_6'] = np.where(mask, p['col_1'], 'xx')
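
A likely reason the attempt above only catches single-value rows: when `DataFrame.isin` is given a Series, it aligns on the index and compares whole cells, so a cell such as 'earth,venus' is never split into tokens. The sketch below illustrates this on a hypothetical two-row frame named `demo` (not part of the original data).

```python
import pandas as pd

# Hypothetical two-row frame: row 0 holds single values, row 1 holds comma-separated lists.
demo = pd.DataFrame({
    'col_1': ['saturn', 'sun, earth'],
    'col_2': ['saturn', 'earth,venus'],
    'col_3': ['mars', 'sun'],
})

# isin() with a Series compares each cell with the col_1 cell of the same row,
# as a whole string; nothing is split on commas.
mask = demo[['col_2', 'col_3']].isin(demo['col_1']).any(axis=1)
print(mask.tolist())
```

Row 0 matches because 'saturn' equals 'saturn' exactly; row 1 does not, because neither 'earth,venus' nor 'sun' equals the full string 'sun, earth'.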

expected output in col_6:


p_new = pd.DataFrame([
                   {'ID': 1,'col_1': 'pluto', 'col_2':'saturn,neptune,uranus,saturn,eris,haumea', 'col_3':'saturn,neptune,uranus,haumea,makemake,ceres','col_4':'mars,venus,planet x,earth','col_5':'sun','col_6':'xx'}, 
                   {'ID': 2,'col_1': 'sun, earth', 'col_2':'earth,venus,,jupyter,bennu,apophis', 'col_3':'bennu,apophis,vesta,eros,didymos','col_4':'earth,venus,other,hale-bopp','col_5':'sun','col_6':'earth,sun'}, 
                   {'ID': 3,'col_1': 'saturn', 'col_2':'oumuamua,g1,tempel', 'col_3':'saturn','col_4':'mars','col_5':"['saturn']",'col_6':'saturn'},
                   {'ID': 4,'col_1': 'mercury, itokawa, venus, earth', 'col_2':'mercury,venus,itokawa', 'col_3':'mercury,itokawa,saturn','col_4':'venus,other,mars,earth','col_5':'sun','col_6':'mercury,itokawa,venus,earth', },
                   {'ID': 5,'col_1': 'saturn', 'col_2':'saturn', 'col_3':'saturn','col_4':'mars,other','col_5':'sun','col_6':'saturn'}
                  ])

what is the expected output?
– mozway, Commented May 18, 2022 at 12:35

I don't see an edit
– mozway, Commented May 18, 2022 at 12:43

@mozway, I was too slow, now?
– iittala, Commented May 18, 2022 at 12:44

Ynjxsjmh · Accepted Answer · 2022-05-18 14:29:50Z

You can convert the values in each row to sets:

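# col1: the tokens of col_1, split on a comma followed by an optional space
# col2: the tokens of col_2 to col_5, pooled per row
df = pd.DataFrame({'col1': p['col_1'].str.split(', ?').apply(set),
                   'col2': p.filter(regex='col_[2-5]').agg(','.join, axis=1).str.split(',').apply(set)})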

print(df)

                               col1  \
0                           {pluto}
1                      {sun, earth}
2                          {saturn}
3  {venus, mercury, earth, itokawa}
4                          {saturn}

                                                                                          col2
0  {makemake, uranus, ceres, saturn, mars, sun, planet x, venus, earth, eris, haumea, neptune}
1       {, jupyter, eros, sun, vesta, bennu, other, venus, apophis, earth, hale-bopp, didymos}
2                                             {oumuamua, mars, g1, saturn, ['saturn'], tempel}
3                                   {saturn, mars, sun, other, venus, earth, mercury, itokawa}
4                                                                   {sun, other, mars, saturn}

Then take the intersection of the two sets and join it back into a string:

p['col_6'] = df.apply(lambda row: ','.join(row['col1'] & row['col2']), axis=1)

print(p)

   ID                           col_1  \
0   1                           pluto
1   2                      sun, earth
2   3                          saturn
3   4  mercury, itokawa, venus, earth
4   5                          saturn

                                      col_2  \
0  saturn,neptune,uranus,saturn,eris,haumea
1        earth,venus,,jupyter,bennu,apophis
2                        oumuamua,g1,tempel
3                     mercury,venus,itokawa
4                                    saturn

                                         col_3                        col_4  \
0  saturn,neptune,uranus,haumea,makemake,ceres    mars,venus,planet x,earth
1             bennu,apophis,vesta,eros,didymos  earth,venus,other,hale-bopp
2                                       saturn                         mars
3                       mercury,itokawa,saturn       venus,other,mars,earth
4                                       saturn                   mars,other

        col_5                        col_6
0         sun
1         sun                    sun,earth
2  ['saturn']                       saturn
3         sun  venus,mercury,earth,itokawa
4         sun                       saturn
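
The printed result leaves col_6 empty for the first row, where the question expected 'xx', and the order of a set intersection is not guaranteed. A minimal sketch of one way to address both, assuming the same column layout as `p`: the helper `ordered_matches` and its `fill` parameter are hypothetical names, and the frame below is a shortened stand-in for the question's data.

```python
import pandas as pd

# Shortened stand-in for the question's frame `p` (same column layout).
p = pd.DataFrame([
    {'ID': 1, 'col_1': 'pluto', 'col_2': 'saturn,neptune',
     'col_3': 'saturn', 'col_4': 'mars', 'col_5': 'sun'},
    {'ID': 2, 'col_1': 'sun, earth', 'col_2': 'earth,venus,,jupyter',
     'col_3': 'bennu', 'col_4': 'earth,venus', 'col_5': 'sun'},
    {'ID': 4, 'col_1': 'mercury, itokawa, venus, earth', 'col_2': 'mercury,venus,itokawa',
     'col_3': 'mercury', 'col_4': 'venus,earth', 'col_5': 'sun'},
])

other_cols = ['col_2', 'col_3', 'col_4', 'col_5']

def ordered_matches(row, fill='xx'):
    """Return the col_1 tokens found in any other column, in col_1 order, each once."""
    # Pool every token of col_2..col_5 for this row, ignoring empty tokens.
    pool = set()
    for col in other_cols:
        pool.update(tok.strip() for tok in row[col].split(','))
    pool.discard('')
    # Walk col_1 in its original order and keep each matching token only once.
    found = []
    for tok in row['col_1'].split(','):
        tok = tok.strip()
        if tok in pool and tok not in found:
            found.append(tok)
    return ','.join(found) if found else fill

p['col_6'] = p.apply(ordered_matches, axis=1)
print(p[['ID', 'col_1', 'col_6']])
```

Matches come out in the order they appear in col_1, so the second row yields 'sun,earth' rather than the 'earth,sun' shown in the question's expected output.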

7 Comments

iittala Over a year ago

I guess there is something wrong with the dtypes in my original df: I get the error TypeError: unsupported operand type(s) for &: 'str' and 'str'. With the minimal example it works, though.

Ynjxsjmh Over a year ago

@iittala Using your original dataframe, after generating df, is the output of df['col1'].apply(type).eq(set) all True, and likewise for col2?

iittala Over a year ago

Yes, both are all True. I got rid of this error, but col_6 still contains only one value, even when there are multiple matching values. It appears that only the first value from col_1 is written to col_6 when there is a match.

Ynjxsjmh Over a year ago

@iittala If so, can you check that the sets in col1 and col2 contain the expected separated values from the original col_1 to col_5?

iittala Over a year ago

Yes, col1 holds the values of col_1 and col2 holds the values of col_2 - col_5. The columns in the original df are of <class 'list'>; would it maybe be easier to work with those? I changed the dtypes.
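
If the cells already hold Python lists, as the last comment suggests, the splitting step can be skipped. A minimal sketch under that assumption: the frame `q`, its contents and the helper `list_matches` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical frame in which every cell already holds a list of strings.
q = pd.DataFrame({
    'col_1': [['sun', 'earth'], ['pluto']],
    'col_2': [['earth', 'venus'], ['saturn']],
    'col_3': [['sun'], ['neptune']],
})

other_cols = ['col_2', 'col_3']

def list_matches(row):
    """Join the col_1 items that occur in any other column, in col_1 order, each once."""
    pool = set()
    for col in other_cols:
        pool.update(row[col])
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return ','.join(tok for tok in dict.fromkeys(row['col_1']) if tok in pool)

q['col_6'] = q.apply(list_matches, axis=1)
print(q['col_6'].tolist())
```

The result is joined back into a string so that col_6 has the same shape as in the answer; rows without a match get an empty string here.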
