Replacing strings in Pandas DataFrame column with array of entries based on dict

Question

I have a DataFrame such as:

     tag1   other
0    a,c      foo
1    b,c      foo
2    d        foo
3    a,a      foo

The entries in tag1 are comma-delimited strings.

And a dict of definitions for each tag such as:

dict = {'a' : 'Apple',
'b' : 'Banana',
'c' : 'Carrot'}

I would like to replace a, b, and c with their definitions, but delete any row containing a tag that is not in that dict (i.e. d). Furthermore, I'd like to ensure there are no duplicates within a row, such as row index 3 in the example dataset.

What I have so far:

df.tag1 = df.tag1.str.split(',')
for index, row in df.iterrows():
    names = []
    for tag in row.tag1:
        if tag == dict[tag]:
            names.append(dict[tag])
        else:
            df.drop(df.index[index])

From there I would replace the original column with the values in names. To remove duplicates, I am thinking of iterating over the array and checking whether each value matches the next, and if so, deleting it. However, this is not working and I am a bit stumped. The desired output would look like (with strings in unicode):

     tag1                     other
0    ['Apple', 'Carrot']      foo
1    ['Banana', 'Carrot']     foo
3    ['Apple']                foo

what does the desired output look like?
– spies006, Commented May 31, 2017 at 19:40

I have edited that in, thanks.
– kalle, Commented May 31, 2017 at 19:44

piRSquared · Accepted Answer · 2017-05-31 20:09:46Z

For my entry into the longest one-liner competition:

m = {
    'a' : 'Apple',
    'b' : 'Banana',
    'c' : 'Carrot'
}

df.tag1.str.split(',', expand=True) \
  .stack().map(m).groupby(level=0) \
  .filter(lambda x: x.notnull().all()) \
  .groupby(level=0).apply(lambda x: x.drop_duplicates().str.cat(sep=',')) \
  .to_frame('tag1').join(df.other)

            tag1 other
0   Apple,Carrot   foo
1  Banana,Carrot   foo
3          Apple   foo
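
For readers who want to follow the chain above stage by stage, here is a minimal sketch that names each intermediate result. The sample `df` is rebuilt from the question so the block runs on its own, and the names `stacked`, `mapped`, `kept`, `joined`, and `result` are made up for illustration. The extra `.dropna()` after `stack()` is an addition, not part of the original chain: it is there on the assumption that some pandas versions keep the empty slots of shorter rows when stacking.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({'tag1': ['a,c', 'b,c', 'd', 'a,a'], 'other': ['foo'] * 4})
m = {'a': 'Apple', 'b': 'Banana', 'c': 'Carrot'}

# 1. One column per tag position, then one entry per (row, position).
#    dropna() discards the padding created for rows with fewer tags.
stacked = df['tag1'].str.split(',', expand=True).stack().dropna()

# 2. Translate each tag; tags that are not keys of m become NaN.
mapped = stacked.map(m)

# 3. Keep only the original rows in which every tag was translated.
kept = mapped.groupby(level=0).filter(lambda x: x.notnull().all())

# 4. Within each row, drop repeated names and join them into one string.
joined = kept.groupby(level=0).apply(
    lambda x: x.drop_duplicates().str.cat(sep=','))

# 5. Restore the column name and bring back the untouched column.
result = joined.to_frame('tag1').join(df['other'])
print(result)
```

Step 3 is what removes row 2: its only tag, d, maps to NaN, so the whole group fails the filter.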

But seriously, this is probably a better solution:

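import numpy as np
import pandas as pd

# Split every cell on commas; np.char.split is the public name for
# np.core.defchararray.split.
a = np.char.split(df.tag1.values.astype(str), ',')
lens = [len(s) for s in a]          # number of tags per row
b = np.concatenate(a)               # all tags, flattened
c = [m.get(k, np.nan) for k in b]   # unknown tags become NaN
i = df.index.values.repeat(lens)    # original row label for each tag
s = pd.Series(c, i)

def proc(x):
    # Returns None (dropped below) if any tag in the row was unknown.
    if x.notnull().all():
        return x.drop_duplicates().str.cat(sep=',')

s.groupby(level=0).apply(proc).dropna().to_frame('tag1').join(df.other)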

            tag1 other
0   Apple,Carrot   foo
1  Banana,Carrot   foo
3          Apple   foo
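
Both snippets above return comma-joined strings, whereas the question asks for a list in each cell. As a hedged alternative, not part of the original answer, the sketch below uses a plain Python helper to build those lists directly. The helper name `translate` and the variable `out` are hypothetical, and the code assumes the tags contain no surrounding whitespace.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({'tag1': ['a,c', 'b,c', 'd', 'a,a'], 'other': ['foo'] * 4})
m = {'a': 'Apple', 'b': 'Banana', 'c': 'Carrot'}

def translate(cell):
    """Return the de-duplicated list of names, or None if any tag is unknown."""
    tags = cell.split(',')
    if not all(tag in m for tag in tags):
        return None
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return list(dict.fromkeys(m[tag] for tag in tags))

# Rows whose cell came back as None are removed by dropna.
out = df.assign(tag1=df['tag1'].map(translate)).dropna(subset=['tag1'])
print(out)
```

This trades vectorization for readability, so it may be slower than the NumPy version on large frames.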

edited May 31, 2017 at 20:09

answered May 31, 2017 at 19:49

piRSquared

Scott Boston Over a year ago

Now, that is some high kicking fruit-fu!

piRSquared Over a year ago

@DmitryPolonskiy if high score wins... then sure :-)

kalle Over a year ago

I'm printing this out and putting this above my work station to cherish this. Thanks!

kalle Over a year ago

It appears that I am only getting one definition per entry in tag1 with this code, even if it contains many tags.

piRSquared Over a year ago

@Kam are your definitions as you represented them? Or do they include integers?
