I have a DataFrame such as:

     tag1   other
0    a,c      foo
1    b,c      foo
2    d        foo
3    a,a      foo

The entries in tag1 are strings of tags delimited by commas.

And a dict of definitions for each tag such as:

dict = {'a' : 'Apple',
'b' : 'Banana',
'c' : 'Carrot'}

I would like to replace the tags a, b, and c with their definitions, but delete any row that contains a tag not in that dict (i.e. d). Furthermore, I'd like to ensure there are no duplicates within a row, such as in row index 3 of the example dataset.

What I have so far:

df.tag1 = df.tag1.str.split(',')
for index, row in df.iterrows():
    names = []
    for tag in row.tag1:
        if tag == dict[tag]:
            names.append(dict[tag])
        else:
            df.drop(df.index[index])

From there I would replace the original column with the values in names. To remove duplicates, I am thinking of iterating over the array and checking whether the current value matches the next, and if so, deleting it. However, this is not working and I am a bit stumped. The desired output would look like this (with strings in unicode):

     tag1                     other
0    ['Apple', 'Carrot']      foo
1    ['Banana', 'Carrot']     foo
3    ['Apple']                foo
  • what does the desired output look like? Commented May 31, 2017 at 19:40
  • I have edited that in, thanks. Commented May 31, 2017 at 19:44

1 Answer

For my entry into the longest one-liner competition:

m = {
    'a' : 'Apple',
    'b' : 'Banana',
    'c' : 'Carrot'
}

df.tag1.str.split(',', expand=True) \
  .stack().map(m).groupby(level=0) \
  .filter(lambda x: x.notnull().all()) \
  .groupby(level=0).apply(lambda x: x.drop_duplicates().str.cat(sep=',')) \
  .to_frame('tag1').join(df.other)

            tag1 other
0   Apple,Carrot   foo
1  Banana,Carrot   foo
3          Apple   foo
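
The chain above is dense, so here is a rough sketch of the same steps split into named pieces. It rebuilds the question's example frame so that it runs on its own. The intermediate names (exploded, mapped, kept, joined, result) are mine rather than the answer's, and Series.explode is swapped in for split(..., expand=True).stack() as an assumption-free way to get one tag per row while keeping the original row labels.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({'tag1': ['a,c', 'b,c', 'd', 'a,a'], 'other': ['foo'] * 4})
m = {'a': 'Apple', 'b': 'Banana', 'c': 'Carrot'}

# One tag per row; the index repeats the original row label
exploded = df.tag1.str.split(',').explode()

# Translate each tag; anything missing from m becomes NaN
mapped = exploded.map(m)

# Keep only the original rows whose tags were all found in m
kept = mapped.groupby(level=0).filter(lambda x: x.notnull().all())

# Drop repeated names within each row, then join them into one string
joined = kept.groupby(level=0).apply(
    lambda x: x.drop_duplicates().str.cat(sep=',')
)

# Bring back the untouched column; rows that were filtered out stay out
result = joined.to_frame('tag1').join(df.other)
```

Printing each intermediate Series in turn is a quick way to see where row 2 disappears (at the filter step) and where row 3 collapses to a single name (at drop_duplicates).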

But seriously, this is probably a better solution:

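import numpy as np
import pandas as pd

# split each comma-delimited string into a list of tags
a = np.char.split(df.tag1.values.astype(str), ',')
lens = [len(s) for s in a]
b = np.concatenate(a)
# look up each tag; tags missing from m become NaN
c = [m.get(k, np.nan) for k in b]
# repeat each row label once per tag that row contained
i = df.index.values.repeat(lens)
s = pd.Series(c, i)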

def proc(x):
    if x.notnull().all():
        return x.drop_duplicates().str.cat(sep=',')

s.groupby(level=0).apply(proc).dropna().to_frame('tag1').join(df.other)

            tag1 other
0   Apple,Carrot   foo
1  Banana,Carrot   foo
3          Apple   foo
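
Both versions above return comma-joined strings, whereas the question's desired output shows lists. If lists are what is needed, one possible sketch is a plain-Python helper applied per cell. This is not the answer's method: translate and out are hypothetical names, and dict.fromkeys is used to drop duplicates while keeping first-seen order, which relies on dicts preserving insertion order (Python 3.7+).

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({'tag1': ['a,c', 'b,c', 'd', 'a,a'], 'other': ['foo'] * 4})
m = {'a': 'Apple', 'b': 'Banana', 'c': 'Carrot'}

def translate(cell):
    """Return the de-duplicated list of names, or None if any tag is unknown."""
    tags = cell.split(',')
    if not all(tag in m for tag in tags):
        return None
    # dict.fromkeys keeps the first occurrence of each name, in order
    return list(dict.fromkeys(m[tag] for tag in tags))

# Rows with an unknown tag get None and are removed by dropna
out = df.assign(tag1=df.tag1.map(translate)).dropna(subset=['tag1'])
```

On the example frame this keeps rows 0, 1 and 3, with row 3 reduced to a one-element list. It will likely be slower than the vectorised versions on large frames, but it avoids mutating the frame while iterating over it.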

7 Comments

Now, that is some high kicking fruit-fu!
@DmitryPolonskiy if high score wins... then sure :-)
I'm printing this out and putting this above my work station to cherish this. Thanks!
It appears that I am only getting one definition per entry in tags1 with this code even if it contains many tags.
@Kam are your definitions as you represented them? Or do they include integers?