I have a dataframe with several columns, one of which contains strings. I want to loop through this column, transform each value, and save the transformed values in a new column.

The code I have written so far looks like this:

def get_classes(x):    
    for index, string in df['column'].iteritems():
        listi = string.split(',')
        Classes=[]

        for value in listi:
            count=listi.count(value)
            if count >= 3: 
                Classes.append(value)

        Unique=(',').join(sorted(list(set(Classes))))
        df['NewColumn']=Unique


End.apply(get_classes)

It loops through the rows of df['column'], splits each string at every comma (creating a list called listi), and creates an empty list called Classes. It then counts each value in listi and appends it to Classes if it occurs at least three times in the list. The finished list is passed through set() so that all entries are unique, then sorted, and finally joined with commas into a string again. I then want to store this string in a new column, at the same index position as the row it was derived from. For example:

df
  column    NewColumn
0 A,A,A,C   A 
1 C,B,C,C   C
2 B,B,B,B   B

My code seems to work fine when I do print Unique instead of df['NewColumn']=Unique: it then prints all the transformed values. If I execute the code as shown above, however, NewColumn is completely filled with the same value, which seems to correspond to the original value of the last row in the df. Can someone explain what the problem is?

There is an indexing issue: on each iteration your code assigns Unique to the entire column 'NewColumn', so the column is overwritten for every row. That is why every row ends up with the value from the last row. Commented Dec 2, 2015 at 10:53
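The overwrite described in this comment can be reproduced with a minimal sketch. The helper name frequent_classes and its min_count parameter are hypothetical, introduced only for illustration, and the sample data is the three-row example from the question. It assumes a pandas version that provides Series.items().

```python
import pandas as pd


def frequent_classes(string, min_count=3):
    """Return the sorted, comma-joined values that occur at least min_count times."""
    values = string.split(",")
    return ",".join(sorted({v for v in values if values.count(v) >= min_count}))


df = pd.DataFrame({"column": ["A,A,A,C", "C,B,C,C", "B,B,B,B"]})

# Assigning a scalar to a column broadcasts it to every row, so doing this
# inside a loop leaves only the result of the final iteration in the column.
for index, string in df["column"].items():
    df["NewColumn"] = frequent_classes(string)
overwritten = df["NewColumn"].tolist()

# Returning one result per row and assigning the whole Series at once
# keeps each result aligned with the row it was computed from.
df["NewColumn"] = df["column"].apply(frequent_classes)
fixed = df["NewColumn"].tolist()

print(overwritten)
print(fixed)
```

If the explicit loop is preferred, writing to a single cell with df.loc[index, 'NewColumn'] instead of the whole column should also avoid the overwrite.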

1 Answer

You can use the powerful Counter class from the collections module:

from collections import Counter

# dict.items() works on both Python 2 and 3; iteritems() exists only on Python 2
foo = lambda x: ','.join(sorted([k for k,v in Counter(x).items() if v>=3]))

df['new'] = df['column'].str.split(',').map(foo)


#In [33]: df
#Out[33]:
#    column NewColumn new
#0  A,A,A,C         A   A
#1  C,B,C,C         C   C
#2  B,B,B,B         B   B

2 Comments

Thank you, this works fine. But do you have any idea why my code does not work the way I want it to, and what I should change to make it work?
I strongly recommend using Counter, since it decouples the function itself from the loop over the dataframe (which makes the function easy to unit test), and it is also neater and easier to understand: two lines.
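To illustrate the point about decoupling, here is a rough sketch of the same logic written as a named function that can be checked on plain lists, without any dataframe. The names frequent_values and min_count are hypothetical and not part of the answer.

```python
from collections import Counter


def frequent_values(values, min_count=3):
    """Return the sorted, comma-joined items that occur at least min_count times."""
    counts = Counter(values)
    return ",".join(sorted(k for k, v in counts.items() if v >= min_count))


# The function can be exercised on plain lists, with no dataframe involved.
assert frequent_values(["A", "A", "A", "C"]) == "A"
assert frequent_values(["C", "B", "C", "C"]) == "C"
assert frequent_values(["A", "B"]) == ""
assert frequent_values(list("BBBAAA")) == "A,B"
```

Under these assumptions it would be applied to the dataframe in the same way as foo, i.e. df['new'] = df['column'].str.split(',').map(frequent_values).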
