0

I have a pandas column in which each row holds multiple comma-separated values. I would like to obtain the set of unique values across the whole column. For example:

From:

+-------------------------------------------+
|                  Column                   |
+-------------------------------------------+
| 300000,50000,500000,100000,1000000,200000 |
| 100000,1000000,200000,300000,50000,500000 |
|                                       ... |
+-------------------------------------------+

To:

+--------+
| Column |
+--------+
|  50000 |
| 100000 |
| 200000 |
| 300000 |
|    ... |
+--------+

Thank you very much

8
  • It is a single column. Try df['Column'].apply(set, 1)? Commented Nov 1, 2019 at 5:33
  • Does order matter? Commented Nov 1, 2019 at 5:34
  • @Chris the apply set gives me the set of each row. But it won't give me the unique value for all columns Commented Nov 1, 2019 at 5:36
  • @cs95 Order doesn't really matter. It can be sorted at the very end as part of the postprocessing procedure Commented Nov 1, 2019 at 5:36
  • So you mean you have multiple columns in a similar fashion? perhaps updating your sample data would make the question more understandable then :) Commented Nov 1, 2019 at 5:37

2 Answers

3

This:

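>>> import pandas as pd
>>> data = {'column': ["300000,50000,500000,100000,1000000,200000", "100000,1000000,200000,300000,50000,500000"]}
>>> df = pd.DataFrame(data)
>>> # split on commas, one value per row, cast to int, dedupe, sort, then renumber the index
>>> df.column.str.split(',').explode().astype(int).drop_duplicates().sort_values().reset_index(drop=True).to_frame()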

Outputs:

    column
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
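
The chain above assumes every row is a non-missing string. As a minimal sketch, assuming missing rows (NaN or None) should simply be skipped, the same steps could be wrapped in a helper. The name `unique_values` and the sample data are hypothetical, not part of the answer:

```python
import pandas as pd

def unique_values(series: pd.Series) -> pd.Series:
    """Return the sorted unique integers found in a Series of comma-separated strings."""
    return (
        series.dropna()                # assumption: skip NaN / None rows
              .str.split(',')          # each row becomes a list of strings
              .explode()               # one value per row (needs pandas >= 0.25)
              .astype(int)
              .drop_duplicates()
              .sort_values()
              .reset_index(drop=True)  # renumber 0..n-1 after sorting
    )

# Hypothetical sample data with one missing row
df = pd.DataFrame({'column': ["300000,50000,500000", None, "100000,50000"]})
result = unique_values(df['column'])
print(result.tolist())
```

Calling `.to_frame()` on the result would give a one-column DataFrame if that shape is preferred.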

5 Comments

Good Attempt @luigigi
@Vishnudev yes but it assumes the data to be a column of lists; OP says they're not. Good attempt, but still not completely right. (not the downvoter)
Not sure why I was downvoted. Just stating facts. :)
Yes, it should be data = {'column' : "300000,50000,500000,100000,1000000,200000","100000,1000000,200000,300000,50000,500000"}
Updated @Winston
1

A pure pandas solution is likely to be slower on large data. The idea is to create a Series with split and stack, remove duplicates, convert to integers, and sort:

df = (df['Column'].str.split(',', expand=True)
                  .stack()
                  .drop_duplicates()
                  .astype(int)
                  .sort_values()
                  .reset_index(drop=True)
                  .to_frame('col'))
print (df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000

Or use a comprehension that flattens the split lists and converts the values to integers, pass the result to set and sorted, and finally build a DataFrame. This solution should be faster on a large DataFrame:

# works only if there are no missing values (no NaN, no None)
L = sorted(set([int(y) for x in df['Column'] for y in x.split(',')]))

# solution 1: skips NaN (NaN != NaN, so x == x is False for it)
L = sorted(set([int(y) for x in df['Column'] if x == x for y in x.split(',')]))

# solution 2: skips None
L = sorted(set([int(y) for x in df['Column'] if x is not None for y in x.split(',')]))

# solution 3: skips both NaN and None
L = sorted(set([int(y) for x in df['Column'] if pd.notna(x) for y in x.split(',')]))

df = pd.DataFrame({'col':L})
print (df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
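
For reference, the comprehension approach can also be sketched without a DataFrame at all. This is a minimal, hedged example: `unique_ints` is a hypothetical helper name, and treating every non-string entry (such as NaN or None) as missing is an assumption rather than something stated in the answer:

```python
def unique_ints(rows):
    """Return sorted unique integers from an iterable of comma-separated strings.

    Assumption: entries that are not strings (e.g. None or float NaN) are skipped.
    """
    return sorted({int(y) for x in rows if isinstance(x, str) for y in x.split(',')})

# Hypothetical sample rows, including both kinds of missing value
rows = ["300000,50000,500000", None, float("nan"), "100000,50000"]
print(unique_ints(rows))  # [50000, 100000, 300000, 500000]
```

Any iterable works here, so a column such as `df['Column']` could be passed in directly and the returned list handed to `pd.DataFrame({'col': ...})`.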

8 Comments

I am not sure but I can't get the second one running
@Winston - Maybe some missing values, can you try L = sorted(set([int(y) for x in df['Column'] if x == x for y in x.split(',')])) ?
Yes you are right. I need to add dropna() at the end of df['Column']. Solution #3 works
Would you mind taking a look at the other question? stackoverflow.com/questions/58640228/… This question is just part of the goal I want to achieve (if grouping is the way)
I guess the groupby solution would be good enough. But I still have problem finding the right code to do that transformation